Starting with version 2.5.0, OVERFLOW supports running on NVIDIA GPUs (tested on V100, A100, H100, and GH200). Since this is the first release providing this capability, many options, described below, are not yet supported. While we have done extensive internal testing, along with some testing by external beta users, you may still encounter issues (primarily with building and running the code on your system). If you encounter issues or need help, please reach out to overflow@lists.nasa.gov or larc-overflow-devs@mail.nasa.gov.
The GPU implementation uses a mix of OpenACC, CUDA Fortran, and CUDA C++ to run on the GPU. For more information about the development process and how the code is organized, see AIAA 2024-0042. While many things have changed since then, the basics still apply.
Note: While running on the GPUs is relatively simple once everything is set up and working, getting to that state is not trivial. While we can help with specific issues related to building and running OVERFLOW, we will not be able to help you configure your computer systems properly. If you run into poor performance, missing compilers, or other issues with your system, please reach out to your system administrators or hardware vendors.
Compiling:
*** If running on NASA's systems at Ames or Langley there are pre-built ***
*** versions of the code that can be used. Please contact ***
*** larc-overflow-devs@mail.nasa.gov to get access to these modules. A ***
*** signed Software Usage Agreement or Acknowledgement of Receipt is ***
*** required in order to gain access. When using these pre-built versions ***
*** there is no need to worry about this section about compiling! ***
We are currently only supporting NVIDIA GPUs and the NVIDIA compilers. The NVIDIA compilers (including older versions) are available for free as part of their HPC SDK, nvhpc, from https://developer.nvidia.com/hpc-sdk-releases. During the development process, many compiler bugs have been found, reported, and fixed (or worked around). Because of this, it is highly recommended to use a known good version of nvhpc. As of 5/7/2025 these include nvhpc 24.7, 24.11, and 25.1.
Note: There have been regressions in some versions of the NVIDIA compilers, so nvhpc versions greater than 25.1 are not guaranteed to work.
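For example, on systems that provide the compilers through environment modules, picking up one of the known good versions might look something like the lines below; the module name and version string are placeholders and will vary by system:
module load nvhpc/24.11
nvfortran -V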
In order to run on multiple GPUs you will need to build with MPI. While not required, it is highly recommended that the MPI implementation be GPU-aware, sometimes referred to as CUDA-aware. Nearly all MPI implementations (including NVIDIA's implementation shipped as part of nvhpc) are now capable of this, but they likely need to be configured specifically for your system to work correctly. Unfortunately, we will not be able to help you get this working for your systems; instead, reach out to your system administrators or hardware vendor to see if they can help. If you do not have a GPU-aware implementation, the code can still run, but it may be slower on two GPUs than on one.
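As a hedged example, if your MPI library happens to be Open MPI, one common way to check whether it was built with CUDA (GPU-aware) support is the query below; other MPI implementations report this differently, so consult their documentation:
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value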
To build the GPU version of the code, follow the same CMake build process described in README.cmake.txt. However, an additional configure flag (-DGPU) is required to turn on the GPU code. If configuring with cmakeall, add the -g flag to turn on the GPU version of the code. So the cmakeall command would be:
cmakeall -g nvidia
If configuring by hand, just add -DGPU=True to your CMake configure command.
The compiler also needs to know which specific GPU architectures to build for. This is also specified during the CMake configure step through the -DCUDA_ARCH option. This option takes a semicolon-separated list of architectures to build the executable for. Only specify the architectures that you actually want to target, since adding more than one can significantly increase the compile time.
Here is a list of relevant architecture (compute capability) numbers for various GPU generations:
- Volta (V100): 70
- Ampere (A100): 80
- Hopper (H100): 90
OVERFLOW does not support any GPUs before the V100 (compute capability 70). A more complete list can be found on NVIDIA's website: https://developer.nvidia.com/cuda-gpus. If you do not specify this option at configure time, the code will compile for both V100 and A100 using -DCUDA_ARCH="70;80". With cmakeall, the architecture is specified by passing a compute capability value to the -g option (repeat the option for multiple architectures). For example, to build for A100 and H100 you would run:
cmakeall -g 80 -g 90 nvidia
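If configuring by hand instead, an equivalent configure might look something like the line below; the source path is a placeholder, the build type matches the note later in this section, and depending on your environment you may also need to point CMake at the NVIDIA compilers (for example with -DCMAKE_Fortran_COMPILER=nvfortran):
cmake -DGPU=True -DCUDA_ARCH="80;90" -DCMAKE_BUILD_TYPE="Custom" /path/to/overflow/source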
If you do not build with the correct compute capability, the code will not be able to run and will usually report the compute capability it is expecting.
Note: Some files in the GPU path take a long time to compile, specifically some of the heavily templated C++ files. For this reason, parallel builds should always be performed. The build process may also appear to pause for a while as it waits for these files to finish building.
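For example, a generic parallel build from the build directory can be launched with something like the following (adjust the job count to your machine):
cmake --build . --parallel 8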
Note: If the build fails with errors like:
- "NVFORTRAN-S-0034-Syntax error at or near …"
- "NVFORTRAN-S-0000-Internal compiler error. check_member: member arrived with wrong derived type"
Ensure that you are configuring with -DCMAKE_BUILD_TYPE="Custom". If you are specifying your own flags (not recommended) then you need to ensure you have all of the flags needed for OpenACC, CUDA Fortran, and CUDA C.
You can run the GPU tests by running
ctest
in the build directory after building the code. These tests check that the GPU version of various routines matches the CPU version of those routines. All of the tests should pass, except for gpu_rhs_upwind, which should be skipped (see comments in that file for why). These are also good checks to ensure you can properly run the code on a GPU. Keep in mind that these tests only run on a single GPU, so they do not test whether you can run on multiple GPUs properly.
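If a test does fail, it can help to see the failing output directly; this is standard ctest usage rather than anything OVERFLOW-specific:
ctest --output-on-failure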
Note: Each GPU system seems to be unique. We have tried to ensure that the code works on all of the systems available to us (and some external systems). However, your system is probably different and we cannot provide specific support for your system.
Note: Compiling for GPUs is not trivial and there are lots of moving parts: the compiler version, the CUDA version distributed with the compiler (it is advised to download and install the nvhpc version that is bundled with previous CUDA versions for the most compatibility), and the CUDA version supported by your GPU driver. All of these versions need to match up. There is some backward compatibility (a GPU with CUDA 12.2 can run code compiled with 11.0, but not the other way around), but it is best if everything matches.
Note: Since nvfortran sets the endianness at compile time, you can either specify -byteswapio in your flags or set -DSWAPENDIAN=True in your CMake configure command. The default behavior is to not add this flag (little-endian on most modern systems).
Note: If you do not want to compile with GPU support just set -DGPU=False (and you can build the normal CPU version with any compiler).
Running:
The executables (overflow and overflowmpi) are capable of running both the CPU and GPU paths through the code. To use the GPU path, set USE_GPU=.T. in the &GLOBAL namelist.
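A minimal sketch of that input is shown below; everything other than USE_GPU is just a placeholder for whatever your case already uses in &GLOBAL:
 &GLOBAL
    NSTEPS = 1000,        ! existing inputs for your case (placeholder)
    USE_GPU = .T.,        ! run the GPU path
 /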
A good rule of thumb is to put approximately 10-15 million grid points on each GPU; for example, a 150 million point case would typically run on 10 to 15 GPUs. This seems to be the sweet spot, ensuring that there are enough points to keep each GPU busy while still running on as many GPUs as possible. More GPUs can be used, but with diminishing returns. At the other end of the scale, depending on the settings used, a single GPU can fit up to 2 million points per GB of GPU memory (so a 32GB V100 card can solve a 66 million point problem). However, if more expensive schemes are used, such as ILHS=6 with FSONWT=2.5, the GPU can only fit about 900 thousand points per GB of GPU memory.
Note: If you try to run a problem that is too big you will get errors that look like:
Out of memory allocating XXX bytes of device memory
… print out of present table contents …
Accelerator Fatal Error:
call to cuMemAlloc returned error 2
(CUDA_ERROR_OUT_OF_MEMORY): Out of memory
Single GPU:
To run on a single GPU you can use overrun or overflow just as you would in previous versions of the code.
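For example, assuming the same case/input naming convention used in the multi-GPU example below (inputs in case.1.inp), a single-GPU run would simply be:
overrun case 1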
Multiple GPU:
When running on multiple GPUs please use the overrunmpi_gpu script instead of the overrunmpi script. While the latter may work on some systems, there is likely other set-up required that overrunmpi_gpu performs. When running on multiple GPUs, OVERFLOW is set up to run with one MPI rank per GPU card. For example, if your system has 4 GPUs, you would want to launch 4 MPI ranks. This can be accomplished by running:
overrunmpi_gpu -n 4 case 1
overrunmpi_gpu launches another wrapper script in parallel (overgpu.sh). This script attempts to configure the run to properly pin the CPU ranks close to each GPU for optimal performance. It selects the GPU to run on based on the local MPI rank (local rank 3 gets GPU 3), then pins that MPI rank and its OpenMP threads to the CPU that is attached to the same PCI bus as the selected GPU. This ensures that CPU-to-GPU communication is as fast as possible. To verify that this script is doing the appropriate thing, you can run test_gpu_pinning.sh (from tools/gpu/), following the instructions at the top of that file. If this script does not work on your system, please email larc-overflow-devs@mail.nasa.gov so it can be updated. If you want to change how the pinning works, you can write your own wrapper script and set the environment variable OVERGPUEXE to that script when running overrunmpi_gpu. The overgpu_demo.sh script can be used as a starting point for such a script.
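For example, pointing OVERGPUEXE at your own wrapper (the path below is just a placeholder for your script) and then launching as usual would look like:
export OVERGPUEXE=/path/to/my_pinning_wrapper.sh
overrunmpi_gpu -n 4 case 1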
If you do not want to use a wrapper script at all, set OVERGPUEXE="NONE". Most Slurm systems will do the appropriate pinning for you, so you likely will not need this script.
Note: When trying to pin, make sure the job has access to all of the CPUs in the node. In PBS, for example, all of the CPUs should be requested; for a node with 40 cores and 4 GPUs this would look like:
> qsub -l select=1:ncpus=40:mpiprocs=4:ngpus=4
If the job does not have access to all of the NUMA domains, the script may not be able to pin that MPI rank correctly and the performance may be lacking for that case.
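For comparison, on a Slurm system a whole-node request might look something like the line below; this is a generic Slurm sketch (the script name is a placeholder and the exact options depend on how your cluster is configured), not an OVERFLOW-specific requirement:
> sbatch --nodes=1 --ntasks-per-node=4 --cpus-per-task=10 --gpus-per-node=4 job_script.sh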
Note: If you want to run on multiple GPUs you will need to build with MPI (even if on the same node). The MPI implementation will need to be properly configured to be GPU-aware. We have set the various environment flags we are aware of in overrunmpi_gpu. If something else needs to be set in your environment please let us know so that we can add it to future versions of the code (or at least be aware of it if we get questions in the future).
An example test case is provided in tests/robin_sym/gpu.1.inp. See the README for that case for more information and performance characteristics of this case.
Supported Options:
Work is ongoing on the GPU version of the code. Thus, not all capabilities of the CPU path have been implemented on the GPU. The code will check the inputs before running and exit if you have requested an option that is not allowed. Instead of listing the options that aren’t ported to the GPU it is simpler to
describe what is:
IRHS = 0,3,4,5,6
ILHS = 2,6
FSO = 2,3,4,5 (Only integer values supported for IRHS=0)
IDISS = 2,3,4
IBTYPE = All except:
2,4,6,8,11-13,42,45,49,61,70,71,107,108,134,142,148,151,153,201
IGAM = 0
ISPEC = 2
NQT = 0, 102
NDIRK = 0
ICC = 0,1
Newton subiterations with any FSONWT, ORDNWT, NITNWT
Prescribed grid motion
The following are not currently supported on the GPU, but support is planned in
the future (but not necessarily in this order):
- GLOBAL_LINEAR_SOLVE
- ILHS=7,8,16,17,18,26,27,28
- 2D and axisymmetric grids (although the process for these may change)
- Low-Mach preconditioning
- CFL Ramping
- QCR
- Wall functions
- Limiting for MUT
- DES
- Rotation Corrections
- Turbulence Regions
- Grid Sequencing
- Geometric multi-grid
- Species convection
- CDISC
- CL Driver
- Gravity terms
- Adaptation
- Rotor coupling
- DIRK schemes
- SPLITM solution output
The following are not supported on the GPU (with no plans to support in the
future):
- Symmetric grids with reflection planes
- Different ITER,ITERT,ITERC values on different grids
More options will continue to be added in the future so please be patient. If
there are specific things (like different BCs) you would like, let us know and
we will put them on the list (no promises though).
Caveats:
There are several quirks of the code that behave differently on the GPU:
- Because everything is happening in parallel, the results are not deterministic and the solution might change very slightly run to run due to order-of-operations and other quirks. This difference should be very small and on the order of round-off error.
- Similarly, because the GPU path does things in a different order than the CPU, it will never exactly match the result the CPU obtains, but it should be very close.
- The upwind scheme does the face state limiting slightly differently on the GPU compared to the CPU. This slightly less dissipative scheme is faster, but it will also give slightly different answers. If you need the same behavior (except for the min-mod limiter) you can set -DREPRODUCING=True in your configure step.
- The CBCFIL routine on the GPU uses the same algorithm as the CPU when OVER_MP and CACHE are defined. With -DREPRODUCING=True, the CPU path will use this version of the algorithm.
- The SSOR path uses a colored sweeping scheme that is different from how the CPU path performs the sweeps. For most problems tested, the linear convergence was not significantly affected and may just require a couple more sweeps to achieve the same level of convergence.
- The resid.out file (and similar ones) does not have the correct value for the location of the max value (either J,K,L or X,Y,Z). This reduction operation is not currently cheaply available on the GPU, so it will report -99 for J,K,L. The rest of the output should be correct.
- Many of the timers are not hooked up in the GPU path, but the ones that are should be accurate.
- The NVIDIA compilers will frequently end the simulation reporting that various IEEE signals were raised. We have been unable to find where these occur, and these messages will appear even when the simulation is working correctly.