GPU Acceleration (OpenMP target offloading)
ANUGA includes an experimental GPU backend that offloads the computationally intensive parts of the evolve loop — flux computation, extrapolation, friction, and momentum updates — to a GPU using OpenMP target offloading. No CUDA or Python GPU libraries are required.
Note
The GPU backend is experimental and under active development. The API
(set_multiprocessor_mode, operator support) may change in future
releases. For production runs where result reproducibility matters,
validate GPU output against mode=1 before switching.
Hardware and compiler requirements
Requirement |
Notes |
|---|---|
NVIDIA GPU |
Tested on NVIDIA GPUs. AMD ROCm GPUs supported in
principle via LLVM |
Compiler |
GCC 12+ with offload targets or LLVM/Clang with
|
Driver / toolkit |
CUDA toolkit matching your GPU driver version. |
OpenMP ≥ 4.5 |
Required for |
The ANUGA GPU extensions are built only when the compiler supports
-fopenmp-targets. On a standard conda or pip install the GPU extension
(sw_domain_gpu_ext) is absent; all GPU code paths fall back to the
CPU-OpenMP path automatically.
CPU-only mode (default)
When sw_domain_gpu_ext is not available (standard install) or when
OMP_TARGET_OFFLOAD=disabled is set, mode=2 still works correctly — it
runs the same C RK loop on the CPU without any device transfer. This is called
CPU_ONLY_MODE and is the mode used by the test suite in CI.
You can check which mode is active:
from anuga.shallow_water.sw_domain_gpu_ext import gpu_available
print(gpu_available()) # True only when a real GPU target is available
Quick start
The only change required to an existing ANUGA script is one call before the evolve loop:
import anuga
domain = anuga.rectangular_cross_domain(200, 200, len1=10000., len2=5000.)
domain.set_flow_algorithm('DE0')
# ... set quantities, boundaries, operators as normal ...
domain.set_multiprocessor_mode(2) # enable GPU mode
for t in domain.evolve(yieldstep=60., finaltime=3600.):
domain.print_timestepping_statistics()
set_multiprocessor_mode(2) initialises the GPU interface and switches the
evolve loop to a C-side Runge-Kutta loop that keeps all data resident on the
GPU between timesteps. Data is only transferred back to the host at each
yieldstep for Python I/O.
How it works
Python evolve() call
│
├─ map_to_gpu() host → device (once, at first yieldstep)
│
└─ for each yieldstep:
├─ C RK loop all kernels run on device (no CPU round-trips)
│ ├─ extrapolate_second_order_gpu
│ ├─ evaluate_boundary_gpu (all boundary types)
│ ├─ compute_fluxes_gpu (riverwall weir/orifice in-kernel)
│ ├─ update_conserved_quantities_gpu
│ ├─ protect_against_negatives_gpu
│ ├─ manning_friction_gpu
│ └─ [fractional step operators — see below]
│
└─ sync_from_device() device → host (once per yieldstep)
The data path is batched: a single H→D transfer at startup, a single
D→H transfer at each yieldstep. The inner timestep loop never touches host
memory, so GPU→CPU bandwidth is not a bottleneck.
Flow algorithm compatibility
The GPU backend supports the DE (discontinuous elevation) solver family only.
Algorithm |
GPU support |
|---|---|
|
✓ Full |
|
✓ Full |
|
✓ Full (SSP-RK3) |
|
✗ CPU only |
Use domain.set_flow_algorithm('DE0') (default) or 'DE1'/'DE2' for
GPU-accelerated runs.
Supported operators
Operators that are registered with the GPU domain execute entirely on the device. All other operators execute on the CPU via the standard fractional step mechanism; the domain is automatically synced to/from host before and after each CPU operator call.
Fully GPU-accelerated operators
Operator |
GPU implementation |
|---|---|
|
Batched; up to 64 instances |
|
Batched; up to 32 instances |
|
GPUCulvertManager; up to 64 total |
|
GPUCulvertManager |
|
GPUCulvertManager |
Riverwalls ( |
In-kernel (flux computation only) |
Operators that fall back to CPU
Operator |
Reason |
|---|---|
|
No GPU kernel yet |
|
No GPU kernel yet |
|
GPU via GPUCulvertManager (cross-boundary MPI supported) |
When a CPU fractional-step operator runs, ANUGA automatically syncs data from device before the operator executes and back to device afterwards. This adds one round-trip per operator call per timestep, so minimise the number of CPU operators when GPU performance is important.
Operator limits
The GPU domain uses dynamically-allocated arrays for operator registration,
so there is no hard limit on the number of Rate_operator, Inlet_operator,
or culvert operators. The arrays start at their default capacity and double
automatically as more operators are added.
The one remaining fixed limit is 64 triangles per inlet face — if a
culvert inlet region covers more than 64 mesh triangles, registration will
fail with a RuntimeError. Coarsen the mesh near the structure or split
the inlet region to stay within this limit.
Supported boundary conditions
All standard boundary types are supported in GPU mode:
Reflective_boundaryTransmissive_boundary,Transmissive_n_momentum_zero_t_momentum_set_stage_boundaryDirichlet_boundaryFile_boundary/Field_boundary(per-edge values pushed to device each sub-step)Time_boundary,Time_stage_zero_momentum_boundaryAbsorbing_wave_boundary(wave scalar pushed from Python; ghost state computed on device)Characteristic_wave_boundary(wave scalar pushed from Python; nonlinear characteristic kernel on device)Flather_external_stage_zero_velocity_boundary(exterior stage scalar pushed from Python; Blayo & Debreu characteristic decomposition kernel on device)
Custom boundary classes not in the above list are evaluated on the CPU with an automatic device sync.
Parallel (MPI + GPU)
The GPU backend is compatible with MPI domain decomposition. Each MPI rank uses one GPU device. Halo exchange between ranks uses CPU-side buffers; GPU-aware MPI (direct GPU-to-GPU transfer over NVLink/InfiniBand) is planned for a future release (G2.2).
# 4 MPI ranks, each using one GPU
mpiexec -np 4 python my_parallel_gpu_script.py
In the script each rank calls set_multiprocessor_mode(2) independently
after domain decomposition.
Checking the build and device
import anuga
# Check whether sw_domain_gpu_ext is present (required for mode=2)
try:
from anuga.shallow_water import sw_domain_gpu_ext
print("GPU extension available")
except ImportError:
print("GPU extension not available — running in CPU-only mode")
domain = anuga.rectangular_cross_domain(100, 100)
domain.set_flow_algorithm('DE0')
domain.set_multiprocessor_mode(2)
# Check device memory (prints estimate and device info if CUDA/HIP available)
# Raises RuntimeError if estimated memory exceeds device capacity.
# No-op in CPU_ONLY_MODE.
from anuga.shallow_water.sw_domain_gpu_ext import check_device_memory
check_device_memory(domain.gpu_interface.gpu_dom)
print(f"Mode: {domain.get_multiprocessor_mode()}") # 2
Performance tips
Use DE0 or DE1. The DE2 (SSP-RK3) scheme does 3× the kernel work per timestep in exchange for higher accuracy; only use it if accuracy requires it.
Minimise CPU fractional-step operators. Each CPU operator adds a full device sync round-trip. Prefer GPU-registered operators (see table).
Use large meshes. GPU occupancy is low for small meshes (< ~50 000 triangles). The GPU path is typically faster than OpenMP CPU for meshes larger than ~200 000 triangles on modern hardware.
Set OMP_NUM_THREADS=1 for GPU runs. The host-side RK loop is single-threaded in mode=2; extra OpenMP threads on the host compete for memory bandwidth without benefit:
OMP_NUM_THREADS=1 OMP_TARGET_OFFLOAD=mandatory python my_script.py
OMP_TARGET_OFFLOAD=mandatorycauses an error if the GPU target is not found, rather than silently falling back to CPU.Yieldstep interval. The D→H sync at each yieldstep is fast (~10 ms for a 1M-triangle domain), so yieldstep granularity has minimal impact on throughput.
Troubleshooting
ImportError: cannot import name 'init_gpu_domain'The
sw_domain_gpu_extmodule was not built. Rebuild with a compiler that supports-fopenmp-targets(GCC 12+ with offload targets or LLVM withlibomptarget).RuntimeError: GPU inlet face too largeA culvert inlet region covers more than 64 mesh triangles. Coarsen the mesh near the structure or split the inlet region.
RuntimeError: GPU device memory insufficientThe domain is too large for the available device memory. Use a smaller mesh, reduce the number of stored quantities, or use a GPU with more VRAM.
- Results differ between mode=1 and mode=2
Differences larger than ~1e-12 (machine epsilon for
float64) indicate a bug. Please open an issue at anuga-community/anuga_core with the minimal reproducer. Expected numerical difference is zero for CPU_ONLY_MODE and ≤ 1e-10 for real GPU hardware (due to non-associative floating-point reduction order).
See Also
OpenMP Parallelisation — CPU OpenMP threading (mode=1)
MPI Distribute Domain-first parallel workflow (distribute) — MPI domain decomposition