Parallelisation

ANUGA uses two main approaches to parallelisation: MPI-based domain decomposition (multiple processes, distributed memory) and OpenMP threading (multiple threads, shared memory). The two approaches can be combined.

An experimental GPU backend based on OpenMP target offloading is under development in the sp26 branch; see OpenMP Parallelisation for details.

For long parallel runs, see Checkpointing for how to save and restart simulations from periodic checkpoints.

Choosing an MPI distribution strategy

Four functions are available for distributing a domain across MPI ranks. The right choice depends on mesh size and how you construct the domain.

distribute()
    Rank-0 peak memory: Full domain + quantities
    When to use: Default choice. Rank 0 builds the complete Domain (mesh + all quantities), then sends submeshes to other ranks. Simple drop-in for serial scripts.

distribute_collaborative()
    Rank-0 peak memory: Full domain + quantities (shared)
    When to use: Same interface as distribute but rank 0 broadcasts topology via MPI shared memory (one copy per node instead of one per rank). Use when several ranks share a node and rank-0 memory is tight.

distribute_basic_mesh()
    Rank-0 peak memory: Topology only
    When to use: Rank 0 builds a lightweight Basic_mesh (no quantity arrays), distributes topology, then every rank sets its own quantities. Best when quantity arrays would exhaust rank-0 memory.

distribute_basic_mesh_collaborative()
    Rank-0 peak memory: Topology only (shared)
    When to use: Like distribute_basic_mesh but topology is broadcast via shared memory. Use for the largest meshes where even per-rank copies of the topology are expensive.

Decision guide

Does rank 0 have enough RAM to hold the full Domain (mesh + quantities)?
│
├─ Yes ──► Do multiple ranks share the same node?
│           ├─ No  ──► distribute()
│           └─ Yes ──► distribute_collaborative()
│
└─ No  ──► Does rank 0 have enough RAM for topology only (no quantities)?
            ├─ Yes ──► Do multiple ranks share the same node?
            │           ├─ No  ──► distribute_basic_mesh()
            │           └─ Yes ──► distribute_basic_mesh_collaborative()
            └─ No  ──► Reduce mesh resolution or use more nodes
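
The decision guide above can also be written as a small helper, which may be handy in a launch script. This is only a sketch: the helper choose_distribution_function and its three boolean arguments are hypothetical and not part of ANUGA. It returns the name of the suggested function as a string and does not call ANUGA itself.

```python
def choose_distribution_function(full_domain_fits, topology_fits, ranks_share_node):
    """Return the name of the function suggested by the decision guide.

    Hypothetical helper, not part of ANUGA.

    full_domain_fits  -- rank 0 can hold the full Domain (mesh + quantities)
    topology_fits     -- rank 0 can hold the topology only (no quantities)
    ranks_share_node  -- several MPI ranks run on the same node

    Returns None when even the topology does not fit, in which case the
    guide suggests reducing mesh resolution or using more nodes.
    """
    if full_domain_fits:
        # Rank 0 can build the complete Domain.
        if ranks_share_node:
            return "distribute_collaborative"
        return "distribute"

    if topology_fits:
        # Only the lightweight topology fits on rank 0.
        if ranks_share_node:
            return "distribute_basic_mesh_collaborative"
        return "distribute_basic_mesh"

    # Neither the full domain nor the topology fits on rank 0.
    return None


# Example: the full Domain is too large for rank 0, but the topology fits,
# and several ranks share each node.
print(choose_distribution_function(False, True, True))
# prints: distribute_basic_mesh_collaborative
```

The two memory questions have to be answered by the user, for example from the rough per-triangle estimates given in this section.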

As a rough guide, a mesh of N triangles with P quantities needs approximately 8 × N × P bytes on rank 0 for the quantity arrays alone (double precision, 8 bytes per value). Topology (coordinates + connectivity) adds about 56 × N bytes.
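
To make the rule of thumb concrete, the sketch below applies the two formulas to an example mesh. The function name estimate_rank0_memory, the mesh size of 10 million triangles and the quantity count of 5 are assumptions chosen for illustration; the per-triangle constants are the rough figures quoted above, not measured values.

```python
BYTES_PER_DOUBLE = 8              # double precision
TOPOLOGY_BYTES_PER_TRIANGLE = 56  # rough figure: coordinates + connectivity


def estimate_rank0_memory(n_triangles, n_quantities):
    """Return (quantity_bytes, topology_bytes) using the rough guide.

    Hypothetical helper, not part of ANUGA.
    """
    quantity_bytes = BYTES_PER_DOUBLE * n_triangles * n_quantities
    topology_bytes = TOPOLOGY_BYTES_PER_TRIANGLE * n_triangles
    return quantity_bytes, topology_bytes


# Illustrative values only: 10 million triangles, 5 quantities.
quantity_bytes, topology_bytes = estimate_rank0_memory(10_000_000, 5)
print(f"quantities: {quantity_bytes / 1e6:.0f} MB, "
      f"topology: {topology_bytes / 1e6:.0f} MB")
# prints: quantities: 400 MB, topology: 560 MB
```

Comparing these numbers with the RAM available on the node that hosts rank 0 gives a first answer to the two memory questions in the decision guide.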

See also

ANUGA User Manual, Chapter 12: Parallel Simulation, covers the MPI domain decomposition in depth, including ghost cell communication, scalability benchmarks, and HPC cluster setup.