Checkpointing
Long simulations can be interrupted unexpectedly — a power cut, a job-queue time limit on an HPC cluster, or a network failure. ANUGA’s checkpointing mechanism guards against losing all progress by periodically saving the full domain state to disk. If the run is interrupted, it can be restarted from the most recent checkpoint rather than from time zero.
How it works
At regular intervals (controlled by wall-clock time or by the number of
yield steps) ANUGA serialises the entire Domain object — quantities,
boundary conditions, operators, and simulation time — to a pickle file. On restart, that
pickle file is loaded and the evolve loop continues from where it stopped.
Note
The dill package is used when
available (it can serialise a wider range of Python objects than the
standard pickle module). Install it with pip install dill or
conda install dill for best results, especially when operators or
user-defined functions are attached to the domain.
Checkpoint files are named:
<checkpoint_dir>/<domain_name>_<simulation_time>.pickle
For parallel runs each rank writes its own file:
<checkpoint_dir>/<domain_name>_P<nproc>_<rank>_<simulation_time>.pickle
Enabling checkpointing
Call domain.set_checkpointing() after the domain is fully configured
but before the evolve loop:
domain.set_checkpointing(
checkpoint_dir='CHECKPOINTS', # directory to store pickle files
checkpoint_time=900, # save every 900 s of wall time (15 min)
)
Parameter |
Default |
Description |
|---|---|---|
|
|
Set to |
|
|
Directory where pickle files are written. Created automatically if it does not exist. |
|
|
Save a checkpoint every N yield steps. Ignored if
|
|
|
Save a checkpoint every N seconds of wall-clock time.
Overrides |
Use checkpoint_time rather than checkpoint_step for production runs
so that checkpoints are written at predictable wall-clock intervals
regardless of how long each yield step takes.
Restarting from a checkpoint
Use the try / except pattern shown below. On a fresh run the try block
fails (no checkpoint files exist) and falls through to the normal domain
setup. On a restart it succeeds and the evolve loop picks up from the last
saved simulation time.
import anuga
from anuga import load_checkpoint_file
DOMAIN_NAME = 'my_simulation'
CHECKPOINT_DIR = 'CHECKPOINTS'
try:
# Attempt to load the most recent checkpoint
domain = load_checkpoint_file(domain_name=DOMAIN_NAME,
checkpoint_dir=CHECKPOINT_DIR)
print(f'Restarting from checkpoint at t = {domain.get_time():.1f} s')
except Exception:
# No checkpoint found — build the domain from scratch
domain = anuga.rectangular_cross_domain(100, 100, len1=10.0, len2=10.0)
domain.set_name(DOMAIN_NAME)
domain.set_quantity('elevation', lambda x, y: -1.0 - 0.1 * x)
domain.set_quantity('stage', expression='elevation + 0.5')
domain.set_quantity('friction', 0.03)
Br = anuga.Reflective_boundary(domain)
domain.set_boundary({'left': Br, 'right': Br, 'top': Br, 'bottom': Br})
# Enable checkpointing — save every 15 minutes of wall time
domain.set_checkpointing(checkpoint_dir=CHECKPOINT_DIR,
checkpoint_time=900)
# Evolve — same code whether starting fresh or restarting
for t in domain.evolve(yieldstep=60.0, finaltime=86400.0):
domain.print_timestepping_statistics()
Parameter |
Default |
Description |
|---|---|---|
|
|
Base name of the domain (must match |
|
|
Directory to search for pickle files. |
|
|
Load the checkpoint at this specific simulation time. By default the most recent checkpoint is used. |
Parallel simulations
Checkpointing works with MPI parallel runs. Each rank reads its own
checkpoint file automatically — the _P<nproc>_<rank> suffix is appended
internally by load_checkpoint_file. The try / except pattern is the
same as in the serial case, but the domain setup in the except branch
must follow the normal parallel structure (distribute on rank 0 etc.):
import anuga
from anuga import distribute, myid, numprocs, finalize, barrier
from anuga import load_checkpoint_file
DOMAIN_NAME = 'my_parallel_simulation'
CHECKPOINT_DIR = 'CHECKPOINTS'
try:
domain = load_checkpoint_file(domain_name=DOMAIN_NAME,
checkpoint_dir=CHECKPOINT_DIR)
except Exception:
if myid == 0:
domain = anuga.rectangular_cross_domain(200, 200,
len1=10.0, len2=10.0)
domain.set_name(DOMAIN_NAME)
domain.set_quantity('elevation', lambda x, y: -1.0 - 0.1 * x)
domain.set_quantity('stage', expression='elevation + 0.5')
else:
domain = None
domain = distribute(domain)
Br = anuga.Reflective_boundary(domain)
domain.set_boundary({'left': Br, 'right': Br, 'top': Br, 'bottom': Br})
domain.set_checkpointing(checkpoint_dir=CHECKPOINT_DIR,
checkpoint_time=900)
barrier()
for t in domain.evolve(yieldstep=60.0, finaltime=86400.0):
if myid == 0:
domain.print_timestepping_statistics()
domain.sww_merge(delete_old=False)
finalize()
Practical tips
- Choose checkpoint interval wisely
For a run expected to take several hours, saving every 15–30 minutes (
checkpoint_time=900to1800) strikes a good balance between protection and disk overhead. Checkpointing too frequently can slow the simulation noticeably.- Checkpoint files accumulate
ANUGA does not delete old checkpoint files automatically. Each checkpoint is a separate pickle file. Clean up stale files manually once a run has completed successfully, keeping the
CHECKPOINTSdirectory from growing unbounded across multiple runs.- SWW output is unaffected
Checkpointing saves domain state for restart purposes only. The SWW output file continues to be written normally throughout the run, so results up to the interruption point are preserved in the SWW file even without a restart.
- Keep domain name consistent
load_checkpoint_filematches files bydomain_name. If you change the name between runs the checkpoint will not be found.
A working example including MPI checkpointing is in
examples/checkpointing/runCheckpoint.py.
See also
ANUGA User Manual — Chapter 13: Checkpointing discusses the checkpointing mechanism in more detail, including how to manage checkpoint files across long cluster runs.