Performance Cookbook
The performance cookbook part of the GROMACS best practice guide
assumes your simulations are prepared appropriately and provides
concrete guidance on how best to run GROMACS simulations, i.e. execute
mdrun
, so as to make good use of available hardware and obtain
results in the shortest time possible, be it on a laptop, a multi-GPU
desktop workstation, a departmental cluster, or, especially, a large
supercomputer. This complements, and provides a bridge into navigating,
the detailed information provided in the “getting good performance
from mdrun” page in the GROMACS manual:
http://manual.gromacs.org/current/user-guide/mdrun-performance.html
GROMACS can generally be launched without specifying anything other than the essential simulation parameters: built-in heuristics detect the underlying hardware and draw on accumulated performance insights embedded in the code to make usually reasonable choices for the available number and types of CPU cores and/or GPUs. By default GROMACS also adapts dynamically during execution to improve performance. However, for any given simulation better-than-default choices may exist. Understanding how to control these by explicitly specifying parallel execution options, and how best to approach obtaining optimal use of available hardware, can make a significant difference to throughput and hence to the scientific results achieved over a given timespan, as well as to financial (cost) and environmental (energy usage) efficiency. When using high-performance computing resources there is additional clear value in knowing what scale of resources (number of cores / nodes / GPUs) is efficient to use, or how to go about finding this out.
As well as general guidance applicable to whatever machine you may be running GROMACS on, the performance cookbook provides concrete examples showing how to obtain good performance on a number of specific PRACE and EuroHPC machines, both as an illustration of the application of general best practice process for obtaining good performance, and to promote efficient usage of the named machines. The cookbook also provides a reference set of (near) optimal benchmark performance results obtained on these machines using best practice in order to aid estimation of required compute time allocations for researchers requesting such time.
General guidance on running mdrun
and strategy for getting good performance
Single node
When using GROMACS on a single node, that is to say a machine such as
a laptop, desktop workstation, or server, where all processor cores or
GPUs have access to a single shared memory, we typically run the
thread-mpi version of mdrun (gmx mdrun
).
CPU only
Running GROMACS with the default of one thread-mpi rank per core on a
single-node CPU-only machine is often optimal, however one can
experiment with different combinations of numbers of thread-MPI ranks
(controlled by varying -ntmpi N
) and numbers of OpenMP threads per
rank (controlled by varying -ntomp M
and setting the environment
variable OMP_NUM_THREADS=M
), such that M x N
is equal to the
total number of CPU cores available on the node. Many processors
support simultaneous multithreading (SMT), known as hyperthreading on
Intel processors, whereby each physical core can efficiently execute
multiple threads or processes. Enabling multithreading may boost
performance. To make use of SMT, -ntmpi N
and -ntomp M
should
be chosen such that M x N
equals the number of logical cores
identified by the operating system, i.e. the number of
physical processor cores multiplied by the number of simultaneous
multithreads supported by each physical core.
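For example, on a hypothetical node with 16 physical cores and 2-way SMT (topol.tpr is a placeholder for your own input), one could compare combinations such as the following:
# 16 thread-MPI ranks x 1 OpenMP thread each: one rank per physical core, SMT unused
export OMP_NUM_THREADS=1
gmx mdrun -ntmpi 16 -ntomp 1 -s topol.tpr
# 8 thread-MPI ranks x 4 OpenMP threads each: 32 threads in total, both SMT threads of every core in use
export OMP_NUM_THREADS=4
gmx mdrun -ntmpi 8 -ntomp 4 -s topol.tpr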
If executing on more than a certain number of ranks (at least 12), GROMACS by default
dedicates a certain fraction of the ranks to PME calculations. The
number of PME ranks is based on various heuristics that reflect what
the developers expect to be optimal based on the underlying algorithms
and what has often been found to give good performance. During
execution GROMACS attempts to reduce the load imbalance between PP and
PME ranks, unless this functionality is disabled by choosing
-tunepme no
. Even when enabled however, this dynamic PME tuning
does not change the number of PME ranks from what GROMACS decides
according to its built-in heuristics. It is possible to explicitly
specify the number of PME ranks to use with the -npme
option. You
should examine the md.log file produced and check whether there is any
warning suggesting you try a larger or smaller number of PME ranks
for your simulation. A performance-optimal number of PME ranks for a
given total number of ranks can be determined more systematically
using the gmx tune_pme tool, as described in the gmx tune_pme documentation.
For details, see Running mdrun within a single node and Process(-or) level parallelization via OpenMP
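As a minimal illustration (the rank counts and topol.tpr below are placeholders, not recommendations), the number of dedicated PME ranks can be set explicitly and the resulting balance checked in the log:
# 16 thread-MPI ranks, 4 of which are dedicated PME ranks
gmx mdrun -ntmpi 16 -ntomp 1 -npme 4 -s topol.tpr
# inspect the PME/PP load-balance notes written to the log
grep -i pme md.log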
CPU + GPU
By default GROMACS launches one thread-MPI rank per GPU. It is worth
experimenting with running more thread-MPI ranks, up to one rank per
CPU core. Whatever the number of ranks, choose -ntomp
to ensure
all available CPU cores are used, and apply similar considerations to
adjust -ntomp
for hyperthreading/SMT-capable processors as
described above for CPU-only execution.
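As a sketch, on a hypothetical workstation with 2 GPUs and 16 CPU cores (topol.tpr is a placeholder), one could compare the default-style mapping of one rank per GPU against several ranks per GPU:
# 2 thread-MPI ranks (one per GPU) x 8 OpenMP threads each
gmx mdrun -ntmpi 2 -ntomp 8 -s topol.tpr
# 8 thread-MPI ranks (four per GPU) x 2 OpenMP threads each
gmx mdrun -ntmpi 8 -ntomp 2 -s topol.tpr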
GPU offload options
By default GROMACS offloads short-range non-bonded force calculations
(-nb gpu
), PME calculations (-pme gpu
) and, on NVIDIA GPUs,
bonded force calculations (-bonded gpu
). For GROMACS 2022
PME offload to GPU can only be done by a single rank (i.e. a single
GPU). PME offload is also subject to a number of further known
limitations.
It is worth experimenting with offload options taking into account the
performance considerations for GPU tasks
and considering your hardware, in particular the number and respective
age and computational power of your processor(s) and your GPU(s). For
example, FFTs are offloaded by default if PME is offloaded (which also
happens by default), but this can be avoided with -pmefft cpu
and
may be beneficial for an older GPU sitting alongside a newer CPU.
Constraint calculations and coordinate updates default to the CPU but can
be offloaded with -update gpu
, though only to NVIDIA GPUs, only if
GROMACS executes on a single rank, and (for GROMACS 2020)
subject to further limitations (no free-energy, no virtual sites, no
Ewald surface correction, no replica exchange, no constraint pulling,
no orientation restraints and no computational electrophysiology).
For details, see Running mdrun within a single node, Node level parallelization via GPU offloading and thread-MPI, and Running mdrun with GPUs
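As an illustrative sketch (single-rank runs with 16 OpenMP threads and a placeholder topol.tpr; whether each variant helps depends on your hardware), the offload choices can be varied explicitly:
# default-style offload: short-range non-bonded and PME (including FFTs) on the GPU
gmx mdrun -ntmpi 1 -ntomp 16 -nb gpu -pme gpu -s topol.tpr
# keep the PME FFTs on the CPU, e.g. for an older GPU alongside a newer CPU
gmx mdrun -ntmpi 1 -ntomp 16 -nb gpu -pme gpu -pmefft cpu -s topol.tpr
# additionally offload constraint calculations and coordinate updates (single rank, NVIDIA only, subject to the limitations above)
gmx mdrun -ntmpi 1 -ntomp 16 -nb gpu -pme gpu -update gpu -s topol.tpr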
Multiple networked compute nodes in a cluster or supercomputer
In order to run GROMACS on a multi-node distributed-memory machine
such as a supercomputer we need to run the MPI-enabled version using
gmx_mpi mdrun
(or simply mdrun_mpi
), rather than gmx mdrun
(or simply mdrun
). In addition, GROMACS should be launched with a
parallel application launcher (mpirun
, mpiexec
, srun
, or
aprun
), which sets the number of MPI ranks N
:
mpirun -np N gmx_mpi mdrun <mdrun options>
CPU only
As for single-node simulations, running with 1 rank per core and 1 OpenMP thread per rank and therefore with as many MPI ranks per node as there are cores on each node is often optimal. It is however worth experimenting with multiple OpenMP threads per rank, especially as this often helps retain higher performance when scaling to more nodes.
Runs using multiple nodes are very likely to cause GROMACS to
automatically spawn dedicated PME ranks, for which you are advised to
follow the guidance already given above for the single node case. You
should be aware that if the number of PP ranks resulting from the
combined choice of total number of MPI ranks and number of PME ranks
has a largest prime divisor that GROMACS considers too large to give
good performance for the domain decomposition, it will throw a fatal
error and abort execution. In these cases it may be worth adjusting
the total number of ranks and/or the number of PME ranks to obtain
better performance, using OpenMP threading where necessary to ensure
all cores on each node are utilised, though it is possible that
overall performance may be higher with an optimal choice of the
number of PME ranks even if this leaves a small number of cores on
each node unused. As explained for the single-node case, a
performance-optimal number of PME ranks for a given total number of
ranks can be determined in a systematic way using the gmx tune_pme
tool, as described in the gmx tune_pme documentation.
The OpenMP threading for PME ranks can be chosen to be different than
for standard (i.e. PP) ranks using the -ntomp_pme
option,
providing added flexibility to help utilise all cores on each node.
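As a sketch (the node, rank and thread counts are illustrative only, assuming hypothetical 128-core nodes), an explicit choice of PME ranks and per-rank threading might look like:
# 4 nodes x 128 cores: 256 MPI ranks x 2 OpenMP threads, 32 ranks dedicated to PME
export OMP_NUM_THREADS=2
mpirun -np 256 gmx_mpi mdrun -npme 32 -ntomp 2 -ntomp_pme 2 -s topol.tpr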
CPU + GPU
Broadly the same considerations as for single-node use apply with regards to determining an optimal choice of number of MPI ranks x OpenMP threads and GPU offloading options.
Offloading PME calculations to GPU can only
take place on a single rank using a single GPU (GROMACS 2022). The performance
advantage this typically gives over CPU-based PME computation on a
single or small number of nodes can therefore start to diminish when
running on larger numbers of nodes thanks to the growing compute power
of an increasing number of PME ranks running on CPU cores. One
therefore expects a crossover point beyond which it is faster on a
given number of nodes to run multiple dedicated PME ranks - e.g. a
number chosen optimally using `tune_pme`
- on CPU cores, instead
of offloading all PME calculations to a single GPU. The node count at
which this occurs will depend on factors such as the system size,
chosen MD parameters, and the relative performance of GPUs and CPUs
available in a machine.
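The two regimes can be compared directly, as in the following sketch (the rank counts and -npme value are illustrative, and topol.tpr is a placeholder):
# few nodes: offload PME to a single GPU via a single dedicated PME rank
mpirun -np 8 gmx_mpi mdrun -nb gpu -pme gpu -npme 1 -s topol.tpr
# more nodes: keep PME on dedicated CPU ranks, ideally a number chosen with gmx tune_pme
mpirun -np 64 gmx_mpi mdrun -nb gpu -pme cpu -npme 16 -s topol.tpr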
General guidance for benchmarking
Before beginning expensive mdrun
simulations you should benchmark your
system to ensure you are using the optimal amount of HPC resources.
Usually this means deciding how many nodes/CPUs/GPUs to use.
To do this you should take your system and run it for a short time (10,000 steps should be sufficient) at a series of increasing CPU counts, then look at the performance figure in ns/day for each run. It is helpful to plot a graph of CPU count vs performance. Examples of these are shown in the following section for different PRACE/EuroHPC machines.
You will see that for increasing CPU count you get diminishing returns on performance gains. It is up to you to choose a core count that balances the time you have to wait for the simulation to finish, and the cost of running the simulations. Often the appropriate number is the point on the performance curve before it starts to plateau.
Some mdrun flags that can help with benchmarking are:
-dlb yes : turns on dynamic load balancing, which shifts particles between MPI ranks to optimize performance. This can interfere with the tunepme setting, which optimizes various aspects of the PME and DD algorithms by shifting load between ranks.
-notunepme : turns off PME load balancing because it can interfere with the -dlb yes setting. The PME settings can be tuned separately using gmx tune_pme.
-noconfout : does not create the output conformation, as this is not needed for benchmarking.
-resethway : resets the performance timers halfway through the run; this removes the overhead of initialization and load balancing from the reported timings.
-nsteps 10000 : forces mdrun to run for only 10,000 steps; this lets you run short benchmarks using your production-ready tpr file.
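Putting these flags together, a short benchmark run might look like the following sketch (the rank count and input.tpr are placeholders for your own system and machine):
mpirun -np 128 gmx_mpi mdrun -s input.tpr -nsteps 10000 -resethway -noconfout -dlb yes -notunepme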
Things to investigate:
Some systems will have different versions of GROMACS built with different compilers; these often differ in performance.
Some systems allow Simultaneous Multi-Threading (SMT) to be turned on; it may provide a performance boost for GROMACS.
For large systems using many nodes, hybrid OpenMP/MPI can offer increased performance over pure MPI.
GPU specific notes:
For GPU systems you can try different GPU offload scenarios.
For GPU systems it is generally best to have 1 MPI rank or 1 thread-MPI rank per GPU, but this may not always be the case.
For GPU systems the parallel efficiency beyond 1 GPU is often poor; you will not see the same strong scaling as on CPU-only systems. However, the performance on 1 GPU should be greater than on an equivalent CPU node.
Getting good GROMACS performance on PRACE/EuroHPC machines
This section provides guidance and concrete recipes showing how to build and run GROMACS for good performance on some of the largest EU-based supercomputers available to EU researchers through PRACE and EuroHPC.
General guidance for each machine is complemented by an analysis
illustrating the effect of key runtime execution choices on mdrun
performance for a range of benchmark simulations. Comparison between
these and your own simulations should help you determine how best to
run your own simulations on these machines and how to obtain good
performance.
Benchmarks
A brief description is provided below of the benchmarks used to illustrate how to obtain good performance on PRACE/EuroHPC machines for a range of system sizes and types.
Benchmarks prefixed with “bench” are available from the Dept. of Theoretical and Computational Biophysics at the Max Planck Institute for Biophysical Chemistry, Göttingen: https://www.mpibpc.mpg.de/grubmueller/bench
Benchmarks suffixed with “_HBS” are available from the UK’s HECBioSim consortium of computational biomolecular researchers: https://www.hecbiosim.ac.uk/benchmarks
- 20k_HBS:
3NIR Crambin
Total number of atoms: 19,605
Protein atoms: 642
Water atoms: 18,963
Input parameters: 20k_HBS.mdp
- benchMEM:
Protein in membrane, surrounded by water
Total number of atoms: 82k
Input parameters: benchMEM.mdp
- 465k_HBS:
hEGFR Dimer of 1IVO and 1NQL
Total number of atoms: 465,399
Protein atoms: 21,749
Lipid atoms: 134,268
Water atoms: 309,087
Ions: 295
Input parameters: 465k_HBS.mdp
- benchRIB:
Ribosome in water
Total number of atoms: 2M
Input parameters: benchRIB.mdp
- benchPEP:
Peptides in water
Total number of atoms: 12M
Input parameters: benchPEP.mdp
GROMACS performance on HAWK (HLRS, Germany)
https://www.hlrs.de/systems/hpe-apollo-hawk/
HAWK is listed as number 16 on the Top500 (November 2020), the 6th largest European HPC system, and is accessible through PRACE access mechanisms.
Hardware
Each node has:
Processors: 2 x 64-core AMD EPYC 7742 @ 2.25GHz
Memory: 256GB RAM
Interconnect: InfiniBand HDR200
Software
Relevant software stack on system (available to all users via environment modules):
HPE MPT and OpenMPI MPI libraries
FFTW (Zen2 architecture-specific build)
Build
A multinode-capable MPI-enabled version of GROMACS with good performance on HAWK can be built as follows:
module load fftw
cd gromacs-2020.2
mkdir build
cd build
cmake .. -DCMAKE_INSTALL_PREFIX=${HOME}/gromacs/2020.2 -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx \
-DGMX_MPI=on -DGMX_SIMD=AVX2_256 -DGMX_GPU=off -DGMX_BUILD_SHARED_EXE=off -DBUILD_SHARED_LIBS=off \
-DGMX_FFT_LIBRARY=fftw3 -DCMAKE_PREFIX_PATH=${FFTW_ROOT}
make -j 64
make install
Run
The example job script below shows how to run GROMACS on HAWK for 1 hour on 8 nodes with 128 MPI ranks per node and 2 OpenMP threads per rank. Each physical core on the two 64-core AMD EPYC processors on HAWK supports two simultaneous multithreads (SMTs) - two logical cores - providing a total of 256 usable logical cores per node. The example launches a total of 256 threads across 128 ranks, implying that we intend to make use of all logical cores. The omplace -ht compact
option should be used when running GROMACS with simultaneous multithreading, as it ensures that similarly numbered MPI ranks, as well as OpenMP threads belonging to the same MPI rank, execute as close together in the processor’s (and indeed the node’s) memory hierarchy as possible; in the example, both OpenMP threads of a rank run on the same physical core. Not using the compact omplace option was found to frequently lead to lower performance when using SMT.
#!/bin/bash -l
#PBS -N benchRIB
#PBS -l select=8:mpiprocs=128:ompthreads=2
#PBS -l walltime=01:00:00
# Change to the directory that the job was submitted from
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=2
mpirun -np 1024 omplace -nt $OMP_NUM_THREADS -ht compact gmx_mpi mdrun -s benchRIB.tpr -ntomp $OMP_NUM_THREADS
MPI x OpenMP hybrid parallel execution and simultaneous multithreading
In order to better understand how GROMACS utilises the available
hardware on HAWK and how to get good performance we can examine the
effect on benchmark performance of the choice of the number of MPI
ranks per node and OpenMP thread and use of SMT for a given number of
HAWK nodes. Doing this systematically is facilitated in the first
instance by disabling dynamic load balancing (-dlb no
) and PME
tuning (-tunepme no
), which also allows us to illustrate the
strength of load imbalance in different execution scenarios.
The figures below show how benchmark performance for GROMACS 2020.2 scales
with increasing node count on HAWK for different combinations of MPI
ranks and OpenMP threads per rank. Results using simultaneous
multithreading (SMT) are not shown as these follow similar trends but
broadly speaking yield slightly lower performance on HAWK for the
benchmarks examined, even with compact placement assigned using
omplace
.
It is clear that the choice of MPI x OpenMP hybrid decomposition has a very significant effect on performance. As a general rule on HAWK, performance is better with fewer OpenMP threads per MPI rank (typically 1, 2 or 4), and hence more MPI ranks per node. Using more than 1 (but no more than 4) OpenMP threads per rank may enable better performance, especially on larger numbers of nodes. OpenMP multithreading also allows runs on larger numbers of nodes to proceed without the fatal error that results when the number of PP ranks has too large a prime factor as its largest divisor, simply by reducing the total number of MPI ranks.
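For reference, and as a sketch only, the per-node decompositions compared here correspond to PBS resource selections such as the following (shown for 8 nodes without SMT; the mpirun -np value must match nodes x mpiprocs, and OMP_NUM_THREADS must match ompthreads):
# 128 ranks x 1 OpenMP thread per node
#PBS -l select=8:mpiprocs=128:ompthreads=1
# 64 ranks x 2 OpenMP threads per node
#PBS -l select=8:mpiprocs=64:ompthreads=2
# 32 ranks x 4 OpenMP threads per node
#PBS -l select=8:mpiprocs=32:ompthreads=4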
Tuning the number of PME ranks
As mentioned in the manual
the domain decomposition (DD) load balancing functionality, which is
enabled by default and disabled with -dlb no
, is important for
achieving good performance for spatially heterogeneous
systems. However PME and DD load balancing can interfere with each
other. To improve on the unbalanced performance shown in the above
benchmark figures, a systematic approach can therefore be taken by
separately tuning the number of PME ranks for a given total number of ranks
using the gmx tune_pme
tool as described in the section on
tune_pme
in the manual.
On HAWK, we could tune the number of PME ranks to improve the
performance of, for example, the benchRIB benchmark running on 16
nodes with 32 MPI ranks per node (512 ranks in total) and 4 OpenMP
threads per rank, which we saw in the above results is already a good
choice considering unbalanced performance and which has scope to
improve significantly through reduction of observed load imbalance.
The following script allows one to run tune_pme
on HAWK to do this
and thereby determine an optimal choice for -npme
:
#!/bin/bash -l
#PBS -N tunepme
#PBS -l select=16:mpiprocs=32:ompthreads=4
#PBS -l walltime=01:00:00
# Change to the directory that the job was submitted from
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=4
export PATH=$HOME/gromacs/2020.2/mpi/AVX2_256/bin:$PATH
gmx_mpi tune_pme -np 512 -mdrun "omplace -nt ${OMP_NUM_THREADS} gmx_mpi mdrun" -s benchRIB.tpr -ntomp $OMP_NUM_THREADS -dlb no
GROMACS performance on Discoverer (PetaSC Bulgaria)
https://sofiatech.bg/en/petascale-supercomputer/
Hardware
- Each node has:
Processors: 2 x 64-core AMD EPYC 7H12 @ 2.6GHz
Memory: 256 GB RAM
Interconnect: Infiniband HDR
Software
- Relevant software stack on system (available to all users via environment modules):
GCC, AOCC, and Intel oneAPI compilers
MPICH, OpenMPI and Intel MPI libraries
Build
A multinode-capable MPI-enabled version of GROMACS using OpenMPI and GCC with good performance on Discoverer can be built as follows:
#!/bin/bash
# download the source code
wget https://ftp.gromacs.org/gromacs/gromacs-2021.4.tar.gz
# extract
tar xvf gromacs-2021.4.tar.gz
# move into the source code directory
cd gromacs-2021.4
# make the build directory
mkdir build
cd build
# load the required modules
module purge
module load cmake
module load fftw/3/3.3.10-gcc-openmpi
module load openmpi/4/gcc/latest
module load openblas/latest-gcc
# cmake configuration
cmake .. -DCMAKE_INSTALL_PREFIX=$(pwd)/.. -DGMX_MPI=ON -DGMX_OPENMP=ON \
-DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx \
-DREGRESSIONTEST_DOWNLOAD=ON \
-DGMX_FFT_LIBRARY=fftw3 -DFFTWF_LIBRARY=/opt/software/fftw/3/3.3.10-gcc-openmpi/lib/libfftw3f.so -DFFTWF_INCLUDE_DIR=/opt/software/fftw/3/3.3.10-gcc-openmpi/include
# build
make -j32
# Optional: test the build
make -j32 check
# install
make install
Run
An example job script to run mdrun
on Discoverer using 2 nodes with 128 MPI tasks per node is shown below.
Each node on Discoverer has two 64-core AMD EPYC processors. This gives a total of 128 physical cores per node.
The CPUs have Simultaneous Multi-Threading (SMT) which means each physical core has two logical cores, so each node will
appear to have 256 cores to most applications. We have found that SMT offers a small performance benefit, and thus recommend keeping it turned on. For most jobs this means that per node you should have 128 MPI tasks (1 per physical core) and 2 OpenMP threads per MPI task (1 per logical core, with 2 logical cores per physical core).
For consistent benchmarking we use the additional arguments explained in General guidance for benchmarking.
#!/bin/bash
#SBATCH --partition=cn
#SBATCH --job-name=gmx_mdrun
#SBATCH --time=24:00:00
#SBATCH --nodes 2 # 2 Nodes
#SBATCH --ntasks-per-node 128 # 1 MPI task per physical core
#SBATCH --cpus-per-task 2 # 2 OpenMP threads per MPI task, one per logical core (SMT in use)
module purge
# Load the most recent OpenMPI+GCC build
module load gromacs/2021/latest-intel-nogpu-openmpi-gcc
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
# standard run command for mdrun
mpirun gmx_mpi mdrun -ntomp ${SLURM_CPUS_PER_TASK} -v -s input.tpr
# command for consistent benchmarking performance
#mpirun gmx_mpi mdrun -ntomp ${SLURM_CPUS_PER_TASK} -v -s input.tpr -resethway -dlb yes -notunepme -noconfout
Comparison of different builds
There are centrally installed versions of GROMACS that can be obtained via module load commands:
module load gromacs/2021/latest-intel-nogpu-mpi : GCC compilers + MPICH MPI
module load gromacs/2021/latest-intel-nogpu-openmpi-gcc : GCC compilers + OpenMPI MPI
module load gromacs/2021/latest-intel-nogpu-openmpi-aocc : AOCC compilers + OpenMPI MPI
module load gromacs/2021/latest-oneapi-nogpu-mpi : Intel compilers + Intel MPI
The figures below show the performance for the 4 different builds for the benchmark suite.
All benchmarks were run using 128 MPI tasks per node and 2 OpenMP threads per MPI task (SMT on), with the exception of benchRIB, for which 32 MPI tasks x 8 OpenMP threads per node were used. This was done because the error:
Fatal error (abort):
There is no domain decomposition for N ranks that is compatible with the
given box and a minimum cell size of x nm
Change the number of ranks or mdrun option -rdd or -dds
Look in the log file for details on the domain decomposition
is encountered when 1028 total MPI tasks are used; hybrid MPI/OpenMP parallelization must therefore be used to scale up to 16 nodes.
The results show that for the smaller systems the best performance is obtained with OpenMPI and GCC; if hybrid MPI/OpenMP parallelization is needed, then the MPICH build gives the best performance; and for the largest system the OpenMPI + AOCC build gives the best performance.
GROMACS performance on Piz Daint (CSCS, Switzerland)
Hardware
Focus on XC50 GPU partition
Each node:
Processor: 1 x 12-core Intel Xeon E5-2690 v3 @ 2.60GHz (one socket)
Memory: 64GB RAM
GPU: 1 x NVIDIA P100
CRAY Aries interconnect
Software
Relevant software stack on system (available to all users via environment modules):
Cray MPICH MPI library
Cray-optimised FFTW
Cray-libsci provides BLAS & LAPACK
craype-accel-nvidia60 targets the correct SM architecture to compile for the P100 GPU
Build
Build instructions for Piz Daint’s XC50 GPU partition:
module load daint-gpu
module swap PrgEnv-cray PrgEnv-gnu
module load cray-fftw
module load craype-accel-nvidia60
cd gromacs-2020.2
mkdir build
cd build
cmake .. -DCMAKE_INSTALL_PREFIX=${HOME}/gromacs/2020.2 -DCMAKE_C_COMPILER=cc -DCMAKE_CXX_COMPILER=CC \
-DGMX_MPI=on -DGMX_GPU=on -DGMX_SIMD=AVX2_256 -DGMX_FFT_LIBRARY=fftw3 -DGMX_HWLOC=on
make -j 12
make install
Run
Example job script to run mdrun
on Piz Daint’s XC50 GPU partition for 1 hour on 4 nodes, with 1 MPI rank per node and 12 OpenMP threads per rank, without hyperthreading
#!/bin/bash -l
#SBATCH --job-name=benchmark
#SBATCH --time=01:00:00
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=12
#SBATCH --ntasks-per-core=1 # 1 = no hyperthreading, 2 = with hyperthreading
#SBATCH --hint=nomultithread # nomultithread = no hyperthreading, multithread = hyperthreading
#SBATCH --partition=normal
#SBATCH --constraint=gpu
module swap PrgEnv-cray PrgEnv-gnu
module load daint-gpu
module load cray-fftw
module load craype-accel-nvidia60
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export CRAY_CUDA_MPS=1
export PATH=${HOME}/gromacs/2020.2/bin:$PATH
srun gmx_mpi mdrun -s benchmark.tpr -ntomp ${OMP_NUM_THREADS}
GPU offload scenarios, MPI x OpenMP hybrid parallel execution and simultaneous multithreading
In order to better understand how GROMACS utilises the available
hardware on Piz Daint and how to get good performance we can examine
the effect on benchmark performance of the choice of which
calculations to offload to GPU, the number of MPI ranks per node and
OpenMP threads per rank, and use of simultaneous multithreading. Doing
this systematically is facilitated in the first instance by disabling
dynamic load balancing (-dlb no
) and PME tuning (-tunepme no
),
which also allows us to illustrate the strength of load imbalance in
different execution scenarios.
The figures below show how benchmark performance for GROMACS 2020.2 scales with increasing node count on Piz Daint for different GPU offload scenarios, different combinations of MPI ranks and OpenMP threads, and with and without use of simultaneous multithreading (SMT). Only the most relevant, i.e. some of the best-performing, execution scenarios are shown for each benchmark. Offloading of coordinate updates is not included as this is only relevant for single-node runs using a single rank (GROMACS 2020), where it is however worth using to obtain good performance.
As argued in the general guidance above for GPU offloading, there is a crossover point beyond which overall performance is higher using dedicated PME ranks running on CPU cores rather than offloading PME computation to a single GPU. For smaller systems this point is more likely to lie beyond the number of nodes that it is reasonable to use, i.e. past the point of badly diminishing returns in performance gained from employing additional nodes (low parallel efficiency), meaning that for all runs with reasonable parallel efficiency GPU offloading of PME gives the highest performance. For larger systems, however, the crossover point comes before the limit of good scaling: while the best performance on small numbers of nodes is obtained with PME offloading, to scale well to larger numbers of nodes this should be disabled and only nonbonded interactions should be offloaded to the GPU.
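Concretely, the two regimes correspond to mdrun offload flags along the following lines (a sketch based on the job script above; the -npme value is illustrative and is best chosen with gmx tune_pme):
# PME offloaded to the GPU of a single rank: usually fastest on small node counts
srun gmx_mpi mdrun -s benchmark.tpr -ntomp ${OMP_NUM_THREADS} -nb gpu -pme gpu -npme 1
# only nonbonded offloaded, PME on dedicated CPU ranks: requires enough MPI ranks, scales better to larger node counts
srun gmx_mpi mdrun -s benchmark.tpr -ntomp ${OMP_NUM_THREADS} -nb gpu -pme cpu -npme 8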
GROMACS performance on Juwels Booster module (JSC, Germany)
https://apps.fz-juelich.de/jsc/hps/juwels/booster-overview.html
Hardware
Focus on Booster module
- Each node has:
Processor: 2x24 core AMD EPYC 7402 (48 physical cores total, 96 logical cores with SMT)
Memory: 512 GB RAM
GPU: 4x NVIDIA A100, 40 GB
Interconnect: Mellanox HDR200 InfiniBand
Software
- Relevant software stack:
Compilers: GCC, Intel, NVHPC
MPI runtimes: OpenMPI, ParaStationMPI
CUDA
Build
Build instructions for a GPU version of GROMACS 2022 using CUDA-aware MPI with GCC and OpenMPI:
#!/bin/bash
# Download the source code
wget https://ftp.gromacs.org/gromacs/gromacs-2022.tar.gz
# extract
tar xvf gromacs-2022.tar.gz
# move into the source directory
cd gromacs-2022
# make the build directory
mkdir build
cd build
# load the required modules
module load Stages/2022
module load GCC/11.2.0
module load CUDA/11.5
module load OpenMPI/4.1.2
module load CMake/3.21.1
module load FFTW/3.3.10
cmake .. -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx -DGMX_FFT_LIBRARY=fftw3 -DGMX_GPU=CUDA -DGMX_MPI=ON -DGMX_OPENMP=ON -DCMAKE_INSTALL_PREFIX=$(pwd)/.. -DGMX_HWLOC=ON
make -j20
make install
Build instructions for a GPU version of GROMACS 2022 without MPI, using GCC:
#!/bin/bash
# Download the source code
wget https://ftp.gromacs.org/gromacs/gromacs-2022.tar.gz
# extract
tar xvf gromacs-2022.tar.gz
# move into the source directory
cd gromacs-2022
# make the build directory
mkdir build
cd build
# load the required modules
module load GCC/11.2.0
module load CUDA/11.5
module load CMake/3.21.1
module list
cmake .. -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DGMX_BUILD_OWN_FFTW=yes -DGMX_GPU=CUDA -DGMX_MPI=OFF -DGMX_OPENMP=ON -DCMAKE_INSTALL_PREFIX=$(pwd)/.. -DGMX_HWLOC=ON
make -j20
make install
Run
Batch script to run on one node using all 4 GPUs; with GPU offloading for nonbonded, pme, and bonded; and GPU direct communication enabled.
#!/bin/bash
#SBATCH --account=<ACCOUNT>
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --hint=nomultithread # turn off SMT
#SBATCH --ntasks-per-node=4 # 1 MPI task per GPU
#SBATCH --cpus-per-task=12
#SBATCH --time=01:00:00
#SBATCH --partition=booster
#SBATCH --gres=gpu:4 # Use all 4 GPUs
# load the required modules
module load GCC/11.2.0
module load CUDA/11.5
module load OpenMPI/4.1.2
module load FFTW/3.3.10
# Allow GROMACS to see all 4 GPUs
export CUDA_VISIBLE_DEVICES=0,1,2,3
# Enable direct GPU to GPU communications
export GMX_ENABLE_DIRECT_GPU_COMM=true
# activate user install of GROMACS 2022
source <user install location>/gromacs-2022/bin/GMXRC
srun gmx_mpi mdrun -s INPUT.tpr -ntomp ${SLURM_CPUS_PER_TASK} -nb gpu -pme gpu -bonded gpu -npme 1
Benchmark results
We investigated the effect of different GPU offloading scenarios and other performance settings on the strong scaling of our benchmark suite.
All simulations were run using 1 MPI task per GPU, 12 OpenMP threads per MPI task, and SMT turned off.
- The different options are as follows:
nb - only non-bonded interactions are offloaded to the GPU.
nb, pme, bonded - all interactions (non-bonded, PME, and bonded) are offloaded to the GPU.
direct GPU comms - the GMX_ENABLE_DIRECT_GPU_COMM flag is set. This is a non-default feature that enables communication between GPUs to be done directly between the GPU memory spaces rather than being routed via the CPUs, which can offer performance benefits. It is only possible when using GROMACS' internal thread-MPI or if GROMACS was built with a CUDA-aware MPI library.
cpu_bind - explicit CPU bindings are used with the srun command. The binding command used is srun --cpu_bind=mask_cpu:0xFFF000,0xFFF,0xFFF000000000,0xFFF000000 gmx_mpi mdrun. This reflects the specific hardware layout of Juwels Booster.
update - the -update gpu option is used. This can only be used when the constraints setting is changed from constraints=all-bonds to constraints=h-bonds. (We can only do this for the benchPEP system, where we use the benchPEP-h.tpr from the previously referenced webpage.)
We see that the best performance is obtained when as much as possible is offloaded to the GPU. The explicit CPU binding settings slightly increase performance. The largest extra performance increase occurs when the -update gpu option can be set, as this results in as much of the calculation as possible being run on the GPU. The direct GPU communication setting also increases performance; however, this feature is still in development and sometimes gives errors, so output should be checked before relying on it.
The parallel scaling is generally poor when more than 1 node is used. This is because GROMACS 2022 can have only 1 PME GPU rank. On 1 Juwels Booster node there are 4 GPUs, so 1 of the 4 GPUs is used for PME calculations, which is close to ideal for typical GROMACS simulations. On 2 nodes only 1 of 8 GPUs is used for PME calculations; this is under-resourced relative to the other 7 GPUs doing PP calculations, which have to wait for the PME calculations, resulting in poorer parallel performance.
Multiple simulations per Node/ per GPU
Only large systems can fully make use of an entire Juwels Booster node (4 GPUs). For smaller systems it can be beneficial to run multiple simulations per node, or even per GPU, to get higher throughput. On Juwels Booster you are always charged for a full node.
We have found that you must use the non-MPI version of GROMACS for this, because it can be launched without srun, giving more flexibility in sharing the GPU resources.
Suppose you have a set of 8 folders labelled 1 2 3 4 5 6 7 8. You can run 8 simultaneous simulations, each using 6 threads and each sharing a GPU with one other simulation. The script below will launch such a set of simulations:
#!/bin/bash
#SBATCH --account=<ACCOUNT>
#SBATCH --nodes=1
#SBATCH --ntasks=8 # number of simulations
#SBATCH --hint=nomultithread
#SBATCH --ntasks-per-node=8 # number of simulations
#SBATCH --cpus-per-task=6 # threads per simulation
#SBATCH --time=01:00:00
#SBATCH --partition=booster
#SBATCH --gres=gpu:4 # use all 4 GPUs
#SBATCH --cuda-mps # turn on CUDA MPS to enable multiple processes per GPU
# load the required modules
module load GCC/11.2.0
module load CUDA/11.5
module list
# Allow GROMACS to see all 4 GPUs
export CUDA_VISIBLE_DEVICES=0,1,2,3
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
# make sure we are using a non-mpi version of gromacs
source <user install location>/gromacs-2022_NOMPI/bin/GMXRC
# input file
TPR=input.tpr
EXE="gmx mdrun" # Non MPI mdrun
# Set thread mpi to 1 with -ntmpi option
SETTINGS="-s ${TPR} -ntmpi 1 -ntomp ${SLURM_CPUS_PER_TASK} -nb gpu -pme gpu -bonded gpu"
# Options below for 4,8,16,24, or 48 simulations per node.
# Uncomment the desired "folder" option and the corresponding "gputasks" option,
# make sure the rest are commented out and ensure the --ntasks,
# --ntasks-per-node, and --cpus-per-task are set to the corresponding values.
# Currently the 8 simulations option is being used.
# 4 simulations: (make sure ntasks and ntasks-per-node=4 and cpus-per-task=12)
#folders=(1 2 3 4)
#gputasks=(11 00 33 22)
# 8 simulations: (make sure ntasks and ntasks-per-node=8 and cpus-per-task=6)
folders=(1 2 3 4 5 6 7 8)
gputasks=(11 11 00 00 33 33 22 22)
# 16 simulations: (make sure ntasks and ntasks-per-node=16 and cpus-per-task=3)
#folders=(1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16)
#gputasks=(11 11 11 11 00 00 00 00 33 33 33 33 22 22 22 22)
# 24 simulations: (make sure ntasks and ntasks-per-node=24 and cpus-per-task=2)
#folders=(1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24)
#gputasks=(11 11 11 11 11 11 00 00 00 00 00 00 33 33 33 33 33 33 22 22 22 22 22 22)
#48 simulations: (make sure ntasks and ntasks-per-node=48 and cpus-per-task=1)
#folders=(1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48)
#gputasks=(11 11 11 11 11 11 11 11 11 11 11 11 00 00 00 00 00 00 00 00 00 00 00 00 33 33 33 33 33 33 33 33 33 33 33 33 22 22 22 22 22 22 22 22 22 22 22 22 )
# make cpu bindings to ensure cores are not shared between the simulations
# this works for any of the simulation number options, does not need to be changed
bindings=($( for (( i=0 ; i<48 ; i+=${SLURM_CPUS_PER_TASK} )); do echo $i-$((${i} + ${SLURM_CPUS_PER_TASK} - 1 )); done ))
# This works in conjunction with the gputasks settings which map the process to GPU close to the
# specified GPU.
# launch all simulations in the specified folders in parallel
for i in ${!folders[@]}
do
cd ${folders[$i]}
pwd
echo ${bindings[$i]}
# use numactl to bind the simulations to specific cores.
numactl --physcpubind=${bindings[$i]} ${EXE} ${SETTINGS} -gputasks ${gputasks[$i]} > out.txt 2>&1 &
cd ..
done
wait
The script also contains options for 4, 16, 24, or 48 simulations per node.
CUDA MPS is enabled in the script, which allows for more efficient GPU sharing.
The folders array contains the 8 folder names. The gputasks array contains the GROMACS GPU ID mapping. There are two digits per simulation for the two GPU tasks per simulation: PP and PME. These are the same number so that both run on the same GPU. The order is specific to the Juwels Booster setup.
The bindings array evaluates to: (0-5 6-11 12-17 18-23 24-29 30-35 36-41 42-47)
With the gputasks array: (11 11 00 00 33 33 22 22)
This results in the first simulation using CPUs 0 to 5 and GPU 1; CPUs 0 to 5 are the cores most closely connected to GPU 1, as per the Juwels Booster hardware documentation: https://apps.fz-juelich.de/jsc/hps/juwels/booster-overview.html.
The plot below shows the total throughput on one node for multiple simulations. This is for the benchMEM benchmark.
BenchMEM – throughput for multiple simulations per Node/ per GPU.
We can see that the highest throughput is achieved when 16 simulations are run per node (4 simulations per GPU). For smaller systems the optimum number of simulations per node will be larger; for larger systems it will be smaller. The CUDA MPS setting significantly increases the throughput, because it allows multiple processes to share a single GPU more efficiently.
GROMACS Reference Benchmarks Performance on PRACE/EuroHPC machines
This section provides a reference set of benchmark simulation performance results representative of good obtainable performance on PRACE/EuroHPC machines with GROMACS built and run according to best practice outlined in this guide.
These results are intended as a convenient reference to help researchers estimate compute time requirements for their proposed research in preparation for applying to HPC resource allocation calls.
[Table showing ns/day and walltime hours/ns with good performance (PME-tuned, dlb on, etc.) for above benchmarks on these machines, i.e. summary of good achievable building on above results to optimise]