Performance Cookbook

The performance cookbook part of the GROMACS best practice guide assumes your simulations are prepared appropriately and provides concrete guidance on how best to run GROMACS simulations, i.e. execute mdrun, so as to make good use of available hardware and obtain results in the shortest time possible, whether on a laptop, a multi-GPU desktop workstation, a departmental cluster, or, especially, a large supercomputer. This complements, and provides a bridge into navigating, the detailed information provided in the “getting good performance from mdrun” page of the GROMACS manual:

http://manual.gromacs.org/current/user-guide/mdrun-performance.html

GROMACS can generally be launched without specifying anything other than the essential simulation parameters, as it has built-in heuristics that enable it to detect the underlying hardware and, using accumulated insights about good performance embedded in the code, make usually reasonable choices given the available number and types of CPU cores and/or GPUs. By default GROMACS also adapts dynamically during execution to improve performance. However, for any given simulation it is possible that better-than-default choices exist. Understanding how to control these by explicitly specifying parallel execution options, and how best to approach obtaining optimal use of available hardware, can make a significant difference to throughput and hence to the scientific results achieved over a given timespan, as well as to financial (cost) and environmental (energy usage) efficiency. When using high-performance computing resources there is additionally clear value in knowing what scale of resources (number of cores / nodes / GPUs) is efficient to use, or how to go about finding this out.

As well as general guidance applicable to whatever machine you may be running GROMACS on, the performance cookbook provides concrete examples showing how to obtain good performance on a number of specific PRACE and EuroHPC machines, both as an illustration of the application of general best practice process for obtaining good performance, and to promote efficient usage of the named machines. The cookbook also provides a reference set of (near) optimal benchmark performance results obtained on these machines using best practice in order to aid estimation of required compute time allocations for researchers requesting such time.

General guidance on running mdrun and strategy for getting good performance

Single node

When using GROMACS on a single node, that is to say a machine such as a laptop, desktop workstation, or server, where all processor cores or GPUs have access to a single shared memory, we typically run the thread-mpi version of mdrun (gmx mdrun).

CPU only

Running GROMACS with the default of one thread-MPI rank per core on a single-node CPU-only machine is often optimal; however, it is worth experimenting with different combinations of the number of thread-MPI ranks (controlled by varying -ntmpi N) and the number of OpenMP threads per rank (controlled by varying -ntomp M and setting the environment variable OMP_NUM_THREADS=M), such that M x N equals the total number of CPU cores available on the node. Many processors support simultaneous multithreading (SMT), known as hyperthreading on Intel processors, whereby each physical core can efficiently execute multiple threads or processes. Enabling multithreading may boost performance. To make use of SMT, -ntmpi N and -ntomp M should be chosen such that M x N equals the number of logical cores identified by the operating system, which is the number of physical processor cores multiplied by the number of simultaneous multithreads supported by each physical core.
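
As a minimal sketch (the 16-core node and the file name topol.tpr are assumptions; substitute your own hardware and run input file), the following commands each use all cores of a 16-core node with different rank/thread decompositions:

# 16 thread-MPI ranks x 1 OpenMP thread per rank (typical default)
gmx mdrun -ntmpi 16 -ntomp 1 -s topol.tpr

# 4 thread-MPI ranks x 4 OpenMP threads per rank
OMP_NUM_THREADS=4 gmx mdrun -ntmpi 4 -ntomp 4 -s topol.tpr

# with 2-way SMT: 16 ranks x 2 threads = 32 logical cores
OMP_NUM_THREADS=2 gmx mdrun -ntmpi 16 -ntomp 2 -s topol.tpr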

When executing on more than about 12 ranks, GROMACS by default dedicates a certain fraction of the ranks to PME calculations. The number of PME ranks is based on various heuristics that reflect what the developers expect to be optimal based on the underlying algorithms and what has often been found to give good performance. During execution GROMACS attempts to reduce the load imbalance between PP and PME ranks, unless this functionality is disabled by choosing -tunepme no. Even when enabled, however, this dynamic PME tuning does not change the number of PME ranks from what GROMACS decides according to its built-in heuristics. It is possible to explicitly specify the number of PME ranks with the -npme option. You should examine the md.log file produced and check whether there is any warning suggesting you try a larger or smaller number of PME ranks for your simulation. A performance-optimal number of PME ranks for a given total number of ranks can be determined more systematically using the gmx tune_pme tool, as described in gmx tune_pme.
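
For example, to override the heuristic and dedicate 4 of 16 thread-MPI ranks to PME (illustrative numbers only, not a recommendation):

gmx mdrun -ntmpi 16 -npme 4 -s topol.tpr

# afterwards, check the log for notes about PP/PME load balance
grep -i "note" md.log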

For details, see Running mdrun within a single node and Process(-or) level parallelization via OpenMP

CPU + GPU

By default GROMACS launches one thread-MPI rank per GPU. It is worth experimenting with running more thread-MPI ranks, up to one rank per CPU core. Whatever the number of ranks, choose -ntomp so that all available CPU cores are used, and apply the same considerations for adjusting -ntomp on hyperthreading/SMT-capable processors as described above for CPU-only execution.
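
As a sketch for a hypothetical workstation with 2 GPUs and 16 CPU cores (assumed numbers; adapt to your own hardware and input file):

# default-style decomposition: 1 thread-MPI rank per GPU
gmx mdrun -ntmpi 2 -ntomp 8 -s topol.tpr

# more ranks sharing the 2 GPUs: 4 ranks x 4 OpenMP threads
gmx mdrun -ntmpi 4 -ntomp 4 -s topol.tpr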

GPU offload options

By default GROMACS offloads short-range non-bonded force calculations (-nb gpu), PME calculations (-pme gpu) and, on NVIDIA GPUs, bonded force calculations (-bonded gpu). For GROMACS 2022 PME offload to GPU can only be done by a single rank (i.e. a single GPU). PME offload is also subject to a number of further known limitations.

It is worth experimenting with offload options, taking into account the performance considerations for GPU tasks and your hardware, in particular the number, respective age and computational power of your processor(s) and GPU(s). For example, FFTs are offloaded by default if PME is offloaded (which also happens by default), but this can be avoided with -pmefft cpu, which may be beneficial for an older GPU sitting alongside a newer CPU.

Constraint calculations and coordinate updates default to the CPU but can be offloaded with -update gpu, though only to NVIDIA GPUs, only if GROMACS executes on a single rank, and (for GROMACS 2020) subject to further limitations (no free energy, no virtual sites, no Ewald surface correction, no replica exchange, no constraint pulling, no orientation restraints and no computational electrophysiology).
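
The sketch below uses standard mdrun offload flags, but the combinations shown are illustrative rather than recommendations; they assume a single-rank run on a node with one NVIDIA GPU:

# make the default offloading explicit: non-bonded, PME and bonded on GPU
gmx mdrun -ntmpi 1 -nb gpu -pme gpu -bonded gpu -s topol.tpr

# keep the FFT part of PME on the CPU (can help an older GPU paired with a newer CPU)
gmx mdrun -ntmpi 1 -nb gpu -pme gpu -pmefft cpu -s topol.tpr

# additionally offload constraints and coordinate updates (subject to the limitations above)
gmx mdrun -ntmpi 1 -nb gpu -pme gpu -bonded gpu -update gpu -s topol.tpr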

For details, see Running mdrun within a single node, Node level parallelization via GPU offloading and thread-MPI, and Running mdrun with GPUs

Multiple networked compute nodes in a cluster or supercomputer

In order to run GROMACS on a multi-node distributed-memory machine such as a supercomputer we need to run the MPI-enabled version using gmx_mpi mdrun (or simply mdrun_mpi), rather than gmx mdrun (or simply mdrun). In addition, GROMACS should be launched with a parallel application launcher (mpirun, mpiexec, srun, or aprun), which sets the number of MPI ranks N:

mpirun -np N gmx_mpi mdrun <mdrun options>

CPU only

As for single-node simulations, running with 1 rank per core and 1 OpenMP thread per rank and therefore with as many MPI ranks per node as there are cores on each node is often optimal. It is however worth experimenting with multiple OpenMP threads per rank, especially as this often helps retain higher performance when scaling to more nodes.

Runs using multiple nodes are very likely to cause GROMACS to automatically spawn dedicated PME ranks, for which you are advised to follow the guidance already given above for the single-node case. Be aware that if the number of PP ranks resulting from the combined choice of total number of MPI ranks and number of PME ranks has a largest prime factor that GROMACS considers too large to give a good domain decomposition, it will throw a fatal error and abort execution. In these cases it may be worth adjusting the total number of ranks and/or the number of PME ranks to obtain better performance, using OpenMP threading where necessary to ensure all cores on each node are utilised; overall performance may even be higher with an optimal choice of number of PME ranks if this leaves a small number of cores on each node unused. As explained for the single-node case, a performance-optimal number of PME ranks for a given total number of ranks can be determined in a systematic way using the gmx tune_pme tool, as described in gmx tune_pme.

The OpenMP threading for PME ranks can be chosen to be different from that for standard (i.e. PP) ranks using the -ntomp_pme option, providing added flexibility to help utilise all cores on each node.
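
As a sketch for a hypothetical 4-node run on 128-core nodes (the rank, thread and PME counts are illustrative and should be tuned, e.g. with gmx tune_pme):

# 512 MPI ranks x 1 OpenMP thread, 64 of the ranks dedicated to PME
mpirun -np 512 gmx_mpi mdrun -npme 64 -ntomp 1 -s topol.tpr

# 128 MPI ranks x 4 OpenMP threads; -ntomp_pme lets PME ranks use a
# different thread count from PP ranks if that helps fill the cores
export OMP_NUM_THREADS=4
mpirun -np 128 gmx_mpi mdrun -npme 16 -ntomp 4 -ntomp_pme 2 -s topol.tpr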

CPU + GPU

Broadly the same considerations as for single-node use apply with regards to determining an optimal choice of number of MPI ranks x OpenMP threads and GPU offloading options.

Offloading PME calculations to GPU can only take place on a single rank using a single GPU (GROMACS 2022). The performance advantage this typically gives over CPU-based PME computation on a single node or a small number of nodes can therefore start to diminish when running on larger numbers of nodes, as the aggregate compute power of an increasing number of CPU-based PME ranks grows. One therefore expects a crossover point beyond which it is faster, on a given number of nodes, to run multiple dedicated PME ranks on CPU cores - e.g. a number chosen optimally using gmx tune_pme - instead of offloading all PME calculations to a single GPU. The node count at which this occurs will depend on factors such as the system size, the chosen MD parameters, and the relative performance of the GPUs and CPUs available in a machine.
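
A minimal sketch of the two regimes (the launcher, rank counts and file names are placeholders):

# few nodes: offload PME to a single GPU rank
mpirun -np 8 gmx_mpi mdrun -nb gpu -pme gpu -npme 1 -s topol.tpr

# many nodes: keep PME on dedicated CPU ranks instead
mpirun -np 256 gmx_mpi mdrun -nb gpu -pme cpu -npme 32 -s topol.tpr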

General guidance for benchmarking

Before beginning expensive mdrun simulations you should benchmark your system to ensure you are using the optimal amount of HPC resources. Usually this means deciding how many nodes/CPUs/GPUs to use.

To do this, take your system, run it for a short time (10,000 steps should be sufficient), and repeat with increasing numbers of CPUs. You then look at the performance figure in ns/day reported by mdrun. It is helpful to plot a graph of CPU count vs performance. Examples of such plots are shown in the following section for different PRACE/EuroHPC machines.

You will see that for increasing CPU count you get diminishing returns on performance gains. It is up to you to choose a core count that balances the time you have to wait for the simulation to finish, and the cost of running the simulations. Often the appropriate number is the point on the performance curve before it starts to plateau.

Some mdrun flags that can help with benchmarking are listed below; they are combined in the example benchmark loop after the list:

  • -dlb yes turns on dynamic load balancing, which shifts particles between MPI ranks to optimize performance. This can interfere with the tunepme setting, which optimizes various aspects of the PME and DD algorithms by shifting load between ranks.

  • -notunepme turns off PME load balancing because it can interfere with the dlb yes setting. The PME settings can be tuned separately using gmx tune_pme.

  • -noconfout suppresses writing of the final output conformation, which is not needed for benchmarking.

  • -resethway resets the performance timers halfway through the run, which removes the overhead of initialization and load balancing from the reported timings.

  • -nsteps 10000 forces mdrun to run for only 10,000 steps, which lets you run short benchmarks using your production-ready tpr file.
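
A minimal benchmarking sketch combining these flags (the launcher, core counts, directory layout and file names are placeholders to adapt to your own system and scheduler):

#!/bin/bash
# run a short benchmark at several core counts and collect the ns/day figure
for n in 32 64 128 256; do
    mkdir -p cores_${n}
    cd cores_${n}
    mpirun -np ${n} gmx_mpi mdrun -s ../topol.tpr \
           -nsteps 10000 -resethway -noconfout -dlb yes -notunepme
    # mdrun reports ns/day and hours/ns on the "Performance:" line of md.log
    grep "Performance:" md.log
    cd ..
done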

Things to investigate:

  • Some systems will have different versions of GROMACS built with different compilers; these often have differing performance.

  • Some systems will have the ability to turn on Simultaneous Multi-Threading (SMT); this may provide a performance boost for GROMACS.

  • For large systems using many nodes, hybrid OpenMP/MPI can offer increased performance over pure MPI.

GPU specific notes:

  • For GPU systems you can try different GPU offload scenarios.

  • For GPU systems it is generally best to have 1 MPI rank or 1 thread-MPI rank per GPU, but this may not always be the case.

  • For GPU systems the parallel efficiency beyond 1 GPU is often poor; you will not see the same strong scaling as on CPU-only systems. However, the performance on 1 GPU should be greater than on an equivalent CPU node.

Getting good GROMACS performance on PRACE/EuroHPC machines

This section provides guidance and concrete recipes showing how to build and run GROMACS for good performance on some of the largest EU-based supercomputers available to EU researchers through PRACE and EuroHPC.

General guidance for each machine is complemented by an analysis illustrating the effect of key runtime execution choices on mdrun performance for a range of benchmark simulations. Comparison between these and your own simulations should help you determine how best to run your own simulations on these machines and how to obtain good performance.

Benchmarks

A brief description is provided below of the benchmarks used to illustrate how to obtain good performance on PRACE/EuroHPC machines for a range of system sizes and types.

Benchmarks prefixed with “bench” are available from the Dept. of Theoretical and Computational Biophysics at the Max Planck Institute for Biophysical Chemistry, Göttingen: https://www.mpibpc.mpg.de/grubmueller/bench

Benchmarks suffixed with “_HBS” are available from the UK’s HECBioSim consortium of computational biomolecular researchers: https://www.hecbiosim.ac.uk/benchmarks

  • 20k_HBS:
    • 3NIR Crambin

    • Total number of atoms: 19,605

    • Protein atoms: 642

    • Water atoms: 18,963

    • Input parameters: 20k_HBS.mdp

  • benchMEM:
    • Protein in membrane, surrounded by water

    • Total number of atoms: 82k

    • Input parameters: benchMEM.mdp

  • 465k_HBS:
    • hEGFR Dimer of 1IVO and 1NQL

    • Total number of atoms: 465,399

    • Protein atoms: 21,749

    • Lipid atoms: 134,268

    • Water atoms: 309,087

    • Ions: 295

    • Input parameters: 465k_HBS.mdp

  • benchRIB:
    • Ribosome in water

    • Total number of atoms: 2M

    • Input parameters: benchRIB.mdp

  • benchPEP:
    • Peptides in water

    • Total number of atoms: 12M

    • Input parameters: benchPEP.mdp

GROMACS performance on HAWK (HLRS, Germany)

https://www.hlrs.de/systems/hpe-apollo-hawk/

HAWK is listed as number 16 on the Top500 (November 2020), the 6th largest European HPC system, and is accessible through PRACE access mechanisms.

Hardware

Each node has:

  • Processors: 2 x 64-core AMD EPYC 7742 @ 2.25 GHz

  • Memory: 256GB RAM

  • Interconnect: InfiniBand HDR200

Software

Relevant software stack on system (available to all users via environment modules):

  • HPE MPT and OpenMPI MPI libraries

  • FFTW (Zen2 architecture-specific build)

Build

A multinode-capable MPI-enabled version of GROMACS with good performance on HAWK can be built as follows:

module load fftw

cd gromacs-2020.2
mkdir build
cd build

cmake .. -DCMAKE_INSTALL_PREFIX=${HOME}/gromacs/2020.2 -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx \
         -DGMX_MPI=on -DGMX_SIMD=AVX2_256 -DGMX_GPU=off -DGMX_BUILD_SHARED_EXE=off -DBUILD_SHARED_LIBS=off \
         -DGMX_FFT_LIBRARY=fftw3 -DCMAKE_PREFIX_PATH=${FFTW_ROOT}

make -j 64
make install

Run

The example job script below shows how to run GROMACS on HAWK for 1 hour on 8 nodes with 128 MPI ranks per node and 2 OpenMP threads per rank. Each physical core on the two 64-core AMD EPYC processors on HAWK supports two simultaneous multithreads (SMT) - two logical cores - providing a total of 256 usable logical cores per node. The example launches a total of 256 threads per node across its 128 ranks, meaning that all logical cores are used. The omplace -ht compact option should be used when running GROMACS with simultaneous multithreading, as it ensures that similarly numbered MPI ranks, as well as OpenMP threads belonging to the same MPI rank, are executed as close together in the processor’s (and indeed the node’s) memory hierarchy as possible - in the example, both OpenMP threads of a rank run on the same physical core. Not using the compact omplace option was found to lead to lower performance when using SMT.

#!/bin/bash -l
#PBS -N benchRIB
#PBS -l select=8:mpiprocs=128:ompthreads=2
#PBS -l walltime=01:00:00

# Change to the directory that the job was submitted from
cd $PBS_O_WORKDIR

export OMP_NUM_THREADS=2

mpirun -np 1024 omplace -nt $OMP_NUM_THREADS -ht compact gmx_mpi mdrun -s benchRIB.tpr -ntomp $OMP_NUM_THREADS

MPI x OpenMP hybrid parallel execution and simultaneous multithreading

In order to better understand how GROMACS utilises the available hardware on HAWK and how to get good performance, we can examine the effect on benchmark performance of the choice of the number of MPI ranks per node and OpenMP threads per rank, and the use of SMT, for a given number of HAWK nodes. Doing this systematically is facilitated in the first instance by disabling dynamic load balancing (-dlb no) and PME tuning (-tunepme no), which also allows us to illustrate the strength of load imbalance in different execution scenarios.

The figures below show how benchmark performance for GROMACS 2020.2 scales with increasing node count on HAWK for different combinations of MPI ranks and OpenMP threads per rank. Results using simultaneous multithreading (SMT) are not shown as these follow similar trends but broadly speaking yield slightly lower performance on HAWK for the benchmarks examined, even with compact placement assigned using omplace.

It is clear that the choice of MPI x OpenMP hybrid decomposition has a very significant effect on performance. As a general rule on HAWK, performance is better with fewer (typically 1, 2 or 4) OpenMP threads per MPI rank, and hence more MPI ranks per node. Using more than 1 (but no more than 4) OpenMP threads per rank may enable better performance to be achieved, especially on larger numbers of nodes. OpenMP multithreading also allows runs on larger numbers of nodes to proceed without the fatal error caused by a number of PP ranks whose largest prime factor is too large, simply by reducing the total number of MPI ranks.

../_images/20k_HBS.svg

20k_HBS

../_images/benchMEM.svg

benchMEM

../_images/465k_HBS.svg

465k_HBS

../_images/benchRIB.svg

benchRIB

../_images/benchPEP.svg

benchPEP

Tuning the number of PME ranks

As mentioned in the manual, the domain decomposition (DD) load balancing functionality, which is enabled by default and disabled with -dlb no, is important for achieving good performance for spatially heterogeneous systems. However, PME and DD load balancing can interfere with each other. To improve on the unbalanced performance shown in the benchmark figures above, a systematic approach can therefore be taken by separately tuning the number of PME ranks for a given total number of ranks using the gmx tune_pme tool, as described in the section on tune_pme in the manual.

On HAWK, we could tune the number of PME ranks to improve the performance of, for example, the benchRIB benchmark running on 16 nodes with 32 MPI ranks per node (512 ranks in total) and 4 OpenMP threads per rank, which the results above show is already a good choice in terms of unbalanced performance and which has scope to improve significantly through reduction of the observed load imbalance. The following script runs tune_pme on HAWK to do this and thereby determine an optimal choice for -npme:

#!/bin/bash -l
#PBS -N tunepme
#PBS -l select=16:mpiprocs=32:ompthreads=4
#PBS -l walltime=01:00:00

# Change to the directory that the job was submitted from
cd $PBS_O_WORKDIR

export OMP_NUM_THREADS=4

export PATH=$HOME/gromacs/2020.2/mpi/AVX2_256/bin:$PATH

gmx_mpi tune_pme -np 512 -mdrun "omplace -nt ${OMP_NUM_THREADS} gmx_mpi mdrun" -s benchRIB.tpr -ntomp $OMP_NUM_THREADS -dlb no

GROMACS performance on Discoverer (PetaSC Bulgaria)

https://sofiatech.bg/en/petascale-supercomputer/

Hardware

Each node has:
  • Processors: 2 x 64-core AMD EPYC 7H12 @ 2.6GHz

  • Memory: 256 GB RAM

  • Interconnect: Infiniband HDR

Software

Relevant software stack on system (available to all users via environment modules):
  • GCC, AOCC, and Intel oneAPI compilers

  • MPICH, OpenMPI and Intel MPI libraries

Build

A multinode-capable MPI-enabled version of GROMACS using OpenMPI and GCC with good performance on Discoverer can be built as follows:

#!/bin/bash

# download the source code
wget https://ftp.gromacs.org/gromacs/gromacs-2021.4.tar.gz

# extract
tar xvf gromacs-2021.4.tar.gz

# move into the source code directory
cd gromacs-2021.4

# make the build directory
mkdir build
cd build

# load the required modules
module purge
module load cmake
module load fftw/3/3.3.10-gcc-openmpi
module load openmpi/4/gcc/latest
module load openblas/latest-gcc

# cmake configuration
cmake .. -DCMAKE_INSTALL_PREFIX=$(pwd)/.. -DGMX_MPI=ON -DGMX_OPENMP=ON \
    -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx \
    -DREGRESSIONTEST_DOWNLOAD=ON \
    -DGMX_FFT_LIBRARY=fftw3 -DFFTWF_LIBRARY=/opt/software/fftw/3/3.3.10-gcc-openmpi/lib/libfftw3f.so -DFFTWF_INCLUDE_DIR=/opt/software/fftw/3/3.3.10-gcc-openmpi/include

# build
make -j32

# Optional: test the build

make -j32 check

# install
make install

Run

An example job script to run mdrun on Discoverer using 2 nodes for 1 hour with 128 MPI tasks per node is shown below. Each node on Discoverer has two 64-core AMD EPYC processors, giving a total of 128 physical cores per node. The CPUs support Simultaneous Multi-Threading (SMT), which means each physical core has two logical cores, so each node appears to have 256 cores to most applications. We have found that SMT offers a small performance benefit and thus recommend keeping it turned on. For most jobs this means each node should have 128 MPI tasks (1 per physical core) and 2 OpenMP threads per MPI task (1 per logical core, 2 logical cores per physical core).

For consistent benchmarking we use the additional arguments explained in General guidance for benchmarking.

#!/bin/bash
#SBATCH --partition=cn
#SBATCH --job-name=gmx_mdrun
#SBATCH --time=01:00:00
#SBATCH --nodes           2   # 2 Nodes
#SBATCH --ntasks-per-node 128 # 1 MPI task per physical core
#SBATCH --cpus-per-task   2   # 2 OpenMP threads per MPI task (SMT in use)

module purge

# Load the most recent OpenMPI+GCC build
module load gromacs/2021/latest-intel-nogpu-openmpi-gcc


export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}


# standard run command for mdrun
mpirun gmx_mpi mdrun -ntomp ${SLURM_CPUS_PER_TASK} -v -s input.tpr


# command for consistent benchmarking performance
#mpirun gmx_mpi mdrun -ntomp ${SLURM_CPUS_PER_TASK} -v -s input.tpr -resethway -dlb yes -notunepme -noconfout

Comparison of different builds

There are centrally installed versions of GROMACS that can be loaded with module load commands:

  • module load gromacs/2021/latest-intel-nogpu-mpi GCC compilers + MPICH MPI.

  • module load gromacs/2021/latest-intel-nogpu-openmpi-gcc GCC compilers + OpenMPI MPI.

  • module load gromacs/2021/latest-intel-nogpu-openmpi-aocc AOCC compilers + OpenMPI MPI.

  • module load gromacs/2021/latest-oneapi-nogpu-mpi Intel compilers + Intel MPI.

The figures below show the performance for the 4 different builds for the benchmark suite.

../_images/bench20k_HBS.svg

20k_HBS – 128 MPI x 2 OpenMP

../_images/benchMEM1.svg

benchMEM – 128 MPI x 2 OpenMP

../_images/bench465k_HBS.svg

465k_HBS – 128 MPI x 2 OpenMP

../_images/benchRIB1.svg

benchRIB – 32 MPI x 8 OpenMP

../_images/benchPEP1.svg

benchPEP – 128 MPI x 2 OpenMP

All benchmarks were run using 128 MPI tasks per node and 2 OpenMP threads per MPI task (SMT on), with the exception of benchRIB, where 32 MPI tasks x 8 OpenMP threads were used. This was done because the error:

Fatal error (abort):

There is no domain decomposition for N ranks that is compatible with the
given box and a minimum cell size of x nm
Change the number of ranks or mdrun option -rdd or -dds
Look in the log file for details on the domain decomposition

is encountered when 1028 total MPI tasks are used; hybrid MPI/OpenMP parallelization must therefore be used to scale up to 16 nodes.

The results show that for the smaller systems best performance is obtained with OpenMPI and GCC; if hybrid MPI and OpenMP is needed then MPICH gives the best performance; and for the largest system the OpenMPI and AOCC build gives the best performance.

GROMACS performance on Piz Daint (CSCS, Switzerland)

Hardware

Focus on the XC50 GPU partition

Each node:

  • Processor: 1 x 12-core Intel Xeon E5-2690 v3 @ 2.60GHz (one socket)

  • Memory: 64GB RAM

  • GPU: 1 x NVidia P100

  • CRAY Aries interconnect

Software

Relevant software stack on system (available to all users via environment modules):

  • Cray MPICH MPI library

  • Cray-optimised FFTW

  • Cray-libsci provides BLAS & LAPACK

  • craype-accel-nvidia60 targets the correct SM architecture to compile for the P100 GPU

Build

Build instructions for Piz Daint’s XC50 GPU partition:

module load daint-gpu
module swap PrgEnv-cray PrgEnv-gnu
module load cray-fftw
module load craype-accel-nvidia60

cd gromacs-2020.2
mkdir build
cd build

cmake .. -DCMAKE_INSTALL_PREFIX=${HOME}/gromacs/2020.2 -DCMAKE_C_COMPILER=cc -DCMAKE_CXX_COMPILER=CC \
         -DGMX_MPI=on -DGMX_GPU=on -DGMX_SIMD=AVX2_256 -DGMX_FFT_LIBRARY=fftw3 -DGMX_HWLOC=on

make -j 12
make install

Run

Example job script to run mdrun on Piz Daint’s XC50 GPU partition for 1 hour on 4 nodes, with 1 MPI rank per node and 12 OpenMP threads per rank, without hyperthreading:

#!/bin/bash -l
#SBATCH --job-name=benchmark
#SBATCH --time=01:00:00
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=12
#SBATCH --ntasks-per-core=1     # 1 = no hyperthreading, 2 = with hyperthreading
#SBATCH --hint=nomultithread    # nomultithread = no hyperthreading, multithread = hyperthreading
#SBATCH --partition=normal
#SBATCH --constraint=gpu

module swap PrgEnv-cray PrgEnv-gnu
module load daint-gpu
module load cray-fftw
module load craype-accel-nvidia60

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export CRAY_CUDA_MPS=1

export PATH=${HOME}/gromacs/2020.2/bin:$PATH

srun gmx_mpi mdrun -s benchmark.tpr -ntomp ${OMP_NUM_THREADS}

GPU offload scenarios, hybrid MPI x OpenMP parallel execution and simultaneous multithreading

In order to better understand how GROMACS utilises the available hardware on Piz Daint and how to get good performance we can examine the effect on benchmark performance of the choice of which calculations to offload to GPU, the number of MPI ranks per node and OpenMP threads per rank, and use of simultaneous multithreading. Doing this systematically is facilitated in the first instance by disabling dynamic load balancing (-dlb no) and PME tuning (-tunepme no), which also allows us to illustrate the strength of load imbalance in different execution scenarios.

The figures below show how benchmark performance for GROMACS 2020.2 scales with increasing node count on Piz Daint for different GPU offload scenarios, different combinations of MPI ranks and OpenMP threads, and with and without use of multithreading (SMT). Only the most relevant, i.e. some of the best-performing, execution scenarios are shown for each benchmark. Offloading of coordinate updating is not included as this is only relevant for single-node runs using a single rank (GROMACS 2020), where it is however worth using to obtain good performance.

As argued in the general guidance above for GPU offloading, there is a crossover point beyond which overall performance is higher using dedicated PME ranks running on CPU cores rather than offloading PME computation to a single GPU. For smaller systems this point is more likely to lie beyond the number of nodes that it is reasonable to use, i.e. past the point of badly diminishing returns in performance gained from employing additional nodes (low parallel efficiency), meaning that for all runs with reasonable parallel efficiency GPU offloading of PME gives the highest performance. For larger systems, however, the crossover point lies before the limit of good scaling, meaning that whereas best performance on small numbers of nodes is obtained with PME offloading, to scale well to larger numbers of nodes this should be disabled and only nonbonded interactions should be offloaded to GPU.

../_images/20k_HBS1.svg

20k_HBS - 2 mpi x 6 omp (SMT off)

../_images/benchMEM2.svg

benchMEM - 4 mpi x 6 omp (SMT on)

../_images/465k_HBS1.svg

465k_HBS - 8 mpi x 3 omp (SMT on)

../_images/benchRIB2.svg

benchRIB - 4 mpi x 6 omp (SMT on)

../_images/benchPEP2.svg

benchPEP - 4 mpi x 6 omp (SMT on)

GROMACS performance on Juwels Booster module (JSC, Germany)

https://apps.fz-juelich.de/jsc/hps/juwels/booster-overview.html

Hardware

Focus on Booster module

Each node has:
  • Processor: 2x24 core AMD EPYC 7402 (48 physical cores total, 96 logical cores with SMT)

  • Memory: 512 GB RAM

  • GPU: 4x NVIDIA A100, 40 GB

  • Interconnect: Mellanox HDR200 InfiniBand

Software

Relevant software stack:
  • Compilers: GCC, Intel, NVHPC

  • MPI runtimes: OpenMPI, ParaStationMPI

  • CUDA

Build

Instructions to build a GPU version of GROMACS 2022 using CUDA-aware MPI with GCC and OpenMPI:

#!/bin/bash

# Download the source code
wget https://ftp.gromacs.org/gromacs/gromacs-2022.tar.gz

# extract
tar xvf gromacs-2022.tar.gz

# move into the source directory
cd gromacs-2022

# make the build directory
mkdir build
cd build


# load the required modules
module load Stages/2022
module load GCC/11.2.0
module load CUDA/11.5
module load OpenMPI/4.1.2
module load CMake/3.21.1
module load FFTW/3.3.10

cmake .. -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx -DGMX_FFT_LIBRARY=fftw3 -DGMX_GPU=CUDA -DGMX_MPI=ON -DGMX_OPENMP=ON -DCMAKE_INSTALL_PREFIX=$(pwd)/.. -DGMX_HWLOC=ON

make -j20

make install

Instructions to build a GPU version of GROMACS 2022 without MPI using GCC:

#!/bin/bash

# Download the source code
wget https://ftp.gromacs.org/gromacs/gromacs-2022.tar.gz

# extract
tar xvf gromacs-2022.tar.gz

# move into the source directory
cd gromacs-2022

# make the build directory
mkdir build
cd build


# load the required modules
module load GCC/11.2.0
module load CUDA/11.5
module load CMake/3.21.1


module list

cmake .. -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DGMX_BUILD_OWN_FFTW=yes -DGMX_GPU=CUDA -DGMX_MPI=OFF -DGMX_OPENMP=ON -DCMAKE_INSTALL_PREFIX=$(pwd)/.. -DGMX_HWLOC=ON

make -j20

make install

Run

Batch script to run on one node using all 4 GPUs; with GPU offloading for nonbonded, pme, and bonded; and GPU direct communication enabled.

#!/bin/bash
#SBATCH --account=<ACCOUNT>
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --hint=nomultithread # turn off SMT
#SBATCH --ntasks-per-node=4  # 1 MPI task per GPU
#SBATCH --cpus-per-task=12
#SBATCH --time=01:00:00
#SBATCH --partition=booster
#SBATCH --gres=gpu:4        # Use all 4 GPUs


# load the required modules
module load GCC/11.2.0
module load CUDA/11.5
module load OpenMPI/4.1.2
module load FFTW/3.3.10


# Allow GROMACS to see all 4 GPUs
export CUDA_VISIBLE_DEVICES=0,1,2,3

# Enable direct GPU to GPU communications
export GMX_ENABLE_DIRECT_GPU_COMM=true

# activate user install of GROMACS 2022
source <user install location>/gromacs-2022/bin/GMXRC

srun gmx_mpi mdrun -s INPUT.tpr -ntomp ${SLURM_CPUS_PER_TASK} -nb gpu -pme gpu -bonded gpu -npme 1

Benchmark results

We investigated how different GPU offloading scenarios and other performance settings affect the strong scaling of our benchmark suite.

../_images/benchHBS20k.svg

20k_HBS

../_images/benchMEM3.svg

benchMEM

../_images/benchHBS465k.svg

465k_HBS

../_images/benchRIB3.svg

benchRIB

../_images/benchPEP3.svg

benchPEP

All simulations were run using 1 MPI task per GPU, 12 OpenMP threads per MPI task, and SMT turned off.

The different options are as follows:
  • nb - only non-bonded interactions are offloaded to the GPU.

  • nb, pme, bonded - all interactions (non-bonded, pme, and bonded) are offloaded to GPU.

  • direct GPU comms - the GMX_ENABLE_DIRECT_GPU_COMM flag is set. This is a non-default feature that enables communication between GPUs to be done directly between the GPU memory spaces rather than being routed via the CPUs. This can offer performance benefits. It is only possible when using GROMACS’ internal thread-MPI or if GROMACS was built with a CUDA-aware MPI library.

  • cpu_bind - explicit CPU bindings are used with the srun command. The binding command used is srun --cpu_bind=mask_cpu:0xFFF000,0xFFF,0xFFF000000000,0xFFF000000 gmx_mpi mdrun. This is due to the specific hardware layout of Juwels Booster.

  • update - the -update gpu option is used. This can only be used when the constraints setting is changed from constraints=all-bonds to constraints=h-bonds. (We can only do this for the benchPEP system, where we use the benchPEP-h.tpr from the previously referenced webpage.) A sketch combining these options follows the list.
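
As an illustration only (not a verified recipe), the options above could be combined on a single Juwels Booster node roughly as follows, reusing the #SBATCH header and module environment of the earlier 4-GPU job script:

# enable direct GPU-to-GPU communication (non-default; check output carefully)
export GMX_ENABLE_DIRECT_GPU_COMM=true

# explicit CPU binding matching the Juwels Booster GPU/NUMA layout, with full
# offloading including coordinate update (requires constraints=h-bonds)
srun --cpu_bind=mask_cpu:0xFFF000,0xFFF,0xFFF000000000,0xFFF000000 \
     gmx_mpi mdrun -s INPUT.tpr -ntomp ${SLURM_CPUS_PER_TASK} \
     -nb gpu -pme gpu -bonded gpu -update gpu -npme 1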

We see that the best performance is obtained when as much as possible is offloaded to the GPU. The explicit CPU binding settings slightly increase performance. The largest additional performance increase occurs when the -update gpu option can be set, which results in as much of the calculation as possible being run on the GPU. The direct GPU communication setting also increases performance; however, this feature is still in development and sometimes gives errors, so output should be checked before relying on it. The parallel scaling is generally poor when more than 1 node is used. This is because GROMACS 2022 can only have 1 PME GPU rank: on 1 Juwels Booster node there are 4 GPUs, so 1 of 4 GPUs is used for PME calculations, which is close to ideal for typical GROMACS simulations. For 2 nodes only 1 of 8 GPUs is used for PME calculations, which is under-balanced relative to the other 7 GPUs doing PP calculations; these have to wait for the PME calculations, resulting in poorer parallel performance.

Multiple simulations per Node/ per GPU

Only large systems can fully make use of an entire Juwels Booster node (4 GPUs). For smaller systems it can be beneficial to run multiple simulations per node, or even per GPU, to get higher throughput. On Juwels Booster you are always charged for a full node.

We have found that you must use the non-MPI version of GROMACS for this, because it can be launched without srun and so there is more flexibility in sharing the GPU resources.

Suppose you have a set of 8 folders labelled 1 2 3 4 5 6 7 8. You can run 8 simultaneous simulations, each using 6 threads and each sharing a GPU with one other simulation. The script below will launch such a set of simulations:

#!/bin/bash
#SBATCH --account=<ACCOUNT>
#SBATCH --nodes=1
#SBATCH --ntasks=8           # number of simulations
#SBATCH --hint=nomultithread
#SBATCH --ntasks-per-node=8  # number of simulations
#SBATCH --cpus-per-task=6    # threads per simulation
#SBATCH --time=01:00:00
#SBATCH --partition=booster
#SBATCH --gres=gpu:4         # use all 4 GPUs
#SBATCH --cuda-mps           # turn on CUDA MPS to enable multiple processes per GPU




# load the required modules
module load GCC/11.2.0
module load CUDA/11.5

module list

# Allow GROMACS to see all 4 GPUs
export CUDA_VISIBLE_DEVICES=0,1,2,3


export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}


# make sure we are using a non-mpi version of gromacs
source <user install location>/gromacs-2022_NOMPI/bin/GMXRC

# input file
TPR=input.tpr


EXE="gmx mdrun"  # Non MPI mdrun

# Set thread mpi to 1 with -ntmpi option
SETTINGS="-s ${TPR} -ntmpi 1 -ntomp ${SLURM_CPUS_PER_TASK} -nb gpu -pme gpu -bonded gpu"


# Options below for 4,8,16,24, or 48 simulations per node.
# Uncomment the desired "folder" option and the corresponding "gputasks" option,
# make sure the rest are commented out and ensure the --ntasks,
# --ntasks-per-node, and --cpus-per-task are set to the corresponding values.


# Currently the 8 simulations option is being used.



# 4 simulations: (make sure ntasks and ntasks-per-node=4 and cpus-per-task=12)
#folders=(1 2 3 4)
#gputasks=(11 00 33 22)

# 8 simulations: (make sure ntasks and ntasks-per-node=8 and cpus-per-task=6)
folders=(1 2 3 4 5 6 7 8)
gputasks=(11 11 00 00 33 33 22 22)

# 16 simulations: (make sure ntasks and ntasks-per-node=16 and cpus-per-task=3)
#folders=(1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16)
#gputasks=(11 11 11 11 00 00 00 00 33 33 33 33 22 22 22 22)

# 24 simulations: (make sure ntasks and ntasks-per-node=24 and cpus-per-task=2)
#folders=(1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24)
#gputasks=(11 11 11 11 11 11 00 00 00 00 00 00 33 33 33 33 33 33 22 22 22 22 22 22)

#48 simulations: (make sure ntasks and ntasks-per-node=48 and cpus-per-task=1)
#folders=(1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48)
#gputasks=(11 11 11 11 11 11 11 11 11 11 11 11 00 00 00 00 00 00 00 00 00 00 00 00  33 33 33 33 33 33 33 33 33 33 33 33 22 22 22 22 22 22 22 22 22 22 22 22 )





# make cpu bindings to ensure cores are not shared between the simulations
# this works for any of the simulation number options, does not need to be changed
bindings=($( for (( i=0 ; i<48 ; i+=${SLURM_CPUS_PER_TASK} )); do echo $i-$((${i} + ${SLURM_CPUS_PER_TASK} - 1 )); done ))

# This works in conjunction with the gputasks settings which map the process to GPU close to the
# specified GPU.


# lauch all simulations in the specified folders in parallel
for i in ${!folders[@]}
do
    cd ${folders[$i]}
    pwd
    echo ${bindings[$i]}

    # use numactl to bind the simulations to specific cores.
    numactl --physcpubind=${bindings[$i]} ${EXE} ${SETTINGS} -gputasks ${gputasks[$i]} > out.txt 2>&1 &
    cd ..
done

wait

The script also contains options for 4, 16, 24, or 48 simulations per node.

CUDA MPS is enabled in the script, which allows for more efficient GPU sharing.

The folders array contains 8 folder names. The gputasks array contains the GROMACS GPU ID mapping. There are two numbers per simulation for the two GPU tasks per simulation: PP and PME. These are the same number so that they run on the same GPU. The order is specific to Juwels Booster’s setup.

The bindings array evaluates to: (0-5 6-11 12-17 18-23 24-29 30-35 36-41 42-47)

With the gputasks array: (11 11 00 00 33 33 22 22)

This results in the first simulation using CPUs 0 to 5 and GPU 1. CPUs 0 to 5 are the most closely connected to GPU 1, as per the Juwels Booster hardware information: https://apps.fz-juelich.de/jsc/hps/juwels/booster-overview.html.

The plot below shows the total throughput on one node for multiple simulations. This is for the benchMEM benchmark.

../_images/benchMEM_throughput.svg

BenchMEM – throughput for multiple simulations per Node/ per GPU.

We can see that the highest throughput is achieved when 16 simulations are run per node (4 simulations per GPU). For smaller systems the optimum number of simulations per node will be larger; for larger systems it will be smaller. The CUDA MPS setting significantly increases the throughput, because it allows multiple processes to better share a single GPU.

GROMACS Reference Benchmarks Performance on PRACE/EuroHPC machines

This section provides a reference set of benchmark simulation performance results representative of good obtainable performance on PRACE/EuroHPC machines with GROMACS built and run according to best practice outlined in this guide.

These results are intended as a convenient reference to help researchers estimate compute time requirements for their proposed research in preparation for applying to HPC resource allocation calls.

[Table showing ns/day and walltime hours/ns with good performance (PME-tuned, dlb on, etc.) for above benchmarks on these machines, i.e. summary of good achievable building on above results to optimise]