PRACTICAL: Benchmarking Molecular Dynamics Using GROMACS 2
Overview
Teaching: 10 min
Exercises: 20 min
Questions
How do we run hybrid MPI and OpenMP jobs on MareNostrum?
Does adding OpenMP to MPI improve GROMACS performance?
Objectives
Run a hybrid MPI with OpenMP simulation on MareNostrum.
See how GROMACS performance is changed by including OpenMP.
Aims
GROMACS can run in parallel using MPI and, at the same time, use OpenMP threads within each MPI task (see the GROMACS documentation on getting good performance from mdrun). In this tutorial, you will learn how to run hybrid MPI and OpenMP jobs on MareNostrum, and you will benchmark the performance of the benchMEM system to see whether performance improves when OpenMP threads are used.
Hybrid MPI and OpenMP jobs on MareNostrum
When running hybrid MPI (where the individual tasks are also known as ranks or processes) and OpenMP (multiple threads per task) jobs, you need to leave free cores between the parallel tasks launched using srun for the OpenMP threads that will be associated with each MPI task.

As we saw above, you can use the options to sbatch to control how many parallel tasks are placed on each compute node, and you can use the --cpus-per-task option to set the stride between parallel tasks to the right value to accommodate the OpenMP threads. The value of --cpus-per-task should usually be the same as that of OMP_NUM_THREADS.
As an example, consider the job script below, which runs on a single node with 24 MPI tasks per node and 2 OpenMP threads per MPI task (so all 48 cores are used).
#!/bin/bash
# short debug run: 5 minutes of walltime on the debug QoS
#SBATCH --qos=debug
#SBATCH --time=0:5:0
# 1 node, 24 MPI tasks per node, 2 cores (OpenMP threads) per task
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=24
#SBATCH --cpus-per-task=2

# load the compiler, MPI, MKL and GROMACS modules
module load intel/2020.1
module load impi/2018.4
module load mkl/2020.1
module load gcc/9.2.0
module load gromacs/2021.2

# one OpenMP thread per core allocated to each MPI task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# launch GROMACS across the MPI tasks with srun
srun gmx_mpi mdrun -ntomp ${SLURM_CPUS_PER_TASK} -s benchMEM.tpr
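To try it out, save the script under a name of your choice (hybrid.slurm below is just an illustrative name) and submit it with sbatch; GROMACS writes its timing summary, including the walltime and the performance in ns/day, at the end of md.log.

# submit the job script (hybrid.slurm is a hypothetical filename)
sbatch hybrid.slurm

# check the state of your jobs in the queue
squeue -u $USER

# once the job has finished, look at the timing summary at the end of the log
tail -n 20 md.log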
Each MareNostrum compute node is made up of 2 sockets (NUMA regions), each with 24 cores. Programs where the threads of a single task span both sockets are likely to be less efficient, as the memory accessed by those threads will be spread across both sockets.
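If you want to see how the cores are laid out across the two sockets, or to ask Slurm to keep each task within one socket, something like the following can help. This is only a sketch: numactl may or may not be installed on the compute nodes, the exact effect of --cpu-bind depends on the site's Slurm configuration, and the srun line is a drop-in replacement for the one in the job script above.

# show which CPUs belong to each NUMA node (socket)
lscpu | grep -i numa

# more detailed NUMA layout, if numactl is available
numactl --hardware

# bind each MPI task to a socket so its OpenMP threads cannot
# wander across the NUMA boundary
srun --cpu-bind=sockets gmx_mpi mdrun -ntomp ${SLURM_CPUS_PER_TASK} -s benchMEM.tpr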
Instructions
For this tutorial, you will start by comparing the performance of simulations that use all of the cores on a node. Using the script above as a template, try running simulations with varying numbers of MPI ranks and OpenMP threads, making sure that the number of MPI ranks multiplied by the number of OpenMP threads is always 48 or less; one way to script these submissions is sketched below. How do the simulation times change as you shift the balance from MPI ranks to OpenMP threads? How do these times change if a task's threads are allowed to span both NUMA regions, against the advice above?
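One way to avoid editing the script by hand for every combination is to override its task and thread counts on the sbatch command line, since command-line options take precedence over #SBATCH directives. This is only a sketch: hybrid.slurm is the hypothetical script name used above, and you will want to give each run its own log file (for example with mdrun's -g option) so the results are not overwritten.

# submit one job per (MPI ranks, OpenMP threads) combination by overriding
# the #SBATCH values in the script on the command line
for combo in "48 1" "24 2" "16 3" "12 4" "9 5" "8 6" "4 12" "2 24"; do
    set -- $combo
    sbatch --ntasks-per-node=$1 --cpus-per-task=$2 hybrid.slurm
done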
You may find it helpful to fill out this table; one way to pull the walltime and performance out of md.log is sketched after the table.
| MPI ranks | OpenMP threads | Walltime (s) | Performance (ns/day) |
| --- | --- | --- | --- |
| 48 | 1 | | |
| 24 | 2 | | |
| 16 | 3 | | |
| 12 | 4 | | |
| 9 | 5 | | |
| 8 | 6 | | |
| 4 | 12 | | |
| 2 | 24 | | |
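To fill in the table, you can pull the timing summary out of each run's md.log. The grep patterns below are a sketch based on the usual layout of the GROMACS log; the exact wording may differ slightly between versions, and they assume each run keeps its own copy of md.log.

# the timing summary at the end of md.log holds the wall time in seconds
# (the "Time:" line) and the performance in ns/day (the "Performance:" line)
grep -A 3 "Core t (s)" md.log
grep "Performance:" md.log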
Load balancing
GROMACS performs dynamic load balancing when it deems it necessary. Can you tell from your md.log files so far whether it has been doing so, and what it calculated the load imbalance to be before deciding to do so?
To demonstrate the effect of the load imbalance counteracted by GROMACS’s dynamic load balancing scheme, investigate what happens when it is turned off by adding the -dlb no option to gmx_mpi mdrun. How does removing load balancing affect the simulation runtime? Does this effect change as the number of cores used increases?
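As a sketch of how you might run and inspect such a comparison (the grep patterns are guesses at the wording GROMACS uses in its domain-decomposition statistics, which can vary between versions):

# run with dynamic load balancing switched off
srun gmx_mpi mdrun -dlb no -ntomp ${SLURM_CPUS_PER_TASK} -s benchMEM.tpr

# look for the load-balancing information in the log
grep -i "load imbalance" md.log
grep -i "dynamic load balancing" md.log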
Key Points
Adding OpenMP threads to an MPI GROMACS run does affect performance.
When running hybrid jobs, placement across NUMA regions is important.