BioSimSpace
Improving GROMACS performance
With the default settings, a 200 ps equilibration of a ~50k-atom system for a binding free energy calculation takes more than 24 hours to complete on a typical GPU cluster with an optimised GROMACS installation.
Just to clarify, you mention binding free energy calculations, but are seeing poor performance solely for an equilibration?
Thoughts / questions below:
- Could you post a GROMACS log file for the equilibration? Hardware detection information will be at the top and timing statistics at the bottom. Note that GROMACS tries to optimise certain options depending on what hardware it detects, which might end up being sub-optimal if it gets things wrong (see below).
- Was this simulation run on your cluster while the other nodes were active? If so, could you try running it as the only job on the cluster? During the workshop week it was apparent that gmx was detecting the resources of the whole cluster, rather than those of the individual Jupyter server. As such, GROMACS processes tried to grab too many resources and ended up running far more slowly than expected. (This was apparent when running top, which showed a CPU load in the 1000s of percent.) See the shell sketch after this list for a quick way to check and pin the resources.
- If this is for an equilibration containing a perturbable molecule, could you possibly run an equilibration for a similarly sized system that only contains regular molecules? (Perhaps using the same protein and one of the two ligands, solvated in the same size box.) I wonder if something funny is going on with dummy atoms, or with the additional properties for lambda = 1 (which should be redundant).
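If resource detection is the culprit, a quick sanity check is to compare what the job can actually see with what mdrun reports in its log, and then pin mdrun to that allocation explicitly. A minimal shell sketch, assuming the log is called md.log and the job was given 8 cores and one GPU:
nproc                            # CPUs visible to this process
nvidia-smi -L                    # GPUs visible to this process
grep "Running on" md.log         # what mdrun thought the node had
grep -B1 "Performance:" md.log   # timing summary at the bottom of the log
export CUDA_VISIBLE_DEVICES=0    # expose only the GPU you were allocated
gmx mdrun -v -deffnm md -nt 8 -pin on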
I'll update this comment if I come up with any more ideas.
(I would try running on our cluster here, but it isn't optimised for GPU simulations since we can't enable certain kernel features that allow acceleration through overclocking. When I tried tweaking options and command-line parameters to improve the speed of the ethane-methanol simulations I saw no improvement regardless of what I tried.)
The cluster is in use right now, but I'll see what I can post.
Equilibration run (attached as equib.tar.gz)
Command line:
gmx mdrun -v -deffnm md -nb gpu -gpu_id 0 -nt 8
GROMACS version: 2019.1
Precision: single
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support: CUDA
SIMD instructions: AVX2_256
FFT library: fftw-3.3.8-sse2-avx-avx2-avx2_128-avx512
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
C compiler: /usr/bin/cc GNU 7.3.0
C compiler flags: -mavx2 -mfma -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
C++ compiler: /usr/bin/c++ GNU 7.3.0
C++ compiler flags: -mavx2 -mfma -std=c++11 -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
CUDA compiler: /usr/local/cuda-9.2/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2018 NVIDIA Corporation;Built on Tue_Jun_12_23:07:04_CDT_2018;Cuda compilation tools, release 9.2, V9.2.148
CUDA compiler flags:-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_70,code=compute_70;-use_fast_math;-D_FORCE_INLINES;; ;-mavx2;-mfma;-std=c++11;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;
CUDA driver: 10.10
CUDA runtime: 9.20
Running on 1 node with total 16 cores, 32 logical cores, 2 compatible GPUs
Hardware detected:
CPU info:
Vendor: Intel
Brand: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
Family: 6 Model: 79 Stepping: 1
Features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma hle htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
Hardware topology: Basic
Sockets, cores, and logical processors:
Socket 0: [ 0 16] [ 1 17] [ 2 18] [ 3 19] [ 4 20] [ 5 21] [ 6 22] [ 7 23]
Socket 1: [ 8 24] [ 9 25] [ 10 26] [ 11 27] [ 12 28] [ 13 29] [ 14 30] [ 15 31]
GPU info:
Number of GPUs detected: 2
#0: NVIDIA GeForce GTX 1080 Ti, compute cap.: 6.1, ECC: no, stat: compatible
#1: NVIDIA GeForce GTX 1080 Ti, compute cap.: 6.1, ECC: no, stat: compatible
[...]
M E G A - F L O P S A C C O U N T I N G
NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
W3=SPC/TIP3p W4=TIP4p (single or pairs)
V&F=Potential and force V=Potential only F=Force only
Computing: M-Number M-Flops % Flops
-----------------------------------------------------------------------------
Pair Search distance check 5990.760160 53916.841 0.0
NxN Ewald Elec. + LJ [F] 6007635.399744 396503936.383 97.9
NxN Ewald Elec. + LJ [V&F] 60743.534464 6499558.188 1.6
1,4 nonbonded interactions 1237.912379 111412.114 0.0
Shift-X 51.029979 306.180 0.0
Bonds 243.502435 14366.644 0.0
Angles 859.408594 144380.644 0.0
Propers 1508.515085 345449.954 0.1
Impropers 98.200982 20425.804 0.0
Virial 51.075024 919.350 0.0
Update 5097.950979 158036.480 0.0
Stop-CM 51.080958 510.810 0.0
Calc-Ekin 102.059958 2755.619 0.0
Lincs 465.409308 27924.558 0.0
Lincs-Mat 2390.447808 9561.791 0.0
Constraint-V 10186.552796 81492.422 0.0
Constraint-Vir 48.653605 1167.687 0.0
Settle 3085.261704 996539.530 0.2
-----------------------------------------------------------------------------
Total 404972661.001 100.0
-----------------------------------------------------------------------------
R E A L C Y C L E A N D T I M E A C C O U N T I N G
On 1 MPI rank, each using 8 OpenMP threads
Computing: Num Num Call Wall time Giga-Cycles
Ranks Threads Count (s) total sum %
-----------------------------------------------------------------------------
Neighbor search 1 8 1001 4.651 78.142 2.7
Launch GPU ops. 1 8 200002 9.354 157.143 5.5
Force 1 8 100001 14.120 237.211 8.3
Wait PME GPU gather 1 8 100001 20.025 336.417 11.7
Reduce GPU PME F 1 8 100001 2.262 38.009 1.3
Wait GPU NB local 1 8 100001 39.169 658.038 22.9
NB X/F buffer ops. 1 8 199001 18.270 306.940 10.7
Write traj. 1 8 201 1.438 24.166 0.8
Update 1 8 200002 33.657 565.435 19.7
Constraints 1 8 200004 21.894 367.823 12.8
Rest 6.048 101.599 3.5
-----------------------------------------------------------------------------
Total 170.888 2870.923 100.0
-----------------------------------------------------------------------------
Core t (s) Wall t (s) (%)
Time: 1367.041 170.888 800.0
(ns/day) (hour/ns)
Performance: 50.560 0.475
Finished mdrun on rank 0 Sat May 25 23:22:20 2019
lambda = 0.00 run: same system, but with a perturbed molecule:
GROMACS: gmx mdrun, version 2019.1
Executable: /export/users/common/Gromacs19.1/bin/gmx
Data prefix: /export/users/common/Gromacs19.1
Working dir: /export/users/ppxasjsm/Projects/Tyk2/BSS/GROMACS/6340/TYK2_17_8/bound/lambda_0.0000
Process ID: 12141
Command line:
gmx mdrun -v -deffnm gromacs -nb gpu -gpu_id 0 -nt 8
GROMACS version: 2019.1
Precision: single
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support: CUDA
SIMD instructions: AVX2_256
FFT library: fftw-3.3.8-sse2-avx-avx2-avx2_128-avx512
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
C compiler: /usr/bin/cc GNU 7.3.0
C compiler flags: -mavx2 -mfma -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
C++ compiler: /usr/bin/c++ GNU 7.3.0
C++ compiler flags: -mavx2 -mfma -std=c++11 -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
CUDA compiler: /usr/local/cuda-9.2/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2018 NVIDIA Corporation;Built on Tue_Jun_12_23:07:04_CDT_2018;Cuda compilation tools, release 9.2, V9.2.148
CUDA compiler flags:-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_70,code=compute_70;-use_fast_math;-D_FORCE_INLINES;; ;-mavx2;-mfma;-std=c++11;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;
CUDA driver: 10.10
CUDA runtime: 9.20
Running on 1 node with total 16 cores, 32 logical cores, 2 compatible GPUs
Hardware detected:
CPU info:
Vendor: Intel
Brand: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
Family: 6 Model: 79 Stepping: 1
Features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma hle htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
Hardware topology: Basic
Sockets, cores, and logical processors:
Socket 0: [ 0 16] [ 1 17] [ 2 18] [ 3 19] [ 4 20] [ 5 21] [ 6 22] [ 7 23]
Socket 1: [ 8 24] [ 9 25] [ 10 26] [ 11 27] [ 12 28] [ 13 29] [ 14 30] [ 15 31]
GPU info:
Number of GPUs detected: 2
#0: NVIDIA GeForce GTX 1080 Ti, compute cap.: 6.1, ECC: no, stat: compatible
#1: NVIDIA GeForce GTX 1080 Ti, compute cap.: 6.1, ECC: no, stat: compatible
R E A L C Y C L E A N D T I M E A C C O U N T I N G
On 1 MPI rank, each using 8 OpenMP threads
Computing: Num Num Call Wall time Giga-Cycles
Ranks Threads Count (s) total sum %
-----------------------------------------------------------------------------
Neighbor search 1 8 30001 342.311 5750.826 1.7
Launch GPU ops. 1 8 3000001 288.609 4848.619 1.4
Force 1 8 3000001 7842.374 131751.752 38.9
PME mesh 1 8 3000001 9103.332 152935.835 45.2
Wait Bonded GPU 1 8 30001 0.149 2.505 0.0
Wait GPU NB local 1 8 3000001 51.893 871.810 0.3
NB X/F buffer ops. 1 8 5970001 484.335 8136.822 2.4
Write traj. 1 8 6018 39.080 656.539 0.2
Update 1 8 6000002 1231.074 20682.030 6.1
Constraints 1 8 6000004 650.709 10931.903 3.2
Rest 123.452 2073.993 0.6
-----------------------------------------------------------------------------
Total 20157.319 338642.633 100.0
-----------------------------------------------------------------------------
Breakdown of PME mesh computation
-----------------------------------------------------------------------------
PME spread 1 8 6000002 3330.029 55944.432 16.5
PME gather 1 8 6000002 1925.679 32351.369 9.6
PME 3D-FFT 1 8 12000004 3380.822 56797.759 16.8
PME solve Elec 1 8 6000002 449.137 7545.490 2.2
-----------------------------------------------------------------------------
Core t (s) Wall t (s) (%)
Time: 161258.471 20157.319 800.0
5h35:57
(ns/day) (hour/ns)
Performance: 12.859 1.866
Finished mdrun on rank 0 Sun May 26 05:20:04 2019
Slurm submission file:
#!/bin/bash -login
#SBATCH --nodes=1
#SBATCH --cpus-per-task=8
#SBATCH --job-name=NAME
#SBATCH --output=NAME.out
#SBATCH --time=7-00:00:00
#SBATCH -p gpu
#SBATCH --gres=gpu:2
#SBATCH --gres-flags=enforce-binding
# Disable Sire analytics.
export SIRE_DONT_PHONEHOME=1
export SIRE_SILENT_PHONEHOME=1
# Make sure nvcc is in the path.
export PATH=/usr/local/cuda-9.2/bin:$PATH
# Set path to local AmberTools installation.
#export AMBERHOME=/mnt/shared/software/amber18
# Source the GROMACS shell rc, making sure mount point exists.
#while [ ! -f /mnt/shared/software/gromacs/bin/GMXRC ]; do
# sleep 1s
#done
#source /mnt/shared/software/gromacs/bin/GMXRC
# Set the OpenMM plugin directory.
export OPENMM_PLUGIN_DIR=/export/users/ppxasjsm/miniconda3/lib/plugins
# Make a unique directory for this job and move to it.
mkdir $SLURM_SUBMIT_DIR/$SLURM_JOB_ID
cd $SLURM_SUBMIT_DIR/$SLURM_JOB_ID
export JOB_DIR=$SLURM_SUBMIT_DIR
# Run the script using the BioSimSpace python interpreter.
# Make sure GPU ID 0 is first.
# Forwards.
time /export/users/ppxasjsm/miniconda3/bin/sire_python --ppn=8 $JOB_DIR/binding_freenrg_gmx.py LIG0 LIG1 0 &
# Make sure GPU ID 1 is first.
export CUDA_VISIBLE_DEVICES=1,0
# Backwards.
time /export/users/ppxasjsm/miniconda3/bin/sire_python --ppn=8 $JOB_DIR/binding_freenrg_gmx.py LIG1 LIG0 1
wait
I finished that comment slightly too early.
If I run the same script but do not update the GROMACS command-line arguments, i.e. use
gmx mdrun -v -deffnm
and do not add:
from collections import OrderedDict
d = OrderedDict([('mdrun', True), ('-v', True), ('-deffnm', 'gromacs'), ('-nb', 'gpu'), ('-gpu_id', num3), ('-nt', 8)])  # num3 holds the GPU index
freenrg._update_run_args(d)
I get the following performance:
Command line:
gmx mdrun -v -deffnm md
Back Off! I just backed up md.log to ./#md.log.2#
Reading file md.tpr, VERSION 2019.1 (single precision)
Changing nstlist from 10 to 100, rlist from 1.2 to 1.294
Using 8 MPI threads
Using 4 OpenMP threads per tMPI thread
On host node01 4 GPUs selected for this run.
Mapping of GPU IDs to the 8 GPU tasks in the 8 ranks on this node:
PP:0,PP:0,PP:1,PP:1,PP:2,PP:2,PP:3,PP:3
PP tasks will do (non-perturbed) short-ranged and most bonded interactions on the GPU
Back Off! I just backed up md.trr to ./#md.trr.2#
Back Off! I just backed up md.edr to ./#md.edr.2#
NOTE: DLB will not turn on during the first phase of PME tuning
starting mdrun 'BioSimSpace System'
100000 steps, 100.0 ps.
step 200: timed with pme grid 72 72 72, coulomb cutoff 1.200: 19205.2 M-cycles
step 400: timed with pme grid 60 60 60, coulomb cutoff 1.333: 21291.8 M-cycles
^C
Received the INT signal, stopping within 200 steps
step 600: timed with pme grid 52 52 52, coulomb cutoff 1.538: 20658.8 M-cycles
Dynamic load balancing report:
DLB was locked at the end of the run due to unfinished PP-PME balancing.
Average load imbalance: 3.1%.
The balanceable part of the MD step is 42%, load imbalance is computed from this.
Part of the total run time spent waiting due to load imbalance: 1.3%.
Core t (s) Wall t (s) (%)
Time: 2308.998 72.164 3199.6
(ns/day) (hour/ns)
Performance: 0.839 28.596
GROMACS reminds you: "A Pretty Village Burning Makes a Pretty Fire" (David Sandstrom)
Interesting; I tried explicitly adding -nb gpu on our cluster but it made no difference. (Perhaps it won't for a small system.) When I looked at this page, it seemed that there are a bunch of options that you can set to auto, cpu, or gpu, such as -nb. By default they are set to auto, which should use a compatible GPU if one is found. It seems stupid that you get better performance by setting this explicitly, since the log clearly states that a compatible GPU was found, so it should be used for the non-bonded calculation! Do you know if you get even better performance by enabling the gpu option for other calculations, such as bonded and pme, or is the non-bonded calculation the real bottleneck?
Looking at the output above...
PP tasks will do (non-perturbed) short-ranged and most bonded interactions on the GPU
... it looks like it has only chosen to do short-ranged and bonded interactions on the GPU.
I notice that you also explicitly set the gpu_id. Is this needed to get good performance, or does GROMACS not autodetect things correctly? I've not done this myself and it has still found the correct GPU that is available on the node. (Perhaps you were doing this for the forward and reverse simulations.)
I also notice that you set the number of threads with -nt 8. When you don't do this, it looks like GROMACS still sets things correctly:
Using 8 MPI threads
Using 4 OpenMP threads per tMPI thread
I'm a little confused by the second output above (with the unmodified arguments) since it seems to be missing some info at the start of the log, e.g. the GROMACS version info and detected hardware.
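In case it helps, here is roughly what I mean by setting things explicitly rather than relying on auto-detection. This is only a sketch: -pme gpu needs GROMACS 2018 or later, -bonded gpu needs 2019 or later, the thread counts are just an example for an 8-core allocation, and for a perturbed system mdrun may refuse to put PME on the GPU (it will tell you why in that case).
export CUDA_VISIBLE_DEVICES=0   # expose only the GPU you were allocated
gmx mdrun -v -deffnm md -nb gpu -pme gpu -bonded gpu -ntmpi 1 -ntomp 8 -pin on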
For reference, here are the relevant sections from a GROMACS log for one of the free legs of an ethane-methanol perturbation on BlueCrystal 4:
GROMACS: gmx mdrun, version 2018
Executable: /mnt/storage/software/apps/GROMACS-2018-MPI-GPU-Intel-2017/bin/gmx_mpi
Data prefix: /mnt/storage/software/apps/GROMACS-2018-MPI-GPU-Intel-2017
Working dir: /mnt/storage/scratch/lh17146/solvation_freenrgy/ethane_methanol/free/lambda_0.0000
Command line:
gmx_mpi mdrun -v -deffnm gromacs
GROMACS version: 2018
Precision: single
Memory model: 64 bit
MPI library: MPI
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support: CUDA
SIMD instructions: AVX2_256
FFT library: fftw-3.3.5-fma-sse2-avx-avx2-avx2_128-avx512
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
Built on: 2018-10-02 11:51:51
Built by: [email protected] [CMAKE]
Build OS/arch: Linux 3.10.0-514.10.2.el7.x86_64 x86_64
Build CPU vendor: Intel
Build CPU brand: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
Build CPU family: 6 Model: 79 Stepping: 1
Build CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma hle htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
C compiler: /mnt/storage/apps/intel/impi/2017.1.132/bin64/mpiicc Intel 17.0.1.20161005
C compiler flags: -march=core-avx2 -O3 -xHost -ip -no-prec-div -static-intel -std=gnu99 -O3 -DNDEBUG -ip -funroll-all-loops -alias-const -ansi-alias -no-prec-div -fimf-domain-exclusion=14 -qoverride-limits
C++ compiler: /mnt/storage/apps/intel/impi/2017.1.132/bin64/mpiicpc Intel 17.0.1.20161005
C++ compiler flags: -march=core-avx2 -O3 -xHost -ip -no-prec-div -static-intel -std=c++11 -O3 -DNDEBUG -ip -funroll-all-loops -alias-const -ansi-alias -no-prec-div -fimf-domain-exclusion=14 -qoverride-limits
CUDA compiler: /mnt/storage/software/libraries/nvidia/cuda-9.0/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2017 NVIDIA Corporation;Built on Fri_Sep__1_21:08:03_CDT_2017;Cuda compilation tools, release 9.0, V9.0.176
CUDA compiler flags:-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_70,code=compute_70;-use_fast_math;;; ;-march=core-avx2;-O3;-xHost;-ip;-no-prec-div;-static-intel;-std=c++11;-O3;-DNDEBUG;-ip;-funroll-all-loops;-alias-const;-ansi-alias;-no-prec-div;-fimf-domain-exclusion=14;-qoverride-limits;
CUDA driver: 9.10
CUDA runtime: 9.0
Running on 1 node with total 28 cores, 28 logical cores, 1 compatible GPU
Hardware detected on host gpu22.bc4.acrc.priv (the node of MPI rank 0):
CPU info:
Vendor: Intel
Brand: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
Family: 6 Model: 79 Stepping: 1
Features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma hle htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
Hardware topology: Basic
Sockets, cores, and logical processors:
Socket 0: [ 0] [ 1] [ 2] [ 3] [ 4] [ 5] [ 6] [ 7] [ 8] [ 9] [ 10] [ 11] [ 12] [ 13]
Socket 1: [ 14] [ 15] [ 16] [ 17] [ 18] [ 19] [ 20] [ 21] [ 22] [ 23] [ 24] [ 25] [ 26] [ 27]
GPU info:
Number of GPUs detected: 1
#0: NVIDIA Tesla P100-PCIE-16GB, compute cap.: 6.0, ECC: yes, stat: compatible
...
Using 1 MPI process
Using 28 OpenMP threads
1 GPU auto-selected for this run.
Mapping of GPU IDs to the 1 GPU task in the 1 rank on this node:
PP:0
NOTE: GROMACS was configured without NVML support hence it can not exploit
application clocks of the detected Tesla P100-PCIE-16GB GPU to improve performance.
Recompile with the NVML library (compatible with the driver used) or set application clocks manually.
...
P P - P M E L O A D B A L A N C I N G
PP/PME load balancing changed the cut-off and PME settings:
particle-particle PME
rcoulomb rlist grid spacing 1/beta
initial 1.200 nm 1.205 nm 25 25 25 0.120 nm 0.384 nm
final 1.200 nm 1.205 nm 25 25 25 0.120 nm 0.384 nm
cost-ratio 1.00 1.00
(note that these numbers concern only part of the total PP and PME load)
M E G A - F L O P S A C C O U N T I N G
NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
W3=SPC/TIP3p W4=TIP4p (single or pairs)
V&F=Potential and force V=Potential only F=Force only
Computing: M-Number M-Flops % Flops
-----------------------------------------------------------------------------
NB Free energy kernel 751541.952726 751541.953 0.6
Pair Search distance check 1664.380960 14979.429 0.0
NxN Ewald Elec. + LJ [F] 1763797.692096 116410647.678 93.5
NxN Ewald Elec. + LJ [V&F] 17819.671104 1906704.808 1.5
1,4 nonbonded interactions 5.040117 453.611 0.0
Calc Weights 3967.507935 142830.286 0.1
Spread Q Bspline 169280.338560 338560.677 0.3
Gather F Bspline 169280.338560 1015682.031 0.8
3D-FFT 435338.844320 3482710.755 2.8
Solve PME 624.981650 39998.826 0.0
Shift-X 13.227645 79.366 0.0
Bonds 0.560013 33.041 0.0
Angles 6.420096 1078.576 0.0
Propers 4.680045 1071.730 0.0
Virial 134.502690 2421.048 0.0
Update 1322.502645 40997.582 0.0
Stop-CM 13.230290 132.303 0.0
P-Coupling 132.252645 793.516 0.0
Calc-Ekin 264.505290 7141.643 0.0
Lincs 6.000024 360.001 0.0
Lincs-Mat 72.000288 288.001 0.0
Constraint-V 2649.007947 21192.064 0.0
Constraint-Vir 132.152643 3171.663 0.0
Settle 879.003516 283918.136 0.2
-----------------------------------------------------------------------------
Total 124466788.723 100.0
-----------------------------------------------------------------------------
R E A L C Y C L E A N D T I M E A C C O U N T I N G
On 1 MPI rank, each using 28 OpenMP threads
Computing: Num Num Call Wall time Giga-Cycles
Ranks Threads Count (s) total sum %
-----------------------------------------------------------------------------
Neighbor search 1 28 5001 9.126 613.290 0.7
Launch GPU ops. 1 28 500001 25.700 1727.034 2.0
Force 1 28 500001 855.686 57502.194 67.6
PME mesh 1 28 500001 208.135 13986.672 16.4
Wait GPU NB local 1 28 500001 4.359 292.954 0.3
NB X/F buffer ops. 1 28 995001 99.815 6707.581 7.9
Write traj. 1 28 1002 1.018 68.387 0.1
Update 1 28 1000002 21.893 1471.195 1.7
Constraints 1 28 1000002 25.935 1742.839 2.0
Rest 14.493 973.919 1.1
-----------------------------------------------------------------------------
Total 1266.160 85086.066 100.0
-----------------------------------------------------------------------------
Breakdown of PME mesh computation
-----------------------------------------------------------------------------
PME spread 1 28 1000002 77.285 5193.538 6.1
PME gather 1 28 1000002 52.921 3556.271 4.2
PME 3D-FFT 1 28 2000004 70.670 4749.053 5.6
PME solve Elec 1 28 1000002 3.495 234.855 0.3
-----------------------------------------------------------------------------
Core t (s) Wall t (s) (%)
Time: 35452.480 1266.160 2800.0
(ns/day) (hour/ns)
Performance: 68.238 0.352
Finished mdrun on rank 0 Wed Mar 13 09:43:37 2019
As you can see, GROMACS correctly detected the CPUs and GPU on the node without needing additional command-line arguments. The only concern that it raises is the lack of NVML support, but this can't be used on our cluster anyway. For this system, I see no improvement in performance if I set -nb gpu.
I also noticed that our GROMACS version isn't compiled to use thread_mpi as its MPI library. According to this, I shouldn't expect to get as good single node performance as you.
So I see better performance if I only use one GPU rather than four. I am not sure that setting the gpu_id is necessary; I was just playing around with that option before I had figured out how to restrict the visible GPUs with a Slurm script.
You get 68 ns/day on BlueCrystal for non-perturbed equilibrations?
No, my results are for the actual lambda = 0 stage of the free leg, so it's not a direct comparison. I was just showing how GROMACS auto hardware detection seemed to work for me. I've not got data for the equilibration part, since it was run in a temporary working directory.
I'd be happy to test performance here if you give me your input files, although BC4 seems to be totally unresponsive today.
Changed to a more appropriate issue title. From poking around, it seems like this isn't a BioSimSpace-specific problem. We should use this thread to debug and document reliable ways of getting good GROMACS performance.
For a bound FEP simulation with gromacs/20.4 (TYK2 in a 20 nm water box) I was seeing similar behaviour on our cluster (presumably the same one @ppxasjsm is referring to). What helped in my case was using mpirun with a single copy, i.e.
mpirun -np 1 gmx mdrun -v -deffnm gromacs 1> gromacs.log 2> gromacs.err
while also supplying the Slurm job with a single GPU (see the submission sketch after the output below). Any other configuration ended up oversubscribing the CPUs on our nodes. Finished simulation output:
step 2000000, remaining wall clock time: 0 s
Core t (s) Wall t (s) (%)
Time: 587682.841 18365.100 3200.0
5h06:05
(ns/day) (hour/ns)
Performance: 18.818 1.275
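For reference, a minimal Slurm submission matching that setup (one task, one GPU, eight cores) might look like the sketch below; the partition name and module line are assumptions for illustration.
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:1
#SBATCH -p gpu
# module load gromacs/20.4   # or however GROMACS is provided on your cluster
mpirun -np 1 gmx mdrun -v -deffnm gromacs -ntomp $SLURM_CPUS_PER_TASK 1> gromacs.log 2> gromacs.err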
Updating GROMACS to the latest version might help; from its release notes:
Free-energy kernels are accelerated using SIMD, which make free-energy calculations up to three times as fast when using GPUs.
My output for a system with 54016 atoms on an A100 GPU:
Equilibration:
Core t (s) Wall t (s) (%)
Time: 5329.276 231.708 2300.0
(ns/day) (hour/ns)
Performance: 74.577 0.322
Production (free energy=yes):
Core t (s) Wall t (s) (%)
Time: 200244.094 8706.265 2300.0
2h25:06
(ns/day) (hour/ns)
Performance: 39.696 0.605
Thanks for reporting, that's good to know. Since we don't bundle a version of GROMACS it's hard to provide settings that are optimised for any version (and hardware environment). I'm glad the free energy kernels are improving, though.