
Hamiltonian replica exchange simulation might hang at gromacs checkpoint writing

Open shazj99 opened this issue 3 years ago • 2 comments

Summary While running a replica-exchange multi-simulation, the gmx_mpi processes hang.

GROMACS version 2021.5-plumed-2.7.2

Steps to reproduce

cd gromacs-test
mpirun -np 4 gmx_mpi mdrun -v -ntomp 12 -cpt 1 --deffnm lambda -plumed plumed.dat -hrex -replex 100 -nb gpu -bonded gpu -pme gpu -multidir lambda0 lambda1 lambda2 lambda3 -gpu_id 01

gromacs-test.tar.gz

(Gromacs compile options: cmake .. -DGMX_BUILD_OWN_FFTW=ON -DGMX_GPU=CUDA -DCUDA_cufft_LIBRARY=/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcufft.so.10 -DREGRESSIONTEST_DOWNLOAD=OFF -DCMAKE_SOURCE_DIR=/usr/local/cuda-11.0/targets/x86_64-linux/include -DGMX_MPI=ON)

What is the current bug behavior? The simulation may hang while some ranks are writing a checkpoint. Attaching a debugger to the hung processes (gdb -p PID, then typing "bt") gives the stack traces below:

#0  0x00007f953708a4e0 in ?? () from /usr/lib/x86_64-linux-gnu/libmpich.so.12
#1  0x00007f95370791aa in ?? () from /usr/lib/x86_64-linux-gnu/libmpich.so.12
#2  0x00007f9536f5e35b in ?? () from /usr/lib/x86_64-linux-gnu/libmpich.so.12
#3  0x00007f9536fd9456 in ?? () from /usr/lib/x86_64-linux-gnu/libmpich.so.12
#4  0x00007f9536fda1fc in ?? () from /usr/lib/x86_64-linux-gnu/libmpich.so.12
#5  0x00007f9536f924c7 in ?? () from /usr/lib/x86_64-linux-gnu/libmpich.so.12
#6  0x00007f9536eeb921 in ?? () from /usr/lib/x86_64-linux-gnu/libmpich.so.12
#7  0x00007f9536eeba9d in ?? () from /usr/lib/x86_64-linux-gnu/libmpich.so.12
#8  0x00007f9536f9260c in ?? () from /usr/lib/x86_64-linux-gnu/libmpich.so.12
#9  0x00007f9536eeb9f3 in ?? () from /usr/lib/x86_64-linux-gnu/libmpich.so.12
#10 0x00007f9536eeba9d in ?? () from /usr/lib/x86_64-linux-gnu/libmpich.so.12
#11 0x00007f9536eebbcb in PMPI_Barrier () from /usr/lib/x86_64-linux-gnu/libmpich.so.12
#12 0x00007f9537ded120 in write_checkpoint(char const*, bool, _IO_FILE*, t_commrec const*, int*, int, int, int, bool, int, long, double, t_state*, ObservablesHistory*, gmx::MdModulesNotifier const&, gmx::WriteCheckpointDataHolder*, bool, int) () from /usr/local/gromacs/lib/libgromacs_mpi.so.6
#13 0x00007f9537dedba3 in mdoutf_write_checkpoint(gmx_mdoutf*, _IO_FILE*, t_commrec const*, long, double, t_state*, ObservablesHistory*, gmx::WriteCheckpointDataHolder*) ()
   from /usr/local/gromacs/lib/libgromacs_mpi.so.6
#14 0x00007f9537dede54 in mdoutf_write_to_trajectory_files(_IO_FILE*, t_commrec const*, gmx_mdoutf*, int, int, long, double, t_state*, t_state*, ObservablesHistory*, gmx::ArrayRef<gmx::BasicVector<float> const>, gmx::WriteCheckpointDataHolder*) () from /usr/local/gromacs/lib/libgromacs_mpi.so.6
#15 0x00007f9537e1acff in do_md_trajectory_writing(_IO_FILE*, t_commrec*, int, t_filenm const*, long, long, double, t_inputrec*, t_state*, t_state*, ObservablesHistory*, gmx_mtop_t const*, t_forcerec*, gmx_mdoutf*, gmx::EnergyOutput const&, gmx_ekindata_t*, gmx::ArrayRef<gmx::BasicVector<float> const>, bool, bool, bool, bool, bool) () from /usr/local/gromacs/lib/libgromacs_mpi.so.6
#16 0x00007f9537f17a7d in gmx::LegacySimulator::do_md() () from /usr/local/gromacs/lib/libgromacs_mpi.so.6
#17 0x00007f9537f1558d in gmx::LegacySimulator::run() () from /usr/local/gromacs/lib/libgromacs_mpi.so.6
#18 0x00007f9537f4f73c in gmx::Mdrunner::mdrunner() () from /usr/local/gromacs/lib/libgromacs_mpi.so.6
#19 0x000055bf81b9231c in gmx::gmx_mdrun(int, gmx_hw_info_t const&, int, char**) ()
#20 0x000055bf81b92417 in gmx::gmx_mdrun(int, char**) ()
#21 0x00007f95378acde2 in gmx::CommandLineModuleManager::run(int, char**) () from /usr/local/gromacs/lib/libgromacs_mpi.so.6
#22 0x000055bf81b9088c in main ()

#0  0x00007f74b07fb046 in ?? () from /usr/lib/x86_64-linux-gnu/libmpich.so.12
#1  0x00007f74b06e035b in ?? () from /usr/lib/x86_64-linux-gnu/libmpich.so.12
#2  0x00007f74b06d374e in PMPI_Recv () from /usr/lib/x86_64-linux-gnu/libmpich.so.12
#3  0x00007f74b16bde44 in exchange_rvecs(gmx_multisim_t const*, int, float (*) [3], int) [clone .isra.4] [clone .part.5] [clone .constprop.63] () from /usr/local/gromacs/lib/libgromacs_mpi.so.6
#4  0x00007f74b16bf227 in exchange_state(gmx_multisim_t const*, int, t_state*) () from /usr/local/gromacs/lib/libgromacs_mpi.so.6
#5  0x00007f74b169cbd9 in gmx::LegacySimulator::do_md() () from /usr/local/gromacs/lib/libgromacs_mpi.so.6
#6  0x00007f74b169758d in gmx::LegacySimulator::run() () from /usr/local/gromacs/lib/libgromacs_mpi.so.6
#7  0x00007f74b16d173c in gmx::Mdrunner::mdrunner() () from /usr/local/gromacs/lib/libgromacs_mpi.so.6
#8  0x000056080e98831c in gmx::gmx_mdrun(int, gmx_hw_info_t const&, int, char**) ()
#9  0x000056080e988417 in gmx::gmx_mdrun(int, char**) ()
#10 0x00007f74b102ede2 in gmx::CommandLineModuleManager::run(int, char**) () from /usr/local/gromacs/lib/libgromacs_mpi.so.6
#11 0x000056080e98688c in main ()

These processes are hanging in PMPI_Barrier and PMPI_Recv, which looks like a deadlock. The second stack is related to the PLUMED patch: the '-hrex' command-line option triggers it and calls exchange_state().

shazj99 Jun 07 '22 03:06

Thanks! This is the same as #742

The report on stack tracing is very useful. I will try to have a look at this in the next couple of weeks.

GiovanniBussi Jun 07 '22 06:06

Hi @GiovanniBussi ,

After digging into the code, I think I found the root cause. It can be triggered as follows: suppose there are 4 replicas, and replicas #1 and #2 have exchanged at step X. They then go on to write a checkpoint, which calls PMPI_Barrier in write_checkpoint() to wait for all other replicas to write at the same step. But since replicas #0 and #3 did not exchange in this round, their subsequent check in checkpointHandler->decideIfCheckpointingThisStep fails, so they run ahead to the next exchange step (X+replex), where they wait on PMPI_Recv in exchange_state().
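To make the mismatch concrete, here is a minimal toy model of the scenario, with no MPI and no GROMACS code involved. The names `BlockedIn`, `blockingCalls`, and `barrierCompletes` are hypothetical and exist only for this illustration; the rule that only replicas which exchanged reach the barrier is taken from the description above, not from the GROMACS sources.

```cpp
#include <cassert>
#include <vector>

// Toy model only: no MPI and no GROMACS code. All names are hypothetical.
enum class BlockedIn { Barrier, Recv };

// Each entry of `exchanged` says whether that replica swapped state at step X.
// Assumption taken from the report: only replicas that exchanged pass the
// checkpoint check and enter the barrier; the others run on to the next
// exchange step and block in a receive.
std::vector<BlockedIn> blockingCalls(const std::vector<bool>& exchanged)
{
    std::vector<BlockedIn> calls;
    for (bool e : exchanged) {
        calls.push_back(e ? BlockedIn::Barrier : BlockedIn::Recv);
    }
    return calls;
}

// A barrier completes only if every participant enters it.
bool barrierCompletes(const std::vector<BlockedIn>& calls)
{
    assert(!calls.empty());
    for (BlockedIn c : calls) {
        if (c != BlockedIn::Barrier) {
            return false;
        }
    }
    return true;
}
```

Calling `barrierCompletes(blockingCalls({false, true, true, false}))` models the 4-replica case described above and returns false: two replicas sit in the barrier while the other two sit in a receive, so neither group can make progress.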

I think it can be fixed by passing the previous bDoReplEx value to decideIfCheckpointingThisStep: if some replicas decide to write a checkpoint because of bExchanged, the others should do so as well. I have pasted the changes below; if you agree, I'll send a PR for it.

Thanks.

--- md.cpp.orig	2022-06-08 00:22:11.286821932 +0800
+++ md.cpp	2022-06-08 00:28:41.643920122 +0800
@@ -177,7 +177,7 @@
     gmx_repl_ex_t     repl_ex = nullptr;
     gmx_global_stat_t gstat;
     gmx_shellfc_t*    shellfc;
-    gmx_bool          bSumEkinhOld, bDoReplEx, bExchanged, bNeedRepartition;
+    gmx_bool          bSumEkinhOld, bDoReplEx, bDoReplExPrev, bExchanged, bNeedRepartition;
     gmx_bool          bTemp, bPres, bTrotter;
     real              dvdl_constr;
     std::vector<RVec> cbuf;
@@ -693,6 +693,7 @@
     bSumEkinhOld     = FALSE;
     bExchanged       = FALSE;
     bNeedRepartition = FALSE;
+    bDoReplEx        = FALSE;

     step     = ir->init_step;
     step_rel = 0;
@@ -760,6 +761,7 @@
                            && (!bFirstStep));
         }

+        bDoReplExPrev = bDoReplEx;
         bDoReplEx = (useReplicaExchange && (step > 0) && !bLastStep
                      && do_per_step(step, replExParams.exchangeInterval));

@@ -873,7 +875,7 @@
         }
         clear_mat(force_vir);

-        checkpointHandler->decideIfCheckpointingThisStep(bNS, bFirstStep, bLastStep);
+        checkpointHandler->decideIfCheckpointingThisStep(bNS||bDoReplExPrev, bFirstStep, bLastStep);

         /* Determine the energy and pressure:
          * at nstcalcenergy steps and at energy output steps (set below).
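The reason a carried-over bDoReplEx can work as a gate is that it appears to depend only on the step counter and the exchange interval, so every replica should compute the same value, unlike bExchanged. The sketch below illustrates that bookkeeping in isolation. `isExchangeStep` and `stepsWithPrevFlag` are hypothetical names, and the model leaves out the useReplicaExchange and bLastStep terms of the real condition.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the bDoReplEx condition in the patch: an exchange
// is attempted on positive steps that are multiples of the interval.
bool isExchangeStep(std::int64_t step, std::int64_t interval)
{
    return step > 0 && interval > 0 && step % interval == 0;
}

// Returns the steps in [first, last] at which the carried-over flag is set,
// i.e. the steps immediately following an exchange attempt.
std::vector<std::int64_t> stepsWithPrevFlag(std::int64_t first,
                                            std::int64_t last,
                                            std::int64_t interval)
{
    std::vector<std::int64_t> steps;
    bool doReplEx = false; // initialised before the loop, as in the patch
    for (std::int64_t step = first; step <= last; ++step) {
        const bool doReplExPrev = doReplEx; // value from the previous step
        doReplEx = isExchangeStep(step, interval);
        if (doReplExPrev) {
            steps.push_back(step);
        }
    }
    return steps;
}
```

With an interval of 100, `stepsWithPrevFlag(0, 250, 100)` yields steps 101 and 201. Nothing in the computation refers to per-replica state, which is the property the patch relies on.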

shazj99 Jun 07 '22 17:06

Closed by #831

GiovanniBussi Sep 16 '22 13:09