fix hamiltonian replica exchange simulation hang issue at checkpoint …
Description
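This patch fixes issue #829:
During a Hamiltonian replica exchange multi-simulation, some ranks may hang while others are writing checkpoints. Attaching a debugger to the hung ranks gives the two stack traces below: the first is from a rank blocked in PMPI_Barrier inside write_checkpoint(), the second from a rank blocked in PMPI_Recv inside exchange_state():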
#0 0x00007f953708a4e0 in ?? () from /usr/lib/x86_64-linux-gnu/libmpich.so.12
#1 0x00007f95370791aa in ?? () from /usr/lib/x86_64-linux-gnu/libmpich.so.12
#2 0x00007f9536f5e35b in ?? () from /usr/lib/x86_64-linux-gnu/libmpich.so.12
#3 0x00007f9536fd9456 in ?? () from /usr/lib/x86_64-linux-gnu/libmpich.so.12
#4 0x00007f9536fda1fc in ?? () from /usr/lib/x86_64-linux-gnu/libmpich.so.12
#5 0x00007f9536f924c7 in ?? () from /usr/lib/x86_64-linux-gnu/libmpich.so.12
#6 0x00007f9536eeb921 in ?? () from /usr/lib/x86_64-linux-gnu/libmpich.so.12
#7 0x00007f9536eeba9d in ?? () from /usr/lib/x86_64-linux-gnu/libmpich.so.12
#8 0x00007f9536f9260c in ?? () from /usr/lib/x86_64-linux-gnu/libmpich.so.12
#9 0x00007f9536eeb9f3 in ?? () from /usr/lib/x86_64-linux-gnu/libmpich.so.12
#10 0x00007f9536eeba9d in ?? () from /usr/lib/x86_64-linux-gnu/libmpich.so.12
#11 0x00007f9536eebbcb in PMPI_Barrier () from /usr/lib/x86_64-linux-gnu/libmpich.so.12
#12 0x00007f9537ded120 in write_checkpoint(char const*, bool, _IO_FILE*, t_commrec const*, int*, int, int, int, bool, int, long, double, t_state*, ObservablesHistory*, gmx::MdModulesNotifier const&, gmx::WriteCheckpointDataHolder*, bool, int) () from /usr/local/gromacs/lib/libgromacs_mpi.so.6
#13 0x00007f9537dedba3 in mdoutf_write_checkpoint(gmx_mdoutf*, _IO_FILE*, t_commrec const*, long, double, t_state*, ObservablesHistory*, gmx::WriteCheckpointDataHolder*) ()
from /usr/local/gromacs/lib/libgromacs_mpi.so.6
#14 0x00007f9537dede54 in mdoutf_write_to_trajectory_files(_IO_FILE*, t_commrec const*, gmx_mdoutf*, int, int, long, double, t_state*, t_state*, ObservablesHistory*, gmx::ArrayRef<gmx::BasicVector<float> const>, gmx::WriteCheckpointDataHolder*) () from /usr/local/gromacs/lib/libgromacs_mpi.so.6
#15 0x00007f9537e1acff in do_md_trajectory_writing(_IO_FILE*, t_commrec*, int, t_filenm const*, long, long, double, t_inputrec*, t_state*, t_state*, ObservablesHistory*, gmx_mtop_t const*, t_forcerec*, gmx_mdoutf*, gmx::EnergyOutput const&, gmx_ekindata_t*, gmx::ArrayRef<gmx::BasicVector<float> const>, bool, bool, bool, bool, bool) () from /usr/local/gromacs/lib/libgromacs_mpi.so.6
#16 0x00007f9537f17a7d in gmx::LegacySimulator::do_md() () from /usr/local/gromacs/lib/libgromacs_mpi.so.6
#17 0x00007f9537f1558d in gmx::LegacySimulator::run() () from /usr/local/gromacs/lib/libgromacs_mpi.so.6
#18 0x00007f9537f4f73c in gmx::Mdrunner::mdrunner() () from /usr/local/gromacs/lib/libgromacs_mpi.so.6
#19 0x000055bf81b9231c in gmx::gmx_mdrun(int, gmx_hw_info_t const&, int, char**) ()
#20 0x000055bf81b92417 in gmx::gmx_mdrun(int, char**) ()
#21 0x00007f95378acde2 in gmx::CommandLineModuleManager::run(int, char**) () from /usr/local/gromacs/lib/libgromacs_mpi.so.6
#22 0x000055bf81b9088c in main ()
#0 0x00007f74b07fb046 in ?? () from /usr/lib/x86_64-linux-gnu/libmpich.so.12
#1 0x00007f74b06e035b in ?? () from /usr/lib/x86_64-linux-gnu/libmpich.so.12
#2 0x00007f74b06d374e in PMPI_Recv () from /usr/lib/x86_64-linux-gnu/libmpich.so.12
#3 0x00007f74b16bde44 in exchange_rvecs(gmx_multisim_t const*, int, float (*) [3], int) [clone .isra.4] [clone .part.5] [clone .constprop.63] () from /usr/local/gromacs/lib/libgromacs_mpi.so.6
#4 0x00007f74b16bf227 in exchange_state(gmx_multisim_t const*, int, t_state*) () from /usr/local/gromacs/lib/libgromacs_mpi.so.6
#5 0x00007f74b169cbd9 in gmx::LegacySimulator::do_md() () from /usr/local/gromacs/lib/libgromacs_mpi.so.6
#6 0x00007f74b169758d in gmx::LegacySimulator::run() () from /usr/local/gromacs/lib/libgromacs_mpi.so.6
#7 0x00007f74b16d173c in gmx::Mdrunner::mdrunner() () from /usr/local/gromacs/lib/libgromacs_mpi.so.6
#8 0x000056080e98831c in gmx::gmx_mdrun(int, gmx_hw_info_t const&, int, char**) ()
#9 0x000056080e988417 in gmx::gmx_mdrun(int, char**) ()
#10 0x00007f74b102ede2 in gmx::CommandLineModuleManager::run(int, char**) () from /usr/local/gromacs/lib/libgromacs_mpi.so.6
#11 0x000056080e98688c in main ()
The bug can be triggered as follows. Suppose there are 4 replicas, and replicas 1 and 2 exchange at step X. They then go on to write a checkpoint, which calls PMPI_Barrier in write_checkpoint() to wait for all other replicas to write at the same step. Replicas 0 and 3, however, do not exchange in this round, so the subsequent checkpointHandler->decideIfCheckpointingThisStep() check fails for them and they run on to the next exchange step (X+replex), where they block in PMPI_Recv called from exchange_state(). Neither group can make progress: one waits in the barrier, the other in the receive.
It can be fixed by passing the previous bDoReplEx value to decideIfCheckpointingThisStep: if some replicas decide to write a checkpoint because of bExchanged, the others must make the same decision.
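The mismatch can be illustrated with a small toy model. This is only a sketch of the reasoning above, not the GROMACS code: `Flags`, `decideFromOwnExchange`, `decideFromExchangeStep`, `barrierWouldHang` and `checkScenario` are hypothetical names, and the model assumes that the per-replica flag plays the role of bExchanged while the shared flag plays the role of bDoReplEx.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Toy model only: none of these names exist in GROMACS or PLUMED.
constexpr std::size_t kNumReplicas = 4;
using Flags = std::array<bool, kNumReplicas>;

// Before the fix (as modelled here): a replica checkpoints only if it
// exchanged itself, so the decision differs from replica to replica.
Flags decideFromOwnExchange(const Flags& exchanged)
{
    return exchanged;
}

// After the fix (as modelled here): every replica uses the same
// "this was an exchange step" flag, so all replicas agree.
Flags decideFromExchangeStep(bool wasExchangeStep)
{
    Flags decision;
    decision.fill(wasExchangeStep);
    return decision;
}

// A collective barrier hangs when some replicas enter it and others do not.
bool barrierWouldHang(const Flags& writesCheckpoint)
{
    std::size_t entering = 0;
    for (bool writes : writesCheckpoint)
    {
        entering += writes ? 1 : 0;
    }
    return entering != 0 && entering != kNumReplicas;
}

// Scenario from the description: only replicas 1 and 2 exchange at step X.
void checkScenario()
{
    const Flags exchanged = { false, true, true, false };
    assert(barrierWouldHang(decideFromOwnExchange(exchanged)));
    assert(!barrierWouldHang(decideFromExchangeStep(true)));
}
```

Calling `checkScenario()` exercises both cases: the per-replica decision leaves replicas 0 and 3 outside the barrier, while the shared decision brings all four replicas into it.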
Target release
I would like my code to appear in release v2.8
Type of contribution
- [ ] changes to code or doc authored by PLUMED developers, or additions of code in the core or within the default modules
- [X] changes to a module not authored by you
- [ ] new module contribution or edit of a module authored by you
Copyright
- [X] I agree to transfer the copyright of the code I have written to the PLUMED developers or to the author of the code I am modifying.
- [ ] the module I added or modified contains a `COPYRIGHT` file with the correct license information. Code should be released under an open source license. I also used the command `cd src && ./header.sh mymodulename` in order to make sure the headers of the module are correct.
Tests
- [ ] I added a new regtest or modified an existing regtest to validate my changes.
- [ ] I verified that all regtests are passed successfully on GitHub Actions.