HREX with plumed 2.7.2 + gromacs 2020/2021, stalls at random checkpoint writing

Open simonlichtinger opened this issue 4 years ago • 3 comments

Dear plumed dev team,

While trying to run REST2 via hrex, I'm having the following issue.

At a random point into the simulation (anywhere between 300 ps and 3 ns in the runs I have observed), gromacs stalls while writing a set of checkpoint files. The main task is still running but drops to very low CPU usage, and the output from the -v flag freezes. Some replicas have already written the new checkpoint file while others have not. Notably, this happens after several checkpoint updates have already succeeded.
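A stall of this kind (process alive, output frozen) can be spotted by checking how long it has been since each replica's log file was last modified. The sketch below is only an illustration: the function name `stalled_logs` and the 600 s threshold are hypothetical, and the threshold has to be longer than the normal interval between log writes in your run.

```python
import time
from pathlib import Path


def stalled_logs(log_paths, max_idle_seconds=600, now=None):
    """Return (path, idle_seconds) for every log not modified recently.

    log_paths        -- iterable of log files, e.g. one per replica directory
    max_idle_seconds -- hypothetical threshold; choose it well above the
                        normal time between log writes
    now              -- reference time in seconds since the epoch
                        (defaults to the current time)
    """
    now = time.time() if now is None else now
    stalled = []
    for path in map(Path, log_paths):
        # Seconds elapsed since the file was last written to.
        idle = now - path.stat().st_mtime
        if idle > max_idle_seconds:
            stalled.append((str(path), idle))
    return stalled
```

With `-deffnm topol` and `-multidir run*`, the log files would typically be `run*/topol.log`, so something like `stalled_logs(Path(".").glob("run*/topol.log"))` could be called periodically from a separate monitoring script.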

I've tried troubleshooting with:

  • Testing different gromacs versions (occurs for 2020.4 and 2021.3)
  • Making sure there is enough disk space (there is)
  • Running on different architectures (occurs on my local machine with 2x GTX 3060, CUDA and AVX_512, as well as on a cluster with Tesla V100, CUDA and IBM_VSX)
  • Different openmpi versions (occurs for 4.0.5 and 4.1.1)
  • Different plumed versions (I can only use 2.7.2, as hrex fails at the first exchange attempt with an MPI error in versions 2.7.1 and 2.6)

I'm invoking gromacs via mpirun -np 4 gmx_mpi mdrun -v -deffnm topol -multidir run* -replex 100 -hrex -plumed plumed.dat (with an empty plumed.dat file).

Is this an issue you are aware of? Might you have any idea what causes it?

Many thanks,
Simon

simonlichtinger avatar Oct 04 '21 15:10 simonlichtinger

Hi, I am having the same problem (gromacs 2021.3, plumed 2.7.2).

If I force gromacs to never generate checkpoint files (by using -cpt -1 in mdrun), the jobs run to completion without problems. Of course, this "solution" is not feasible for MD runs that take longer than the maximum wall time of one's HPC cluster.
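For concreteness, here is a small sketch of how the workaround fits into the command line quoted in the original report. The helper `build_mdrun_command` and its parameter names are hypothetical and only assemble a string; the flags themselves are the ones mentioned in this thread, and the resulting command still has to be run through a shell so that `run*` is expanded.

```python
def build_mdrun_command(n_replicas=4, multidir_glob="run*", replex=100,
                        checkpoint_minutes=None):
    """Assemble the mdrun command line from this thread as a shell string.

    checkpoint_minutes -- value passed to -cpt; None leaves the flag out,
                          and -1 is the workaround reported above.
    """
    parts = [
        "mpirun", "-np", str(n_replicas),
        "gmx_mpi", "mdrun", "-v",
        "-deffnm", "topol",
        "-multidir", multidir_glob,
        "-replex", str(replex),
        "-hrex",
        "-plumed", "plumed.dat",
    ]
    if checkpoint_minutes is not None:
        # Only add -cpt when the caller explicitly asks for it.
        parts += ["-cpt", str(checkpoint_minutes)]
    return " ".join(parts)


if __name__ == "__main__":
    # Print the command with periodic checkpointing switched off.
    print(build_mdrun_command(checkpoint_minutes=-1))
```

As noted in the comment above, a run started this way cannot be continued from a checkpoint, so this is only an option for jobs that fit within a single wall-time allocation.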

MauriceKarrenbrock avatar Oct 12 '21 09:10 MauriceKarrenbrock

Hi, I have the same problem (version 2021.4-plumed-2.7.3, CUDA 11.2, mpich 3.4.3, GPU: Tesla V100). After writing the checkpoint file, it hangs. Once that happens, the generated cpt file has a name like xxxx_step154480.cpt, whereas the name given via -cpi is xxxx.cpt. I am unable to restart from the problematic cpt file. I'm not sure whether this information will help you debug.
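To see which replicas were left in this state after a hang, the replica directories can be scanned for the expected checkpoint file and for leftover files that follow the `_step<number>.cpt` naming pattern described above. This is a minimal sketch under assumptions: the function names are hypothetical, and `deffnm="topol"` is taken from the command in the original report.

```python
import re
from pathlib import Path

# Matches names such as "topol_step154480.cpt".
STEP_CPT = re.compile(r"^(?P<stem>.+)_step(?P<step>\d+)\.cpt$")


def scan_checkpoints(replica_dirs, deffnm="topol"):
    """Report, per replica directory, the state of its checkpoint files."""
    report = {}
    for d in map(Path, replica_dirs):
        expected = d / f"{deffnm}.cpt"
        # Leftover files whose names carry a step number.
        stray = sorted(p.name for p in d.glob("*.cpt")
                       if STEP_CPT.match(p.name))
        report[d.name] = {
            "has_expected": expected.is_file(),
            "mtime": expected.stat().st_mtime if expected.is_file() else None,
            "stray_step_files": stray,
        }
    return report


def replicas_needing_attention(report):
    """Names of replicas with a stray step file or no expected checkpoint."""
    return sorted(name for name, info in report.items()
                  if info["stray_step_files"] or not info["has_expected"])
```

Comparing the `mtime` values across replicas also shows which ones had already written the new checkpoint when the run stalled. The scan only reports file state; it does not make a problematic checkpoint usable for a restart.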

insukjoung avatar Jan 12 '22 07:01 insukjoung

Closed by #831

(still working on the 2020 patch)

GiovanniBussi avatar Sep 16 '22 13:09 GiovanniBussi