
Continuing a DMC run results in `nan` local energy.

Open krongch2 opened this issue 2 years ago • 4 comments


Describe the bug

I tried to continue a DMC run for this H2O example: https://github.com/QMCPACK/qmcpack/tree/develop/examples/molecules/H2O and ran into the following error:

Fatal Error. Aborting at WalkerControlBase::sortWalkers

The H2O.s003.dmc.dat file contains the following:

# Index          LocalEnergy            Variance              Weight        NumOfWalkers      AvgSentWalkers         TrialEnergy             DiffEff      LivingFraction
         0   -1.7305745032e+01    2.7738748341e-01   -5.7724360458e+12    2.6880000000e+03    0.0000000000e+00   -1.7262185048e+01    9.9455568438e-01    0.0000000000e+00
         1                 nan                 nan                 nan    2.6880000000e+03    0.0000000000e+00                 nan    9.9507990827e-01    1.0000000000e+00

All the files including inputs and outputs within the directory are uploaded here: https://github.com/krongch2/h2o_qmc/tree/main/archive_cpu/2_dmc
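A quick way to spot the corrupted step without opening the file by hand is to scan the scalar .dat file for nan rows. This is a sketch: the heredoc below creates a hypothetical stand-in for the real H2O.s003.dmc.dat quoted above, with only a few of its columns.

```shell
# Create a stand-in for the .dmc.dat excerpt quoted above (hypothetical file;
# the real file has more columns, but LocalEnergy is column 2 in both).
cat > example.dmc.dat <<'EOF'
# Index LocalEnergy Variance Weight
0 -1.7305745032e+01 2.7738748341e-01 -5.7724360458e+12
1 nan nan nan
EOF

# Print any data rows (skipping the '#' header) whose LocalEnergy is nan.
awk '!/^#/ && $2 == "nan" { print "bad step:", $1 }' example.dmc.dat
# prints: bad step: 1
```

Note that in the real file the Weight at step 0 is already a huge negative number, so the problem is visible even before the nan appears.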

To Reproduce

  1. First, I edited the simple-H2O.xml file, changing the checkpoint="-1" attribute to checkpoint="0" so that the config.h5 file is generated after the first DMC run.
  2. Then, I ran the input file simple-H2O.xml by submitting the following script with bsub dmc.bsub.in.
#!/bin/bash
#BSUB -P mat221
#BSUB -J dmc
#BSUB -o dmc.out
#BSUB -e dmc.err
#BSUB -W 00:20
#BSUB -nnodes 1
#BSUB -alloc_flags "smt1"

module load gcc/9.3.0 spectrum-mpi cuda essl netlib-lapack hdf5/1.10.7 fftw; module use /gpfs/alpine/mat151/world-shared/opt/modules; module load llvm/release-15.0.0-cuda11.0

NNODES=$(((LSB_DJOB_NUMPROC-1)/42))
RANKS_PER_NODE=6
RS_PER_NODE=6
exe_path=/gpfs/alpine/mat151/world-shared/opt/qmcpack/release-3.16.0/build_summit_Clang_offload_cuda_real/bin

export OMP_NUM_THREADS=7
jsrun -n $NNODES -a $RANKS_PER_NODE -c $((RANKS_PER_NODE*OMP_NUM_THREADS)) -g 6 -r 1 -d packed -b packed:$OMP_NUM_THREADS --smpiargs="-disable_gpu_hooks" $exe_path/qmcpack --enable-timers=fine simple-H2O.xml 
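As a sanity check on the resource arithmetic in the script above: 42 is the number of physical cores per Summit node under smt1, and 6 ranks times 7 OpenMP threads fills exactly one node. The LSB_DJOB_NUMPROC value below is a hypothetical example for a one-node allocation, not a value taken from the job.

```shell
# Hypothetical: a 1-node Summit allocation reporting 42*1+1 slots.
LSB_DJOB_NUMPROC=43
NNODES=$(((LSB_DJOB_NUMPROC-1)/42))
RANKS_PER_NODE=6
OMP_NUM_THREADS=7
# 6 ranks x 7 threads = 42 cores, i.e. one full smt1 node.
echo "nodes=$NNODES cores_per_rank_set=$((RANKS_PER_NODE*OMP_NUM_THREADS))"
# prints: nodes=1 cores_per_rank_set=42
```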

The QMCPACK executable is located in /gpfs/alpine/mat151/world-shared/opt/qmcpack/release-3.16.0/build_summit_Clang_offload_cuda_real/bin/qmcpack on Summit. This step autogenerates H2O.s002.cont.xml for the subsequent run.
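For reference, the checkpoint edit in step 1 amounts to changing one attribute on the DMC driver block. This is a sketch: the move attribute and the elided parameters are illustrative, only the checkpoint values come from the steps above.

```xml
<!-- before: checkpointing disabled -->
<qmc method="dmc" move="pbyp" checkpoint="-1">
  ...
</qmc>

<!-- after: config.h5 is written, so the run can be continued -->
<qmc method="dmc" move="pbyp" checkpoint="0">
  ...
</qmc>
```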

  3. I removed the VMC block from the H2O.s002.cont.xml file.

  4. I ran the input file H2O.s002.cont.xml by submitting the following script with bsub dmc.bsub.cont.in.

#!/bin/bash
#BSUB -P mat221
#BSUB -J dmc.cont
#BSUB -o dmc.cont.out
#BSUB -e dmc.cont.err
#BSUB -W 00:20
#BSUB -nnodes 1
#BSUB -alloc_flags "smt1"

module load gcc/9.3.0 spectrum-mpi cuda essl netlib-lapack hdf5/1.10.7 fftw; module use /gpfs/alpine/mat151/world-shared/opt/modules; module load llvm/release-15.0.0-cuda11.0

NNODES=$(((LSB_DJOB_NUMPROC-1)/42))
RANKS_PER_NODE=6
RS_PER_NODE=6
exe_path=/gpfs/alpine/mat151/world-shared/opt/qmcpack/release-3.16.0/build_summit_Clang_offload_cuda_real/bin

export OMP_NUM_THREADS=7
jsrun -n $NNODES -a $RANKS_PER_NODE -c $((RANKS_PER_NODE*OMP_NUM_THREADS)) -g 6 -r 1 -d packed -b packed:$OMP_NUM_THREADS --smpiargs="-disable_gpu_hooks" $exe_path/qmcpack --enable-timers=fine H2O.s002.cont.xml

System:

  • Summit
  • QMCPACK 3.16. The same issue occurs with both the offload_cuda_real and Clang_cpu_cplx builds.
  • module load gcc/9.3.0 spectrum-mpi cuda essl netlib-lapack hdf5/1.10.7 fftw; module use /gpfs/alpine/mat151/world-shared/opt/modules; module load llvm/release-15.0.0-cuda11.0

krongch2 avatar Feb 28 '23 19:02 krongch2

When the run is continued directly into DMC,

SimpleFixedNodeBranch::checkParameters 
  Average Energy of a population  = nan
  Energy Variance = nan

This causes a runaway weight/population.

There is no such problem in the batched drivers. I don't think it is wise to invest time in the legacy drivers.
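To my knowledge, switching to the batched drivers can be done per input file via the driver_version parameter in the project block; a sketch (the id and series values are illustrative, and this should be checked against the QMCPACK manual for your version):

```xml
<project id="H2O" series="2">
  <parameters>
    <!-- select the batched drivers instead of the legacy ones -->
    <parameter name="driver_version">batch</parameter>
  </parameters>
</project>
```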

ye-luo avatar Feb 28 '23 20:02 ye-luo

A quick workaround is setting energyUpdateInterval=0 in the DMC section of the restart run. Still, moving to the batched drivers is recommended: there are recent fixes (#4482, #4484) in the batched drivers post the v3.16 release. I put up /gpfs/alpine/mat151/world-shared/opt/qmcpack/develop-20230227 so you can benefit from the recent fixes.
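In input-file terms, the workaround is one extra parameter in the restart input's DMC block. A sketch, assuming the usual parameter-element syntax; only energyUpdateInterval=0 comes from the comment above, the surrounding attributes are illustrative:

```xml
<qmc method="dmc" move="pbyp" checkpoint="0">
  <!-- workaround for the nan trial energy on restart -->
  <parameter name="energyUpdateInterval">0</parameter>
  ...
</qmc>
```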

ye-luo avatar Feb 28 '23 22:02 ye-luo

I conclude from this that the current restart test is not good enough. DMC restarts are required for production science. I wonder when this regression got in, or what is different in this case.

prckent avatar Mar 01 '23 15:03 prckent

I have switched to the batched driver, and restart seems to be working now. Thank you for the suggestion.

krongch2 avatar Mar 02 '23 17:03 krongch2