Continuing a DMC run results in `nan` local energy.
Describe the bug

I tried to continue a DMC run for this H2O example: https://github.com/QMCPACK/qmcpack/tree/develop/examples/molecules/H2O and ran into the following error:
```
Fatal Error. Aborting at WalkerControlBase::sortWalkers
```
The `H2O.s003.dmc.dat` file contains the following:

```
# Index LocalEnergy Variance Weight NumOfWalkers AvgSentWalkers TrialEnergy DiffEff LivingFraction
0 -1.7305745032e+01 2.7738748341e-01 -5.7724360458e+12 2.6880000000e+03 0.0000000000e+00 -1.7262185048e+01 9.9455568438e-01 0.0000000000e+00
1 nan nan nan 2.6880000000e+03 0.0000000000e+00 nan 9.9507990827e-01 1.0000000000e+00
```
All the files from the run directory, including inputs and outputs, are uploaded here: https://github.com/krongch2/h2o_qmc/tree/main/archive_cpu/2_dmc
To Reproduce
- First, I edited the `simple-H2O.xml` file so that the field `checkpoint="-1"` becomes `checkpoint="0"`, to ensure that the `config.h5` file is generated after the first DMC run (a sketch of the edited block follows these steps).
- Then, I ran the input file `simple-H2O.xml` by submitting the following script with `bsub dmc.bsub.in`:
```bash
#!/bin/bash
#BSUB -P mat221
#BSUB -J dmc
#BSUB -o dmc.out
#BSUB -e dmc.err
#BSUB -W 00:20
#BSUB -nnodes 1
#BSUB -alloc_flags "smt1"
module load gcc/9.3.0 spectrum-mpi cuda essl netlib-lapack hdf5/1.10.7 fftw; module use /gpfs/alpine/mat151/world-shared/opt/modules; module load llvm/release-15.0.0-cuda11.0
NNODES=$(((LSB_DJOB_NUMPROC-1)/42))
RANKS_PER_NODE=6
RS_PER_NODE=6
exe_path=/gpfs/alpine/mat151/world-shared/opt/qmcpack/release-3.16.0/build_summit_Clang_offload_cuda_real/bin
export OMP_NUM_THREADS=7
jsrun -n $NNODES -a $RANKS_PER_NODE -c $((RANKS_PER_NODE*OMP_NUM_THREADS)) -g 6 -r 1 -d packed -b packed:$OMP_NUM_THREADS --smpiargs="-disable_gpu_hooks" $exe_path/qmcpack --enable-timers=fine simple-H2O.xml
```
The QMCPACK executable is located at `/gpfs/alpine/mat151/world-shared/opt/qmcpack/release-3.16.0/build_summit_Clang_offload_cuda_real/bin/qmcpack` on Summit. This step autogenerates `H2O.s002.cont.xml` for the subsequent run.
- I removed the VMC block from the `H2O.s002.cont.xml` file.
- I ran the input file `H2O.s002.cont.xml` by submitting the following script with `bsub dmc.bsub.cont.in`:
```bash
#!/bin/bash
#BSUB -P mat221
#BSUB -J dmc.cont
#BSUB -o dmc.cont.out
#BSUB -e dmc.cont.err
#BSUB -W 00:20
#BSUB -nnodes 1
#BSUB -alloc_flags "smt1"
module load gcc/9.3.0 spectrum-mpi cuda essl netlib-lapack hdf5/1.10.7 fftw; module use /gpfs/alpine/mat151/world-shared/opt/modules; module load llvm/release-15.0.0-cuda11.0
NNODES=$(((LSB_DJOB_NUMPROC-1)/42))
RANKS_PER_NODE=6
RS_PER_NODE=6
exe_path=/gpfs/alpine/mat151/world-shared/opt/qmcpack/release-3.16.0/build_summit_Clang_offload_cuda_real/bin
export OMP_NUM_THREADS=7
jsrun -n $NNODES -a $RANKS_PER_NODE -c $((RANKS_PER_NODE*OMP_NUM_THREADS)) -g 6 -r 1 -d packed -b packed:$OMP_NUM_THREADS --smpiargs="-disable_gpu_hooks" $exe_path/qmcpack --enable-timers=fine H2O.s002.cont.xml
```
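For reference, here is a minimal sketch of the DMC block after the edit in step 1. Only the `checkpoint` attribute is the point; the `method`/`move` attributes and parameter values below are hypothetical placeholders, not copied from the actual example input:

```xml
<!-- checkpoint="0" dumps the walker configurations (config.h5) at the end of the
     section, whereas the default checkpoint="-1" disables checkpointing -->
<qmc method="dmc" move="pbyp" checkpoint="0">
  <parameter name="timestep">0.01</parameter>  <!-- hypothetical values -->
  <parameter name="steps">50</parameter>
  <parameter name="blocks">200</parameter>
</qmc>
```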
System:
- Summit
- QMCPACK 3.16. The same issue occurs for both the `offload_cuda_real` and `Clang_cpu_cplx` builds.
- `module load gcc/9.3.0 spectrum-mpi cuda essl netlib-lapack hdf5/1.10.7 fftw; module use /gpfs/alpine/mat151/world-shared/opt/modules; module load llvm/release-15.0.0-cuda11.0`
When the run is continued directly into DMC, `SimpleFixedNodeBranch::checkParameters` reports

```
Average Energy of a population = nan
Energy Variance = nan
```

and this caused the crazy weight/population.
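For context, this is standard DMC behavior rather than anything QMCPACK-specific: each walker's branching weight is roughly `w = exp(-tau * ((E_L(R) + E_L(R'))/2 - E_T))`, so a `nan` trial energy `E_T` turns every walker weight into `nan`, and the population control in `WalkerControlBase::sortWalkers` then has nothing sensible to work with, hence the abort.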
There is no such problem in the batched drivers. I don't think it is wise to invest time in the legacy drivers.
A quick workaround is setting `energyUpdateInterval=0` in the DMC section of the restart run.
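For reference, a sketch of where that parameter would go in the continuation input; the surrounding attributes are assumptions carried over from the example, and only the `energyUpdateInterval` line is the workaround itself:

```xml
<qmc method="dmc" move="pbyp" checkpoint="0">
  <!-- workaround: keep the trial energy fixed instead of updating it during the restart -->
  <parameter name="energyUpdateInterval">0</parameter>
  <!-- remaining parameters as in the autogenerated H2O.s002.cont.xml -->
</qmc>
```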
But still, it is recommended to move to the batched drivers. There are recent fixes (#4482, #4484) in the batched drivers after the v3.16 release.
I put up `/gpfs/alpine/mat151/world-shared/opt/qmcpack/develop-20230227` for you to benefit from the recent fixes.
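For anyone following along: one documented way to opt into the batched drivers is the `driver_version` parameter in the project block. A minimal sketch, with hypothetical `id`/`series` values; please check the manual for the exact syntax in the build you use:

```xml
<project id="H2O" series="3">
  <!-- select the batched driver implementation instead of the legacy one -->
  <parameter name="driver_version">batch</parameter>
</project>
```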
I conclude from this that the current restart test is not good enough; DMC restarts are required for science production. I wonder when this bug got in, or what is different in this case?
I have switched to the batched driver, and the restart seems to be working now. Thank you for the suggestion.