
Energy density t-moves related crash in DMC

Open prckent opened this issue 1 year ago • 6 comments

Describe the bug Attempts to run the example in https://github.com/QMCPACK/qmcpack/pull/5214 are unsuccessful. In a pure CPU build (GCC14, real, MPI), I get a reliable SEGV after a few blocks of DMC when the energy density estimator is enabled. I could not get runs without the energy density estimator to crash, including runs with no estimators. I also tried putting many small VMC sections ahead of the DMC section, but could not get a crash in VMC, only in DMC. Crashes were obtained with 16 MPI ranks x 1 thread each, 4 MPI ranks x 4 threads each, 1 MPI rank x 16 threads, and 1 MPI rank x 1 thread.
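For anyone repeating the sweep, the four rank/thread combinations above can be written down as launch lines. This is only a sketch: the `mpirun -np` launcher, the `OMP_NUM_THREADS` setting, and the input file name `dmc.in.xml` are assumptions for illustration, not the exact commands used on nitrogen2.

```python
# (MPI ranks, OpenMP threads per rank) combinations reported to crash.
CRASHING_CONFIGS = [(16, 1), (4, 4), (1, 16), (1, 1)]


def launch_command(ranks, threads, input_file="dmc.in.xml"):
    """Build a shell command line for one configuration.

    The launcher syntax and the input file name are hypothetical;
    adjust them to the local MPI installation and generated inputs.
    """
    return f"OMP_NUM_THREADS={threads} mpirun -np {ranks} qmcpack {input_file}"


if __name__ == "__main__":
    for ranks, threads in CRASHING_CONFIGS:
        print(launch_command(ranks, threads))
```

Every combination uses at most 16 cores, so the crash does not appear to depend on how the work is split between MPI and threads.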

To Reproduce Modify qmcpack/nexus/examples/qmcpack/rsqmc_misc/estimators/iron_ldaU_dmc.py to run calculations by setting generate_only = 0, then run it. This needs both QE and QMCPACK. Starting from scratch, the first crash takes O(1h). The actual DMC crash can be rigged to occur within minutes.

I can provide just the generated inputs including jastrow & orbital files if preferred.

Unhelpful error:

[nitrogen2:3382958] *** Process received signal ***
[nitrogen2:3382958] Signal: Segmentation fault (11)
[nitrogen2:3382958] Signal code: Address not mapped (1)
[nitrogen2:3382958] Failing at address: (nil)
[nitrogen2:3382958] [ 0] /lib64/libc.so.6(+0x3e730)[0x7f1c1be3e730]
[nitrogen2:3382958] [ 1] qmcpack[0x989b1f]
[nitrogen2:3382958] [ 2] qmcpack[0x87567b]
[nitrogen2:3382958] [ 3] qmcpack[0x802eab]
[nitrogen2:3382958] [ 4] qmcpack[0x6510ff]
[nitrogen2:3382958] [ 5] qmcpack[0x642637]
[nitrogen2:3382958] [ 6] qmcpack[0x63caa8]
[nitrogen2:3382958] [ 7] /home/pk7/apps/spack/opt/spack/linux-rhel9-zen3/gcc-11.5.0/gcc-14.2.0-5c6egxwthhh2tklbcegw5y7yjk2me35s/lib64/libgomp.so.1(GOMP_parallel+0x46)[0x7f1c1ef355e6]
[nitrogen2:3382958] [ 8] qmcpack[0x63e086]
[nitrogen2:3382958] [ 9] qmcpack[0x527cdb]
[nitrogen2:3382958] [10] qmcpack[0x52be66]
[nitrogen2:3382958] [11] qmcpack[0x52f920]
[nitrogen2:3382958] [12] qmcpack[0x4d8a93]
[nitrogen2:3382958] [13] /lib64/libc.so.6(+0x295d0)[0x7f1c1be295d0]
[nitrogen2:3382958] [14] /lib64/libc.so.6(__libc_start_main+0x80)[0x7f1c1be29680]
[nitrogen2:3382958] [15] qmcpack[0x51c625]
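The trace above only shows raw addresses for the qmcpack frames. One way to make it more useful is to collect those addresses and hand them to `addr2line`. The sketch below is a hypothetical helper, not part of QMCPACK. It assumes the executable is named `qmcpack`, was not built as a position-independent executable (so the printed addresses can be used directly), and contains debug information. Without debug information, `addr2line` reports `??` for file and line.

```python
import re
import shlex

# Matches frames such as "[nitrogen2:3382958] [ 1] qmcpack[0x989b1f]":
# group 1 is the module, group 2 is the absolute address.
FRAME_RE = re.compile(r"\]\s+\[\s*\d+\]\s+(\S+?)\[(0x[0-9a-fA-F]+)\]\s*$")


def frame_addresses(trace, binary="qmcpack"):
    """Return the addresses of frames belonging to `binary`, in stack order."""
    addresses = []
    for line in trace.splitlines():
        match = FRAME_RE.search(line)
        if match and match.group(1) == binary:
            addresses.append(match.group(2))
    return addresses


def addr2line_command(trace, binary="qmcpack"):
    """Build an addr2line command line that resolves the frames of `binary`.

    -e selects the executable, -f prints function names,
    -C demangles C++ symbols, -p prints one frame per line.
    """
    args = ["addr2line", "-e", binary, "-f", "-C", "-p"]
    return shlex.join(args + frame_addresses(trace, binary))
```

Frames from shared libraries such as libc and libgomp are skipped, because their addresses are relative to wherever the library was loaded and cannot be passed to `addr2line` as-is.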

Typical output:

 branching cutoff scheme = classic
  branch cutoff, max      = 5.0000e+01 7.5000e+01
  QMC Status (BranchMode) = 0000001101
===================================================================
--- Memory usage report : DMCBatched after initialLogEvaluation ---
===================================================================
Available memory on node 0, free + buffers :   79911 MiB
Memory footprint by rank 0 on node 0       :     627 MiB
===================================================================
Completed block    1 of 5 average 2.453 secs/block after 232.7 secs
Completed block    2 of 5 average 2.428 secs/block after 235.1 secs

Expected behavior No crash

System:

nitrogen2, nightly "gcc new mpi" configuration with GCC 14.2.0, OpenMPI etc.

prckent, Nov 27 '24 19:11

Quick follow-up: Interestingly, switching non-local moves from v3 (used in all the reported crashes) to 'no' made the crash go away. 'v0' restores the crash => there is an issue when t-moves are used with the energy density.
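For reference, the three settings compared above differ only in the value of the `nonlocalmoves` parameter in the DMC section of the input. The sketch below builds that parameter line for each value and pairs it with the outcome reported in this thread. The helper name and the outcome table are illustrative only.

```python
import xml.etree.ElementTree as ET

# Outcomes as reported in this issue, with the energy density estimator enabled.
OBSERVED = {"v3": "crash", "v0": "crash", "no": "no crash"}


def nonlocalmoves_parameter(value):
    """Return the <parameter> element text for a given t-move setting."""
    element = ET.Element("parameter", {"name": "nonlocalmoves"})
    element.text = value
    return ET.tostring(element, encoding="unicode")


if __name__ == "__main__":
    for value, outcome in OBSERVED.items():
        print(f"{nonlocalmoves_parameter(value)}  ->  {outcome}")
```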

prckent, Nov 27 '24 20:11

If this is not known to work in legacy and/or is not immediately needed, having the energy density work only for the locality approximation could be listed as a "known limitation", i.e. the current issue is only a bug in that we claim it is supported when it is not.

@PDoakORNL @jtkrogel What do we know of the status of energy density in legacy with different locality schemes, if anything, and what is needed in the immediate future?

prckent, Dec 02 '24 14:12

With a hand-built VMC-DMC input and offload on NVIDIA, I don't get this crash. I'm still looking at it.

PDoakORNL, Dec 02 '24 21:12

I suggest a regular CPU build.

prckent, Dec 03 '24 16:12

I can reproduce it with a CPU build. Looking at it in the debugger now.

PDoakORNL, Dec 03 '24 18:12

Working on the fix now. Updates to many QMCHamiltonian potentials, such as LocalECPotential and CoulombPotential, will be necessary.

PDoakORNL, Dec 04 '24 16:12