MPC causes segfault on Frontier
Describe the bug
Use of MPC is unstable on Frontier (CPU code). Several FeCl2 runs have segfaulted, and one run has produced NaN values in scalar.dat.
To Reproduce
Build details:
Git branch: develop
Last git commit: 283f2438770bdfb592d161d287771764cbf6f96c
Last git commit date: Sat Aug 26 09:36:21 2023 -0500
Last git commit subject: Merge pull request #4715 from QMCPACK/prckent-patch-1
Currently Loaded Modules:
1) craype-x86-trento
2) libfabric/1.15.2.0
3) craype-network-ofi
4) perftools-base/22.12.0
5) xpmem/2.6.2-2.5_2.22__gd067c3f.shasta
6) cray-pmi/6.1.8
7) cce/15.0.0
8) craype/2.7.19
9) cray-dsmml/0.2.2
10) cray-mpich/8.1.23
11) cray-libsci/22.12.1.1
12) PrgEnv-cray/8.3.3
13) darshan-runtime/3.4.0
14) hsi/default
15) DefApps/default
16) emacs/28.1
17) cmake/3.23.2
18) openblas/0.3.17
19) cray-fftw/3.3.10.3
20) hdf5/1.14.0
21) boost/1.79.0
22) rocm/5.5.1
23) ninja/1.10.2
Executable:
/lustre/orion/world-shared/mat151/pk7/try_frontier/build_frontier_cpu_real_MP/bin/qmcpack
Problem cases (segfault):
FeCl2-tile-3-hyb-0-spo-0-est-0-walk-180/qmc.out:srun: error: frontier04992: task 7: Segmentation fault (core dumped)
FeCl2-tile-3-hyb-0-spo-0-est-0-walk-1024/qmc.out:srun: error: frontier08960: task 4: Segmentation fault (core dumped)
FeCl2-tile-3-hyb-0-spo-0-est-0-walk-1680/qmc.out:srun: error: frontier10366: task 6: Segmentation fault (core dumped)
FeCl2-tile-3-hyb-0-spo-0-est-0-walk-2400/qmc.out:srun: error: frontier00384: task 5: Segmentation fault (core dumped)
FeCl2-tile-3-hyb-0-spo-0-est-0-walk-3360/qmc.out:srun: error: frontier08319: task 3: Segmentation fault (core dumped)
FeCl2-tile-3-hyb-0-spo-0-est-0-walk-3840/qmc.out:srun: error: frontier00208: task 6: Segmentation fault (core dumped)
FeCl2-tile-4-hyb-0-spo-0-est-0-walk-720/qmc.out:srun: error: frontier00201: task 0: Segmentation fault (core dumped)
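The segfault lines above follow a regular pattern, so the affected run, node, and task can be tabulated with a short script. This is a minimal sketch, assuming the lines come from grepping the qmc.out files for "Segmentation fault"; `parse_segfaults` is a hypothetical helper name, not part of QMCPACK.

```python
import re

# Matches lines of the form shown above:
# <run dir>/qmc.out:srun: error: <node>: task <n>: Segmentation fault (core dumped)
SEGFAULT_RE = re.compile(
    r"^(?P<run>[^/]+)/qmc\.out:srun: error: (?P<node>\S+): "
    r"task (?P<task>\d+): Segmentation fault"
)


def parse_segfaults(lines):
    """Return a list of (run, node, task) tuples, one per segfault line."""
    hits = []
    for line in lines:
        match = SEGFAULT_RE.match(line.strip())
        if match:
            hits.append(
                (match.group("run"), match.group("node"), int(match.group("task")))
            )
    return hits


# Two of the lines reported above, used as sample input.
sample = [
    "FeCl2-tile-3-hyb-0-spo-0-est-0-walk-180/qmc.out:srun: error: "
    "frontier04992: task 7: Segmentation fault (core dumped)",
    "FeCl2-tile-4-hyb-0-spo-0-est-0-walk-720/qmc.out:srun: error: "
    "frontier00201: task 0: Segmentation fault (core dumped)",
]

for run, node, task in parse_segfaults(sample):
    print(f"{run}: node={node} task={task}")
```

Tabulating node and task this way makes it easy to see that the failures are spread over different nodes and MPI tasks rather than tied to one of them.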
Problem case (NaN in scalar.dat):
FeCl2-tile-2-hyb-0-spo-0-est-0-walk-1680
Location on Frontier:
/lustre/orion/mat151/proj-shared/ecp_vdw_test_runs/frontier_files/test_runs_jk_cpu/runs_2023-09-11-09-15-23
To reproduce, copy the relevant files into a new directory and resubmit (sbatch qmc.sbatch.in).
Expected behavior
No segfaults or NaNs.
The NaN appears in scalar.dat, but the NaN detector in the wavefunction components was not tripped, so the problem is most likely confined to the MPC computation.
runs_2023-09-11-09-15-23]$ grep -n -i NaN */*.scalar.dat
FeCl2-tile-2-hyb-0-spo-0-est-0-walk-1680/vmc.s000.scalar.dat:3: 1 -1.2273884599e+03 1.5065084464e+06 -1.8482543823e+03 6.2086592258e+02 -1.2991870771e+04 2.0324763583e+02 5.9220107535e+03 5.0183579990e+03 -nan 4.0320000000e+04 8.9543626972e+01 6.6160342262e-01
FeCl2-tile-2-hyb-0-spo-0-est-0-walk-1680/vmc.s000.scalar.dat:4: 2 -1.2275045133e+03 1.5067927820e+06 -1.8592057578e+03 6.3170124457e+02 -1.2992550874e+04 2.0367290127e+02 5.9113142159e+03 5.0183579990e+03 -nan 4.0320000000e+04 8.9694526033e+01 6.5901697875e-01
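The grep output shows that a NaN is present but not which observable it belongs to. The sketch below maps NaN entries back to column names. It assumes the first line of the scalar.dat file is a `#`-prefixed header with one name per data column; `find_nan_columns` and the column names in the sample are hypothetical and chosen only for illustration.

```python
import math


def find_nan_columns(lines):
    """Map each column name to the row indices (first field) where it is NaN.

    Assumption: the first non-empty line is a '#'-prefixed header naming
    every whitespace-separated data column.
    """
    header = None
    bad = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith("#"):
            # Handle both "# index A B" and "#index A B" header styles.
            if fields[0] == "#":
                header = fields[1:]
            else:
                header = [fields[0][1:]] + fields[1:]
            continue
        for pos, token in enumerate(fields):
            # float() accepts "nan" and "-nan", as printed in the data file.
            if math.isnan(float(token)):
                if header and pos < len(header):
                    name = header[pos]
                else:
                    name = f"column_{pos}"
                bad.setdefault(name, []).append(fields[0])
    return bad


# Small synthetic sample; the column names are placeholders.
sample = [
    "# index ColumnA ColumnB",
    "0 1.0 2.0",
    "1 -nan 3.0",
    "2 -nan 4.0",
]
print(find_nan_columns(sample))
```

On a real file the function can be called as `find_nan_columns(open(path))`. If the only column reported is the MPC one, that would support the reading that the wavefunction itself is healthy.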
The segfaults are quasi-reproducible when run with the same seed (single-node runs in all cases). The reproduction rate is better than 50%.
Below, * marks segfaults that appear in only one of the two sets of runs. All others reproduce. The behavior is likely non-deterministic, so any proposed fix should be rerun several times for verification.
Original set:
runs_2023-09-11-09-15-23
FeCl2-tile-3-hyb-0-spo-0-est-0-walk-180
FeCl2-tile-3-hyb-0-spo-0-est-0-walk-1024
FeCl2-tile-3-hyb-0-spo-0-est-0-walk-1680
*FeCl2-tile-3-hyb-0-spo-0-est-0-walk-2400
FeCl2-tile-3-hyb-0-spo-0-est-0-walk-3360
FeCl2-tile-3-hyb-0-spo-0-est-0-walk-3840
FeCl2-tile-4-hyb-0-spo-0-est-0-walk-720
Reruns:
runs_2023-09-11-12-31-45
FeCl2-tile-3-hyb-0-spo-0-est-0-walk-180
FeCl2-tile-3-hyb-0-spo-0-est-0-walk-1024
FeCl2-tile-3-hyb-0-spo-0-est-0-walk-1680
*FeCl2-tile-3-hyb-0-spo-0-est-0-walk-2880
FeCl2-tile-3-hyb-0-spo-0-est-0-walk-3360
FeCl2-tile-3-hyb-0-spo-0-est-0-walk-3840
*FeCl2-tile-4-hyb-0-spo-0-est-0-walk-300
*FeCl2-tile-4-hyb-0-spo-0-est-0-walk-512
FeCl2-tile-4-hyb-0-spo-0-est-0-walk-720
Also, I observed no NaNs in scalar.dat for the reruns.
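The asterisks in the two lists above can be cross-checked with a simple set comparison, which is also a convenient way to judge a candidate fix over several reruns. This is a minimal sketch using the run names listed above; `classify_failures` is a hypothetical helper name.

```python
def classify_failures(original, rerun):
    """Split failing runs into those seen in both sets and those seen in one."""
    original, rerun = set(original), set(rerun)
    return {
        "reproduced": sorted(original & rerun),
        "only_original": sorted(original - rerun),
        "only_rerun": sorted(rerun - original),
    }


PREFIX = "FeCl2-tile-{tile}-hyb-0-spo-0-est-0-walk-{walk}"

# Segfaulting runs in runs_2023-09-11-09-15-23, as (tile, walkers).
original = [
    PREFIX.format(tile=t, walk=w)
    for t, w in [(3, 180), (3, 1024), (3, 1680), (3, 2400),
                 (3, 3360), (3, 3840), (4, 720)]
]

# Segfaulting runs in runs_2023-09-11-12-31-45.
rerun = [
    PREFIX.format(tile=t, walk=w)
    for t, w in [(3, 180), (3, 1024), (3, 1680), (3, 2880), (3, 3360),
                 (3, 3840), (4, 300), (4, 512), (4, 720)]
]

result = classify_failures(original, rerun)
for key, runs in result.items():
    print(f"{key} ({len(runs)}):")
    for run in runs:
        print(f"  {run}")
```

With the lists above, six runs segfault in both sets, one only in the original set, and three only in the reruns, which matches the asterisks.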