Segmentation Fault during OPTJ12 (SOC, GPU, Batched) on Perlmutter
The Issue
In the standard DMC+SOC workflow (QE SCF > QE NSCF > convertpw4qmc > J2 opt > J3 opt > DMC), the optJ12 step appears to initialize properly: it produces the *s000.scalar.dat file and prints the following standard, benign error:
QMCPACK ERROR Primitive cell ion 0 vs supercell ion 0 atomic number not matching: 0 vs 24
QMCPACK ERROR Primitive cell ion 1 vs supercell ion 2 atomic number not matching: 0 vs 52
QMCPACK ERROR Primitive cell ion 2 vs supercell ion 3 atomic number not matching: 0 vs 52
However, the run then terminates early with a segmentation fault.
To Reproduce
Machine: Perlmutter, GPUs
QMCPACK v4.1.9
The attached sbatch file lists the modules and executables.
Expected behavior
12 optimization cycles should be completed. The run stopped during the first cycle.
System: monolayer CrTe2 supercell with SOC
Additional context
These files were generated with Nexus and submitted by hand. The eshdf.h5 file is not attached because of its size (61 GB). The GPU executable qmcpack_complex from this build was previously tested and ran properly on an Fe atom on Perlmutter GPUs.
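To check how far an optimization run got before it crashed, one option is to list which series indices left a scalar.dat file behind. The snippet below is only a rough sketch: the function names and the `expected_cycles` parameter are hypothetical, and it assumes the output files follow the `<prefix>.sNNN.scalar.dat` naming seen above. A file only shows that a series started, not that it finished.

```python
import re
from pathlib import Path

# Assumed naming pattern, e.g. "opt.s000.scalar.dat" (prefix is arbitrary).
SERIES_RE = re.compile(r"\.s(\d{3})\.scalar\.dat$")


def started_series(run_dir):
    """Return sorted series indices that have a scalar.dat file in run_dir.

    A file being present only means the series began writing output;
    it does not prove the series completed.
    """
    indices = set()
    for path in Path(run_dir).iterdir():
        match = SERIES_RE.search(path.name)
        if match:
            indices.add(int(match.group(1)))
    return sorted(indices)


def missing_series(run_dir, expected_cycles=12):
    """Return the series indices in range(expected_cycles) with no scalar.dat."""
    found = set(started_series(run_dir))
    return [i for i in range(expected_cycles) if i not in found]
```

For the run described here, `started_series` would be expected to report only series 0, with the remaining eleven listed by `missing_series`.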
Hmm. Was your successful Fe atom test prepared the same way?
[The atomic numbers did not make it through the conversion, but the failing test is meant to be forgiving in this case because convertpw4qmc doesn't have access to them. (The proper solution is for pw2qmcpack to convert spinors, but obviously it doesn't do that yet.)]
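One way to confirm that every reported mismatch is of this forgiving kind is to parse the error lines and check that the primitive-cell atomic number is always 0. This is a minimal sketch, not part of QMCPACK: the helper names are hypothetical, the regular expression is written against the message text quoted in this issue, and reading 0 as "atomic number not recorded" is an assumption based on the explanation above.

```python
import re

# Pattern written against the error text quoted in this issue.
MISMATCH_RE = re.compile(
    r"Primitive cell ion (\d+) vs supercell ion (\d+) "
    r"atomic number not matching: (\d+) vs (\d+)"
)


def parse_mismatches(log_text):
    """Return (prim_ion, super_ion, prim_z, super_z) tuples found in log_text."""
    return [tuple(int(g) for g in m.groups()) for m in MISMATCH_RE.finditer(log_text)]


def only_missing_primitive_z(mismatches):
    """True if there is at least one mismatch and all have primitive Z == 0.

    Assumption: Z == 0 means the atomic number was never written,
    rather than a genuine disagreement between the two cells.
    """
    return bool(mismatches) and all(prim_z == 0 for _, _, prim_z, _ in mismatches)


log = (
    "QMCPACK ERROR Primitive cell ion 0 vs supercell ion 0 atomic number not matching: 0 vs 24\n"
    "QMCPACK ERROR Primitive cell ion 1 vs supercell ion 2 atomic number not matching: 0 vs 52\n"
    "QMCPACK ERROR Primitive cell ion 2 vs supercell ion 3 atomic number not matching: 0 vs 52\n"
)
mismatches = parse_mismatches(log)
```

If `only_missing_primitive_z(mismatches)` returned False for a log, at least one line would show two nonzero atomic numbers that disagree, which would deserve a closer look.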
The Fe atom test used qmcpack_complex only for the DMC run. I reused the J123 optimization from a previous CPU run.
dmc_gcta_test1.zip dmc_gcta_test2.zip
I encountered another segmentation fault while running DMC+SOC on this system.