
Segmentation Fault during OPTJ12 (SOC, GPU, Batched) on Perlmutter

Open annette-lopez opened this issue 5 months ago • 3 comments

The Issue
In following the standard DMC+SOC workflow (QE SCF > QE NSCF > convertpw4qmc > J2 opt > J3 opt > DMC), the optJ12 step appears to initialize properly: it produces the *s000.scalar.dat file and prints the following errors, which are standard and not a problem in this case:

QMCPACK ERROR Primitive cell ion 0 vs supercell ion 0 atomic number not matching: 0 vs 24
QMCPACK ERROR Primitive cell ion 1 vs supercell ion 2 atomic number not matching: 0 vs 52
QMCPACK ERROR Primitive cell ion 2 vs supercell ion 3 atomic number not matching: 0 vs 52

However, the run then terminates early with a segmentation fault.

optJ12_segfault.zip

To Reproduce
Machine: Perlmutter, GPUs
QMCPACK v4.1.9
The attached sbatch file lists the modules and executables.

Expected behavior
12 optimization cycles should be completed. The run stopped during the first cycle.
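To see how far the optimization got before the crash, one option is to count the data rows in each series file. The sketch below assumes that every optimization cycle writes its own `*.sNNN.scalar.dat` file (as the `*s000.scalar.dat` above suggests) and that header lines start with `#`. The function names `series_progress` and `unstarted_series` are hypothetical helpers, not part of QMCPACK or Nexus.

```python
import re
from pathlib import Path

# Matches series output names such as "optJ12.s003.scalar.dat" (assumed naming).
SERIES_RE = re.compile(r"\.s(\d{3})\.scalar\.dat$")


def series_progress(run_dir):
    """Map each series index found in run_dir to its number of data rows."""
    progress = {}
    for path in Path(run_dir).glob("*.scalar.dat"):
        match = SERIES_RE.search(path.name)
        if match is None:
            continue
        with path.open() as handle:
            # Count non-empty lines that are not '#' header lines.
            rows = sum(
                1 for line in handle
                if line.strip() and not line.lstrip().startswith("#")
            )
        progress[int(match.group(1))] = rows
    return dict(sorted(progress.items()))


def unstarted_series(run_dir, expected=12):
    """Return the series indices below `expected` that have no file at all."""
    started = series_progress(run_dir)
    return [index for index in range(expected) if index not in started]
```

A file with zero data rows only shows that the series started; it does not show that the cycle finished. For the run described here, one would expect a single entry for series 0 and the remaining eleven indices reported as unstarted.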

System: monolayer CrTe2 supercell with SOC

Additional context
Notes: these files were generated with Nexus and submitted by hand. eshdf.h5 was not attached due to its size (61 GB). The GPU executable qmcpack_complex for this build was previously tested and ran properly on an Fe atom with Perlmutter GPUs.

annette-lopez, Jul 11 '25 15:07

Hmm. Was your successful Fe atom test prepared the same way?

[The atomic numbers did not make it, but the failing test is meant to be forgiving in this case because convertpw4qmc doesn't have access to them. (The proper solution is for pw2qmcpack to convert spinors, but obviously it doesn't do that yet.)]
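One way to confirm that the logged errors are all of this benign kind is to parse them and check that the primitive-cell atomic number is 0 in every case. This is a minimal sketch: the message format is taken from the log quoted in the issue, reading 0 as "atomic number not written" is an assumption based on the comment above, and `parse_mismatches` and `only_missing_primitive_z` are hypothetical helper names.

```python
import re

# Message format copied from the errors quoted in this issue.
MISMATCH_RE = re.compile(
    r"Primitive cell ion (\d+) vs supercell ion (\d+) "
    r"atomic number not matching: (\d+) vs (\d+)"
)


def parse_mismatches(log_text):
    """Return (prim_ion, super_ion, prim_z, super_z) for each mismatch line."""
    return [
        tuple(int(group) for group in match.groups())
        for match in MISMATCH_RE.finditer(log_text)
    ]


def only_missing_primitive_z(mismatches):
    """True if there are mismatches and every one has primitive-cell Z == 0."""
    return bool(mismatches) and all(
        prim_z == 0 for _, _, prim_z, _ in mismatches
    )


sample_log = (
    "QMCPACK ERROR Primitive cell ion 0 vs supercell ion 0 "
    "atomic number not matching: 0 vs 24\n"
    "QMCPACK ERROR Primitive cell ion 1 vs supercell ion 2 "
    "atomic number not matching: 0 vs 52\n"
    "QMCPACK ERROR Primitive cell ion 2 vs supercell ion 3 "
    "atomic number not matching: 0 vs 52\n"
)

mismatches = parse_mismatches(sample_log)
print(mismatches)
print(only_missing_primitive_z(mismatches))
```

The supercell values 24 and 52 are the atomic numbers of Cr and Te, which fits the CrTe2 system. A mismatch with a nonzero primitive-cell value would point to a different problem and would be worth reporting separately.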

prckent, Jul 11 '25 17:07

The Fe atom test only used qmcpack_complex to run the DMC. I had reused optimizations for J123 from a previous CPU run.

annette-lopez, Jul 11 '25 17:07

dmc_gcta_test1.zip dmc_gcta_test2.zip

I encountered another segmentation fault running DMC+SOC on this system.

annette-lopez, Jul 19 '25 12:07