NaN Values in Gradients Abort Calculation When Using Mixed Precision with GPU Offload
Describe the bug
When running a periodic calculation with 9 twists, 724 electrons, and 44 atoms using the mixed precision version of QMCPACK with GPU offload, the calculation aborted with the following error:
NaNguard::checkOneParticleGradients error message: TWF::calcRatioGrad at particle 687
grads[0] = (-nan,0.0418255)
grads[1] = (-nan,-0.0806002)
grads[2] = (-nan,0.0412396)
Unexpected exception thrown in threaded section
Fatal Error. Aborting at Unhandled Exception
This issue appears to be related to NaN values in the gradients of the wave function for a specific particle.
The same calculation with full precision ran smoothly without any problems.
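For reference, here is a minimal sketch of the kind of check that flags this condition, assuming a three-component complex gradient per particle. The names `has_nan`, `Grad3`, and `check_gradient` are hypothetical and this is not QMCPACK's actual `NaNguard` code; it only illustrates that a NaN in the real part alone, as in the log above, is enough to trip such a guard.

```cpp
#include <array>
#include <cmath>
#include <complex>
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper (not QMCPACK code): true if either component is NaN.
template <typename T>
bool has_nan(const std::complex<T>& z)
{
  return std::isnan(z.real()) || std::isnan(z.imag());
}

// Assumed layout: one complex value per Cartesian direction.
template <typename T>
using Grad3 = std::array<std::complex<T>, 3>;

// Returns an empty string when the gradient is clean; otherwise returns a
// report listing every component so the offending one can be identified.
template <typename T>
std::string check_gradient(const Grad3<T>& grad, int particle)
{
  bool bad = false;
  for (const auto& g : grad)
    bad = bad || has_nan(g);
  if (!bad)
    return {};

  std::ostringstream msg;
  msg << "NaN in gradient at particle " << particle << '\n';
  for (std::size_t i = 0; i < grad.size(); ++i)
    msg << "grads[" << i << "] = " << grad[i] << '\n';
  return msg.str();
}
```

Calling `check_gradient` on a `Grad3<float>` whose first component has a NaN real part and a finite imaginary part returns a non-empty report, while a gradient with all finite components returns an empty string.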
To Reproduce
Input and output files below: dmc_2x2_single_prec-test.zip
Expected behavior
The calculation should complete successfully without encountering NaN values in the wave function gradients, resulting in accurate and stable output data.
System:
System name: Perlmutter
Modules loaded:
module use /global/common/software/nersc/n9/llvm/modules
module load craype cray-mpich
module load llvm/17.0.6-gpu
Other systems where this is reproducible: Not tested on other systems.
Additional context
The calculation was performed using the complex version of QMCPACK with NVIDIA GPU and OpenMP offload. No other context or error messages were in the output files.
Thanks for the report, Roman. ~Is this the first run you have tried or are other runs either working or failing for you?~ Any issues with other runs? I see the full precision run of this system was fine.
I tried it first for the larger system and ended up with the same error as for this smaller system. I did not investigate any further. For full precision, I did not run into any issues, as you wrote.