Sudden jump in VMC and nan in DMC energies using Frontier
Describe the bug
VMC energies and the variance suddenly jump for twist numbers 0 and 1. Although they seem to recover for both twists, twist number 1 later gets nan energies in the DMC calculation.
To Reproduce
Steps to reproduce the behavior:
- QMCPACK 3.17.9 (Dec 22nd)
- Frontier
- Using the Frontier build script
- All the input and smaller statistical output files are provided in the attachment
- Wavefunctions are provided in /lustre/orion/mat151/proj-shared/qmcpack_bug_issue_4903
Expected behavior
From Frontier: [figures: local energy and variance traces]
In the figures, it looks like there is only a jump in the VMC energies, but grep nan *scalar.dat shows persistent nan values in the dmc.g001.s002.scalar.dat file upon inspection.
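(For reference, the check is just the following, run in the output directory; dmc.g001.s002.scalar.dat is the file flagged above:)

# search all twist/series scalar outputs for nan values
grep nan *scalar.dat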
From Cades: [figures: local energy and variance traces]
System: Frontier
Additional context
Input and statistical output files:
- From Frontier: dmc_WSe2_AAp_pbe_u_None_4x4x1_2x2x1_2500.tar.gz
- From Cades: dmc_WSe2_AAp_pbe_u_None_4x4x1_2x2x1_2500_cades.tar.gz
Could you rerun with exactly the same conditions and see if the issue is reproducible?
I ran it twice and observed the jump in VMC both times. I didn't check for nan errors in the first try.
Here are the results from the first run I made (the results reported at the top are from the second run):
qmca -q eV *.scalar.dat

                    LocalEnergy                 Variance                  ratio
dmc.g000  series 0  1031677669175257708749823602787387928526500921344.000000 +/- 1026345965942322773734163902878667705320520810496.000000   277413416700286380944563850635856933246045636680252969819277578653873886403926720519161636131019161600.000000 +/- 275979746034657924841166118984974651779648503949948056700927631438939421145829807712974513419170349056.000000   268895436034838101943021826030389515654447120450060288.0000
dmc.g001  series 0  -2759.381274 +/- 0.289404   22234.077176 +/- 20883.370062   8.0576
dmc.g002  series 0  -2759.132788 +/- 0.017212   33.140712 +/- 0.145202          0.0120
dmc.g003  series 0  -313831855792393308948292292026331824128.000000 +/- 311787326824377924783786256208643489792.000000   20377572244396539148925002298154777013379315760865839814824687690869529790243667968.000000 +/- 20244817917585162855725057061160757600720053693529940553593344515323351058201182208.000000   64931497132262932133405816933814858135109632.0000
Comparing the first and second runs, different twists were affected, except for gamma, which seems to be problematic in both cases. Inputs and the statistics outputs of the first run are attached here:
dmc_WSe2_AAp_pbe_u_None_4x4x1_2x2x1_2500_first.tar.gz
The first and second runs differ only in the "walkers_per_rank" parameter.
Could you rerun with export HSA_ENABLE_SDMA=0 in your job script? This works around a known AMD software bug.
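(A minimal sketch of where this goes in a Frontier Slurm batch script; the project, resource counts, and launch line below are placeholders, not taken from this issue:)

#!/bin/bash
#SBATCH -A <project>        # placeholder account
#SBATCH -N 1                # placeholder node count
#SBATCH -t 01:00:00

export HSA_ENABLE_SDMA=0    # workaround for the known AMD SDMA bug mentioned above
export OMP_NUM_THREADS=7    # thread count used in these runs; see the end of the thread regarding OMP_NUM_THREADS=1

srun -n 8 -c 7 --gpus-per-task=1 /path/to/qmcpack dmc.in.xml    # placeholder launch line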
With HSA_ENABLE_SDMA=0 it seems improved, but not fully resolved. Now I only see the energy jump in VMC and no nan values in DMC.
Run 1:
qmca -q eV *.scalar.dat -at
LocalEnergy Variance ratio
avg series 0 71988882972599952.000000 +/- 71494108154439304.000000 2033443914202484946718242814936265785344.000000 +/- 2019468189043949633742443321418877763584.000000 28246637956258387197952.0000
avg series 1 -2762.208943 +/- 0.349528 33.730507 +/- 0.127176 0.0122
avg series 2 -2762.457672 +/- 0.063145 33.267233 +/- 0.095556 0.0120
qmca -q eV *.scalar.dat
                    LocalEnergy                  Variance                  ratio
dmc.g000  series 0  -2759.200685 +/- 0.015851    33.589793 +/- 0.386683    0.0122
dmc.g000  series 1  -2762.312811 +/- 0.309343    34.011915 +/- 0.422985    0.0123
dmc.g000  series 2  -2762.369345 +/- 0.081484    33.179224 +/- 0.347585    0.0120
dmc.g001  series 0  210207538279997248.000000 +/- 209153859773708928.000000   5934342188785026234929344173559143989248.000000 +/- 5904595925332901835902739230938844102656.000000   28230872390886664175616.0000
dmc.g001  series 1  -2762.129798 +/- 0.420648    34.077298 +/- 0.264550    0.0123
dmc.g001  series 2  -2762.497175 +/- 0.066004    33.121632 +/- 0.199665    0.0120
dmc.g002  series 0  -2759.158370 +/- 0.020109    33.546908 +/- 0.294709    0.0122
dmc.g002  series 1  -2762.184378 +/- 0.285604    33.127810 +/- 0.281526    0.0120
dmc.g002  series 2  -2762.498208 +/- 0.102641    33.159337 +/- 0.234883    0.0120
dmc.g003  series 0  -2759.098096 +/- 0.022484    33.131910 +/- 0.231165    0.0120
dmc.g003  series 1  -2762.208786 +/- 0.392482    33.634674 +/- 0.185123    0.0122
dmc.g003  series 2  -2762.459931 +/- 0.026254    33.521872 +/- 0.473860    0.0121
Run 2:
qmca -q eV *.scalar.dat -at
                LocalEnergy                  Variance                    ratio
avg  series 0   -2759.165978 +/- 0.015274    158.794921 +/- 124.896919   0.0576
avg  series 1   -2762.284685 +/- 0.352293    33.699636 +/- 0.158610      0.0122
avg  series 2   -2762.576489 +/- 0.036107    33.397349 +/- 0.196986      0.0121
qmca -q eV *.scalar.dat
                    LocalEnergy                  Variance                   ratio
dmc.g000  series 0  -2759.225573 +/- 0.018721    32.917302 +/- 0.157844     0.0119
dmc.g000  series 1  -2762.497002 +/- 0.338240    33.687777 +/- 0.145556     0.0122
dmc.g000  series 2  -2762.647535 +/- 0.046788    33.400399 +/- 0.364039     0.0121
dmc.g001  series 0  -2759.127363 +/- 0.014157    33.528921 +/- 0.217131     0.0122
dmc.g001  series 1  -2762.123021 +/- 0.334773    33.710596 +/- 0.213986     0.0122
dmc.g001  series 2  -2762.494155 +/- 0.058186    33.305032 +/- 0.499683     0.0121
dmc.g002  series 0  -2759.163331 +/- 0.013737    33.054767 +/- 0.174415     0.0120
dmc.g002  series 1  -2762.142878 +/- 0.373252    33.205315 +/- 0.198137     0.0120
dmc.g002  series 2  -2762.580997 +/- 0.045814    33.453149 +/- 0.511991     0.0121
dmc.g003  series 0  -2759.147645 +/- 0.060131    535.335990 +/- 499.286335  0.1940
dmc.g003  series 1  -2762.375838 +/- 0.365324    34.056943 +/- 0.498526     0.0123
dmc.g003  series 2  -2762.583269 +/- 0.058106    33.385943 +/- 0.289469     0.0121
It seems that you are using hybridrep + GPU; this is still under development. Could you run with gpu=no added to the sposet_builder line?
@ye-luo Is hybridrep+GPU incomplete, known to be buggy, or just not tested enough (etc.)? If it is known to be incomplete then it should be blocked off or have an unmissable warning printed.
@kayahans Have you been able to run this elsewhere (NERSC CPUs?)? It is more important that you can publish the science than spend any time chasing this.
@prckent I ran these calculations in Cades. I have attached the input files I used and the trace data plots in the issue post at the top.
It seems that you are using hybridrep + GPU; this is still under development. Could you run with gpu=no added to the sposet_builder line?
@ye-luo Should I run this in Frontier again?
@kayahans
- Are runs on Cades all good? If not, we probably need to first look into other reasons for the failure before touching GPUs.
- Regarding hybridrep on GPU, it should technically work: the code paths are routed through the single-walker API and the tests pass, but the performance is very poor. So it is not recommended for use on GPU right now. If you have production needs on GPU, it is recommended to just run the hybrid SPO on CPU.
- Why the code is behaving strangely on Frontier is hard to guess. To rule out an AMD software issue, I would like to first have runs on NVIDIA machines to rule out bad code on our side.
Thanks @ye-luo, yes, I had no such issues when running this particular material or other bilayered materials at Cades, which is a CPU-only machine. I think your suggestion is to run the same calculation on Polaris?
My suggestion is to put hybridrep on the CPU even when you are using the GPU.
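(A minimal sketch of what that change looks like in the input XML, assuming a bspline sposet_builder; the href below is a placeholder for your own wavefunction file:)

<sposet_builder type="bspline" href="path/to/wavefunction.h5" hybridrep="yes" gpu="no">
  <!-- existing <sposet .../> definitions stay unchanged -->
</sposet_builder>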
@ye-luo Running with the hybrid rep on CPU seems to solve the problem. I didn't see any spikes in the VMC energy with hybrid rep on CPU. Here are the VMC total energies from the Cades and Frontier runs; they agree within statistics:
Cades:
LocalEnergy Variance ratio
avg series 0 -2759.145563 +/- 0.006540 33.248169 +/- 0.064987 0.0121
Frontier:
LocalEnergy Variance ratio
avg series 0 -2759.149956 +/- 0.005542 33.390082 +/- 0.090175 0.0121
Frontier VMC trace:
After discussions today I am wondering if this problem has been distinguished from the known and ongoing problems with Frontier that are not specific to QMCPACK or if it could be a problem with the hybrid rep GPU implementation (i.e. our bug)? Are you able to run on Polaris or Perlmutter GPU OK? Did the Frontier run use multiple threads? I would not expect multiple thread runs on Frontier to be reliable but would expect them to be reliable on NVIDIA GPUs. The main thing is to secure a reliable route somewhere to get this research finished and published.
Hi Paul,
Are you able to run on Polaris or Perlmutter GPU OK?
Yes, both computers worked fine for the same run I tested.
Did the Frontier run use multiple threads?
Yes, I set OMP_NUM_THREADS=7
I would not expect multiple thread runs on Frontier to be reliable but would expect them to be reliable on NVIDIA GPUs. I am wondering if this problem has been distinguished from the known and ongoing problems with Frontier that are not specific to QMCPACK or if it could be a problem with the hybrid rep GPU implementation (i.e. our bug)?
All the smaller (2x2x1) supercells with different interlayer separations of bilayer MoTe2 worked fine in Frontier, but larger supercells (3x3x1 and 4x4x1) failed. I tried turning the hybrid rep off, but it didn't change the outcome.
Due to race conditions and similar problems, it is not worth investing more human time in OMP_NUM_THREADS>1 runs on Frontier until we have received and tested updated system software. We'll make an announcement. The problems can seemingly occur at any moment and for any system size -- or remain hidden. OMP_NUM_THREADS=1 runs are believed safe and reliable. => Only use OMP_NUM_THREADS=1 for now.
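In practice this just means setting, in the job script:

export OMP_NUM_THREADS=1    # single OpenMP thread per MPI rank until updated Frontier system software is validated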