
Sudden jump in VMC and nan in DMC energies using Frontier

Open kayahans opened this issue 1 year ago • 16 comments

Describe the bug
The VMC energies and variances suddenly jump for twist numbers 0 and 1. Although they seem to recover for both twists, twist number 1 later gets NaN energies in the DMC calculation.

To Reproduce
Steps to reproduce the behavior:

  1. QMCPACK 3.17.9 (Dec 22nd) on Frontier
  2. Built using the Frontier build script
  3. All inputs and the smaller statistical output files are provided in the attachment
  4. Wavefunctions are provided in /lustre/orion/mat151/proj-shared/qmcpack_bug_issue_4903

Expected behavior
From Frontier:
Local energy: (screenshot)
Variance: (screenshot)

In the figures it looks like there is only a jump in the VMC energies, but grep nan *scalar.dat shows persistent NaN values in the dmc.g001.s002.scalar.dat file upon inspection.
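For reference, this is the kind of check that flags the problem (a minimal sketch; the glob and the specific file name are simply the ones from this run):

# list every scalar.dat file in the run directory that contains a nan entry
grep -il "nan" *scalar.dat

# inspect the offending entries, with line numbers, in the file reported above
grep -in "nan" dmc.g001.s002.scalar.dat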

From Cades:
Local energy: (screenshot)
Variance: (screenshot)

System: Frontier

Additional context
Input and statistical output files:
From Frontier: dmc_WSe2_AAp_pbe_u_None_4x4x1_2x2x1_2500.tar.gz
From Cades: dmc_WSe2_AAp_pbe_u_None_4x4x1_2x2x1_2500_cades.tar.gz

kayahans avatar Jan 18 '24 20:01 kayahans

Could you rerun with exactly the same conditions and see if the issue is reproducible?

ye-luo avatar Jan 18 '24 20:01 ye-luo

I ran it twice and observed the jump in VMC both times. I didn't check for the NaN errors in the first try.

kayahans avatar Jan 18 '24 21:01 kayahans

Here are the results from the first run I made (the results reported at the top are from the second run):

qmca -q eV *.scalar.dat

                            LocalEnergy               Variance           ratio
dmc.g000  series 0  1031677669175257708749823602787387928526500921344.000000 +/- 1026345965942322773734163902878667705320520810496.000000   277413416700286380944563850635856933246045636680252969819277578653873886403926720519161636131019161600.000000 +/- 275979746034657924841166118984974651779648503949948056700927631438939421145829807712974513419170349056.000000   268895436034838101943021826030389515654447120450060288.0000
dmc.g001  series 0  -2759.381274 +/- 0.289404   22234.077176 +/- 20883.370062   8.0576
dmc.g002  series 0  -2759.132788 +/- 0.017212   33.140712 +/- 0.145202   0.0120
dmc.g003  series 0  -313831855792393308948292292026331824128.000000 +/- 311787326824377924783786256208643489792.000000   20377572244396539148925002298154777013379315760865839814824687690869529790243667968.000000 +/- 20244817917585162855725057061160757600720053693529940553593344515323351058201182208.000000   64931497132262932133405816933814858135109632.0000


Comparing the 1st and 2nd runs, different twists were affected, except for gamma, which seems to be problematic in both cases. Inputs and the statistical outputs of the first run are attached here:

dmc_WSe2_AAp_pbe_u_None_4x4x1_2x2x1_2500_first.tar.gz

The first and second runs differ only in the "walkers_per_rank" parameter.

kayahans avatar Jan 18 '24 21:01 kayahans

Could you rerun with export HSA_ENABLE_SDMA=0 in your job script? It works around a known AMD software bug.
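For reference, a minimal sketch of where the variable would go in a Frontier Slurm batch script (the account, resource request, executable path, and input file name below are placeholders, not taken from the actual run):

#!/bin/bash
#SBATCH -A mat151        # placeholder project account
#SBATCH -N 1             # placeholder node count
#SBATCH -t 01:00:00

# disable the SDMA transfer engines to work around the known AMD software bug
export HSA_ENABLE_SDMA=0

srun -n 8 --gpus-per-task=1 ./qmcpack dmc.in.xml   # placeholder launch line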

ye-luo avatar Jan 22 '24 16:01 ye-luo

With HSA_ENABLE_SDMA=0 it seems improved, but not fully resolved. Now I only see the energy jump in VMC, and no NaN values in DMC. Run 1:

qmca -q eV *.scalar.dat -at

                            LocalEnergy               Variance           ratio
avg  series 0  71988882972599952.000000 +/- 71494108154439304.000000   2033443914202484946718242814936265785344.000000 +/- 2019468189043949633742443321418877763584.000000   28246637956258387197952.0000
avg  series 1  -2762.208943 +/- 0.349528   33.730507 +/- 0.127176   0.0122
avg  series 2  -2762.457672 +/- 0.063145   33.267233 +/- 0.095556   0.0120


qmca -q eV *.scalar.dat

                            LocalEnergy               Variance           ratio
dmc.g000  series 0  -2759.200685 +/- 0.015851   33.589793 +/- 0.386683   0.0122
dmc.g000  series 1  -2762.312811 +/- 0.309343   34.011915 +/- 0.422985   0.0123
dmc.g000  series 2  -2762.369345 +/- 0.081484   33.179224 +/- 0.347585   0.0120

dmc.g001  series 0  210207538279997248.000000 +/- 209153859773708928.000000   5934342188785026234929344173559143989248.000000 +/- 5904595925332901835902739230938844102656.000000   28230872390886664175616.0000
dmc.g001  series 1  -2762.129798 +/- 0.420648   34.077298 +/- 0.264550   0.0123
dmc.g001  series 2  -2762.497175 +/- 0.066004   33.121632 +/- 0.199665   0.0120

dmc.g002  series 0  -2759.158370 +/- 0.020109   33.546908 +/- 0.294709   0.0122
dmc.g002  series 1  -2762.184378 +/- 0.285604   33.127810 +/- 0.281526   0.0120
dmc.g002  series 2  -2762.498208 +/- 0.102641   33.159337 +/- 0.234883   0.0120

dmc.g003  series 0  -2759.098096 +/- 0.022484   33.131910 +/- 0.231165   0.0120
dmc.g003  series 1  -2762.208786 +/- 0.392482   33.634674 +/- 0.185123   0.0122
dmc.g003  series 2  -2762.459931 +/- 0.026254   33.521872 +/- 0.473860   0.0121

Run 2:

qmca -q eV *.scalar.dat -at

                            LocalEnergy               Variance           ratio
avg  series 0  -2759.165978 +/- 0.015274   158.794921 +/- 124.896919   0.0576
avg  series 1  -2762.284685 +/- 0.352293   33.699636 +/- 0.158610   0.0122
avg  series 2  -2762.576489 +/- 0.036107   33.397349 +/- 0.196986   0.0121


qmca -q eV *.scalar.dat

                            LocalEnergy               Variance           ratio
dmc.g000  series 0  -2759.225573 +/- 0.018721   32.917302 +/- 0.157844   0.0119
dmc.g000  series 1  -2762.497002 +/- 0.338240   33.687777 +/- 0.145556   0.0122
dmc.g000  series 2  -2762.647535 +/- 0.046788   33.400399 +/- 0.364039   0.0121

dmc.g001  series 0  -2759.127363 +/- 0.014157   33.528921 +/- 0.217131   0.0122
dmc.g001  series 1  -2762.123021 +/- 0.334773   33.710596 +/- 0.213986   0.0122
dmc.g001  series 2  -2762.494155 +/- 0.058186   33.305032 +/- 0.499683   0.0121

dmc.g002  series 0  -2759.163331 +/- 0.013737   33.054767 +/- 0.174415   0.0120
dmc.g002  series 1  -2762.142878 +/- 0.373252   33.205315 +/- 0.198137   0.0120
dmc.g002  series 2  -2762.580997 +/- 0.045814   33.453149 +/- 0.511991   0.0121

dmc.g003  series 0  -2759.147645 +/- 0.060131   535.335990 +/- 499.286335   0.1940
dmc.g003  series 1  -2762.375838 +/- 0.365324   34.056943 +/- 0.498526   0.0123
dmc.g003  series 2  -2762.583269 +/- 0.058106   33.385943 +/- 0.289469   0.0121

kayahans avatar Jan 24 '24 21:01 kayahans

It seems that you are using hybridrep + GPU, which is still under development. Could you run with gpu=no added to the sposet_builder line?
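For clarity, a sketch of the requested change (the input file name and any extra attributes shown in the comment are placeholders; only hybridrep and gpu=no come from this thread):

# locate the sposet_builder element in the QMCPACK input
grep -n "sposet_builder" dmc.in.xml

# then edit that element so the SPOs stay off the GPU, e.g.
#   <sposet_builder type="bspline" hybridrep="yes" gpu="no" ...>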

ye-luo avatar Feb 09 '24 20:02 ye-luo

@ye-luo Is hybridrep+GPU incomplete or known to be buggy or just not tested enough (etc.)? If it is known to be incomplete then it should be blocked off or have an unmissable warning printed.

@kayahans Have you been able to run this elsewhere (NERSC CPUs?)? It is more important that you can publish the science than spend any time chasing this.

prckent avatar Feb 12 '24 13:02 prckent

@prckent I ran these calculations on Cades. I have attached the input files I used and the trace data plots in the issue post at the top.

kayahans avatar Feb 12 '24 15:02 kayahans

It seems that you are using hybridrep + GPU, which is still under development. Could you run with gpu=no added to the sposet_builder line?

@ye-luo Should I run this on Frontier again?

kayahans avatar Feb 12 '24 18:02 kayahans

@kayahans

  1. Are the runs on Cades all good? If not, we probably need to first look into other reasons for the failure before touching GPUs.
  2. Regarding hybridrep on GPU, it should technically work: the code paths are routed through the single-walker API and the tests pass, but the performance is very poor. So it is not recommended on GPU right now. If you have production needs on GPU, it is recommended to just run the hybrid SPOs on the CPU.
  3. Why the code is behaving strangely on Frontier is hard to guess. To rule out an AMD software issue, I would like to have runs on NVIDIA machines first to rule out bad code on our side.

ye-luo avatar Feb 13 '24 22:02 ye-luo

Thanks @ye-luo, yes, I had no such issues when running this or other bilayer materials at Cades, which is a CPU-only machine. I think your suggestion is to run the same calculation on Polaris?

kayahans avatar Feb 15 '24 21:02 kayahans

Thanks @ye-luo, yes, I had no such issues when running this or other bilayer materials at Cades, which is a CPU-only machine. I think your suggestion is to run the same calculation on Polaris?

My suggestion is to put hybridrep on the CPU even when you are using the GPU.

ye-luo avatar Feb 15 '24 21:02 ye-luo

@ye-luo Running with the hybrid rep on the CPU seems to solve the problem. I didn't see any spikes in the VMC energy with the hybrid rep on the CPU. Here are the VMC total energies from Cades vs. Frontier; they agree within statistical errors:
Cades:

                            LocalEnergy               Variance           ratio
avg  series 0  -2759.145563 +/- 0.006540   33.248169 +/- 0.064987   0.0121

Frontier:

                            LocalEnergy               Variance           ratio
avg  series 0  -2759.149956 +/- 0.005542   33.390082 +/- 0.090175   0.0121

Frontier VMC trace:

(screenshots)

kayahans avatar Feb 20 '24 19:02 kayahans

After discussions today, I am wondering whether this problem has been distinguished from the known and ongoing problems with Frontier that are not specific to QMCPACK, or whether it could be a problem with the hybrid rep GPU implementation (i.e., our bug). Are you able to run on Polaris or Perlmutter GPUs OK? Did the Frontier run use multiple threads? I would not expect multiple-thread runs on Frontier to be reliable, but would expect them to be reliable on NVIDIA GPUs. The main thing is to secure a reliable route somewhere to get this research finished and published.

prckent avatar Sep 18 '24 17:09 prckent

Hi Paul,

Are you able to run on Polaris or Perlmutter GPU OK?

Yes, both computers worked fine for the same run I tested.

Did the Frontier run use multiple threads?

Yes, I set OMP_NUM_THREADS=7

I would not expect multiple thread runs on Frontier to be reliable but would expect them to be reliable on NVIDIA GPUs. I am wondering if this problem has been distinguished from the known and ongoing problems with Frontier that are not specific to QMCPACK or if it could be a problem with the hybrid rep GPU implementation (i.e. our bug)?

All the smaller (2x2x1) supercells with different interlayer separations of bilayer MoTe2 worked fine on Frontier, but the larger supercells (3x3x1 and 4x4x1) failed. I tried turning the hybrid rep off, but it didn't change the outcome.

kayahans avatar Sep 18 '24 18:09 kayahans

Due to race conditions and similar problems, it is not worth investing more human time in OMP_NUM_THREADS>1 runs on Frontier until we have received and tested updated system software. We'll make an announcement. The problems can seemingly occur at any moment and for any system size -- or remain hidden. OMP_NUM_THREADS=1 runs are believed safe and reliable. => Only use OMP_NUM_THREADS=1 for now.
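In job-script terms this amounts to something like the following minimal sketch (the srun resource options and the executable/input names are placeholders; the thread count is the only point):

# restrict QMCPACK to one OpenMP thread per MPI rank on Frontier for now
export OMP_NUM_THREADS=1

srun -n 8 -c 7 --gpus-per-task=1 ./qmcpack dmc.in.xml   # placeholder launch line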

prckent avatar Sep 18 '24 18:09 prckent