
Population instability in DMC

Open · jtkrogel opened this issue on Jul 14, 2023 · 3 comments

Describe the bug

DMC population eventually diverges for the bilayer systems MoS2 and MoSe2. Perhaps coincidentally, the divergence is only observed at the largest separation distances.

Below are population traces for (1) MoS2 at 6 A separation and (2, 3) MoSe2 at 8 A separation with two different Jastrows:

[Figure wpop_explosion: walker population traces for the three runs]

Runs were performed with 16 twists using the batched drivers. Population divergence at one twist caused the run to crash for all twists.
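A minimal sketch of the kind of per-twist scan that could be used to flag which twist diverges first (the file pattern, the column name, and the target population are assumptions, not this run's actual settings):

```python
import glob
import numpy as np

# Hypothetical sketch of a per-twist population check. The file pattern and the
# column name "NumOfWalkers" are assumptions about the trace-file layout, not
# verified against this particular run.
TARGET_POP = 2000   # assumed target walker population per twist
FACTOR = 2.0        # flag twists whose population exceeds 2x the target

for path in sorted(glob.glob("*.g*.dmc.dat")):
    with open(path) as f:
        header = f.readline().lstrip("#").split()
    if "NumOfWalkers" not in header:
        continue
    data = np.atleast_2d(np.loadtxt(path))
    pop = data[:, header.index("NumOfWalkers")]
    if pop.max() > FACTOR * TARGET_POP:
        print(f"{path}: population peaked at {pop.max():.0f} (row {pop.argmax()})")
```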

Checks are currently being performed with both the batched and legacy drivers.

Expected behavior

The DMC population remains stable.

System:

  • system name: polaris

Additional context

The variance/energy ratio is good (0.014 Ha) and the overall variance is not large (<10 Ha^2).
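For reference, a minimal sketch of how such a ratio can be computed from a scalar trace (the file name, column names, and equilibration cutoff are assumptions; some outputs record LocalEnergy_sq instead of Variance, in which case the variance is <E^2> - <E>^2):

```python
import numpy as np

# Hypothetical variance/energy-ratio check. Column names ("LocalEnergy",
# "Variance"), the file name, and the equilibration cutoff are assumptions.
EQUILIBRATION = 100  # blocks to discard before averaging (assumed)

path = "run.s001.scalar.dat"  # placeholder file name
with open(path) as f:
    header = f.readline().lstrip("#").split()
data = np.atleast_2d(np.loadtxt(path))[EQUILIBRATION:]

e_mean = data[:, header.index("LocalEnergy")].mean()
variance = data[:, header.index("Variance")].mean()
print(f"variance/|E| = {variance / abs(e_mean):.3f} Ha")  # ~0.014 Ha here
```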

File location:
/lus/grand/projects/PSFMat_2/shared/2d_database/MoSe2_AAp-4/dmc_MoSe2_AAp_pbe_u_None_2x2x1_4x4x1_8000

jtkrogel · Jul 14 '23 14:07

Ugh. I think there are two issues here: (1) the DMC is unstable for whatever reason, and (2) a failure in one twist brings down the whole ensemble. The latter should be mostly avoidable.

Suggestion: try a larger timestep and see if the problem can be triggered sooner. If the hypothesis that a "bad move" is being generated is correct, the problem should appear sooner and the difference between the legacy and batched drivers should grow, since the batched driver does not limit move sizes.
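For context, assuming "limit move sizes" refers to Umrigar-Nightingale-Runge (UNR) style drift limiting, here is a minimal sketch of that scheme (a generic textbook form, not QMCPACK's actual code) showing how it caps the drift part of a step when |grad psi / psi| blows up near a node:

```python
import numpy as np

# Sketch of UNR-style drift limiting (assumed to be what "limit move sizes"
# refers to); generic textbook form, not QMCPACK's actual implementation.
def limited_drift(v, tau, a=1.0):
    """Rescale the drift velocity so a single drift step stays bounded."""
    v = np.asarray(v, dtype=float)
    v2 = float(np.dot(v, v))
    if v2 < 1e-300:
        return v
    scale = (-1.0 + np.sqrt(1.0 + 2.0 * a * tau * v2)) / (a * tau * v2)
    return scale * v

tau = 0.005
for vmag in (1.0, 10.0, 1000.0):
    v = np.array([vmag, 0.0, 0.0])
    raw = tau * v[0]                       # unlimited drift move
    lim = tau * limited_drift(v, tau)[0]   # UNR-limited drift move
    print(f"|v| = {vmag:7.1f}  unlimited = {raw:8.3f}  limited = {lim:8.3f}")
```

With tau = 0.005 the unlimited drift move for |v| = 1000 is about 5 bohr while the limited one stays near 0.1 bohr, which would be consistent with rare bad moves showing up faster without the limiter and at larger timesteps.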

prckent · Jul 14 '23 15:07

Following discussions with Jaron, I attempted to reproduce this over the weekend using a CPU non-MPI build. Each individual twist in this run can be run on a single node. With similar settings I was able to catch a blip in the population in the negative direction, if not a complete failure. It is barely visible in the energy and, if we weren't doing this investigation, would normally be considered healthy. This used the batched code with 2016 total walkers, 500 blocks of 10 steps, a 0.005 a.u. timestep, and 48 threads. A smaller run with 1008 total walkers and 250 blocks of 10 steps on 24 threads did not show visible problems.

I think this adds to the hypothesis that we "simply" have a sporadic Monte Carlo problem, such as putting an electron somewhere very unlikely. A sporadic bug cannot be ruled out, but it would have to survive many blocks.

It is also interesting that this system takes 500+ steps to equilibrate; warmup was set to 100 steps. The trial energy (not shown) does track the ensemble average through the run, so it is not a cause for concern.
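For context, the trial energy tracking the ensemble average is the expected behavior of the usual population-control feedback, which in a common textbook form (QMCPACK's exact update may differ) reads

$$E_T = E_{\mathrm{ref}} - \frac{1}{g\,\tau}\,\ln\!\left(\frac{N_w}{N_w^{\mathrm{target}}}\right)$$

so as long as the walker count $N_w$ stays near its target, $E_T$ stays close to the running ensemble energy.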

Edit: This is a rerun of g014 from the original ensemble run, the single twist that failed in that instance.

[Figure test2_pop: walker population trace for the rerun]

[Figure test2_energy: local energy trace for the rerun]

prckent · Jul 17 '23 13:07

Update: I don't think I have been able to reproduce this problem, including in the runs above. I see occasional fluctuations down to ~1800 walkers (~10% below the target), but I have not seen a divergence. I have rerun multiple times with different seeds. I also tried an OpenMPI build (12 tasks, 4 threads) with 2016 total walkers and 10000 total steps.

prckent · Jul 24 '23 16:07