Energy density estimator tests near 100% fail rate
**Describe the bug**
Looking at the month-long statistics for tests named energydensity, the fail rate is near 100%: https://cdash.qmcpack.org/testOverview.php?project=QMCPACK&begin=2025-04-01&end=2025-04-27&filtercount=1&showfilters=1&field1=testname&compare1=63&value1=energydensity
These tests mostly exercise the legacy code, which has not been intentionally changed in months. Possible explanations include unintentional breakage of the tests or of the underlying code, or perhaps the tests and/or code were simply not robust after the last time they were "fixed". We could also have h5py issues similar to https://github.com/QMCPACK/qmcpack/pull/3367. Most of the reported numerics are close to passing, so larger error bars could also be part of the solution.
**To Reproduce**
`ctest -R energydensity` with any recent version.
**Expected behavior**
Test should pass; we should have confidence in energy density.
**System:**
Any
**Additional context**
I have not yet tried to bisect the failure or check historical PR discussions.
Note that the tests sometimes do pass, e.g. https://cdash.qmcpack.org/testSummary.php?project=1&name=short-diamondC_1x1x1_pp-vmc-estimator-energydensity-cell-4-4-check&date=2025-04-25 shows 3 of 18 runs passing in VMC.
The earliest CDash results we have are from the last week of February, and the tests were already failing then: https://cdash.qmcpack.org/testOverview.php?project=QMCPACK&begin=2025-02-23&end=2025-02-28&filtercount=1&showfilters=1&field1=testname&compare1=63&value1=energydensity
Possibly we didn't follow up and fix the underlying problems after fixing check_stats.py (bisecting from v3.17.1 to today):
```
The following tests FAILED:
    431 - short-LiH_ae-vmc_hf_noj_estimator_energydensity_voronoi-4-4-check (Failed)
    547 - short-LiH_pp-vmc_hf_sdj_estimator_energydensity_voronoi-4-4-check (Failed)
   1240 - short-diamondC_1x1x1_pp-vmc-estimator-energydensity-cell-4-4-check (Failed)
   1242 - short-diamondC_1x1x1_pp-vmc-estimator-energydensity-voronoi-4-4-check (Failed)
   1244 - short-diamondC_1x1x1_pp-dmc-estimator-energydensity-cell-4-4-check (Failed)
Errors while running CTest
```

```
313cf7ece465007b72d42164a7f3c7ad7e10e767 is the first bad commit
commit 313cf7ece465007b72d42164a7f3c7ad7e10e767
Author: Ye Luo <[email protected]>
Date:   Wed May 15 13:17:27 2024 -0500

    check_stats.py properly exits when h5py read fails.

 tests/scripts/check_stats.py | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)
bisect found first bad commit
```
This estimator definitely worked at the end of 2022. I started looking into what was going on once Gani flagged the issue (#5224), but didn't dig very far. The tests have a deterministic component that should always pass (checksums vs. the total energy).
What I remember is that zeros were being written into the stat.h5 file. This might suggest zeros are being passed through TraceManager.
A good first check would be to enable the in-situ energy sum checks for the energy density (see the TRACE_CHECK environment variable). If zero values are being passed through TraceManager, this will catch it. If this check fails, then the per-particle quantities are not being produced by the individual energy components (Coulomb potential, etc.), indicating the user's request via the input is not being propagated down through TraceManager to the potentials, etc.
If the above check passes, then a further in-situ test can be enabled in the energy density estimator itself (see the ENERGYDENSITY_CHECK environment variable). If this check fails, the problem resides in the estimator itself. If it passes, then the issue lies between the estimator and the stat.h5 output.
I will note that a number of changes to how HDF5 was handled in the code were made during this time period (see https://github.com/QMCPACK/qmcpack/commits/develop/src/QMCHamiltonians/EnergyDensityEstimator.cpp), including some that affected TraceManager (see https://github.com/QMCPACK/qmcpack/pull/4995).
Very few changes were made to TraceManager itself (see https://github.com/QMCPACK/qmcpack/commits/develop/src/Estimators/TraceManager.h and https://github.com/QMCPACK/qmcpack/commits/develop/src/Estimators/TraceManager.cpp).
I believe this is the last confirmation noted that the energy density estimator was working on CDash: https://github.com/QMCPACK/qmcpack/pull/3294
It surely worked well after this too, but that is when passing tests were last explicitly noted. For some reason, the CDash links no longer work: https://cdash.qmcpack.org/CDash/testDetails.php?test=14660932&build=229279 https://cdash.qmcpack.org/CDash/testDetails.php?test=14379785&build=229242
The log output of the Python test script would confirm this. Perhaps we could find the log output from July 2021 and then bisect the CDash log output (visually) from there?
At least one of the energy density tests (diamond cell grid for DMC) appears to be very close to passing currently:
https://cdash.qmcpack.org/tests/22015901?graph=status
All the statistical tests pass, as well as a deterministic block sum check. Only the deterministic comparison with the scalar.dat file fails:
```
Energy density sums vs. scalar file per block tests:
  E 0  -10.533540356798579!=-10.53344467223
some per block sums do not match the scalar file
status of this test: fail
```
This disagreement could result from a difference, if one exists, in how weights are handled between scalar.dat and stat.h5. In general I would expect quantities from scalar.dat and stat.h5 to agree with each other deterministically to high precision.
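A minimal illustration of how such a weight-handling mismatch could produce a small deterministic discrepancy like the ~1e-4 above (synthetic data only; the weight distributions and the hypothesis that one code path drops the weights are assumptions, not a diagnosis):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200                                # walkers in one block
w = rng.uniform(0.5, 1.5, n)           # per-walker weights
e = rng.normal(-10.53, 0.05, n)        # per-walker local energies

weighted = np.sum(w * e) / np.sum(w)   # weighted block average (scalar.dat style)
unweighted = np.mean(e)                # block average if weights are dropped

print(f"weighted   {weighted:.12f}")
print(f"unweighted {unweighted:.12f}")
print(f"difference {weighted - unweighted:.2e}")
```

The two averages converge to the same value only statistically; block by block they differ deterministically, which matches the signature of a check that fails a strict `!=` comparison while remaining close numerically.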
The voronoi tests on LiH (ae and pp) also have most checks passing, but show a different pattern. All the deterministic checks pass, as do the statistical tests on the kinetic energy and the weight. The statistical test on the total potential energy passes, but the partial sum checks (the potential energy on each atom) fail with a statistical significance of between 8 and 199 sigma:
https://cdash.qmcpack.org/tests/22015433 https://cdash.qmcpack.org/tests/22015552
E.g.:
```
Testing quantity: V full sum
  reference mean value     : -1.45471078
  reference error bar      :  0.00224560
  computed mean value      : -1.45494147
  computed error bar       :  0.00255507
  pass tolerance           :  0.00673680 (   3.00000000 sigma)
  deviation from reference : -0.00023069 (  -0.10273063 sigma)
  error bar of deviation   :  0.00340163
  significance probability :  0.08182334 (gaussian statistics)
  status of this test      : pass

Testing quantity: V partial sum 0
  reference mean value     : -0.06180385
  reference error bar      :  0.00082984
  computed mean value      : -0.22734097
  computed error bar       :  0.00070774
  pass tolerance           :  0.00265958 (   3.20494568 sigma)
  deviation from reference : -0.16553712 (-199.48169192 sigma)
  error bar of deviation   :  0.00109065
  significance probability :  1.00000000 (gaussian statistics)
  status of this test      : fail

Testing quantity: V partial sum 1
  reference mean value     : -1.39290693
  reference error bar      :  0.00222893
  computed mean value      : -1.22760050
  computed error bar       :  0.00252165
  pass tolerance           :  0.00714360 (   3.20494568 sigma)
  deviation from reference :  0.16530643 (  74.16407161 sigma)
  error bar of deviation   :  0.00336554
  significance probability :  1.00000000 (gaussian statistics)
  status of this test      : fail
```
The VMC diamond cell grid test shows intermittent passing, unlike the others:
https://cdash.qmcpack.org/tests/22015955?graph=status
One reason this statistical test is less stable than the others is that it involves 24 statistical checks instead of one, which makes a chance failure much more likely. A larger nsigma criterion would be an appropriate fix for this one (and should be applied to the others above once a fix is identified).
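To put a number on the multiple-comparison effect, here is a short calculation (Gaussian statistics, as the test output above assumes) of the family-wise failure rate for 24 independent 3-sigma checks, and the nsigma that would restore a single-check-equivalent rate:

```python
from statistics import NormalDist

nd = NormalDist()
n_checks = 24
p_single = 2 * (1 - nd.cdf(3.0))           # two-sided failure prob at 3 sigma
p_family = 1 - (1 - p_single) ** n_checks  # chance at least one of 24 fails

# nsigma per check that keeps the family-wise rate at the 3-sigma level
p_target = 1 - (1 - p_single) ** (1 / n_checks)
nsigma = nd.inv_cdf(1 - p_target / 2)

print(f"single-check failure prob at 3 sigma  : {p_single:.4f}")
print(f"family-wise failure prob, 24 checks   : {p_family:.4f}")
print(f"nsigma for equivalent family-wise rate: {nsigma:.2f}")
```

With 24 checks at 3 sigma, roughly 6% of clean runs would still fail by chance; raising nsigma to about 3.9 per check brings the family-wise false-failure rate back to the usual ~0.27%.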
The above tests are for GCC14 on nitrogen. Other environments also give intermittent passing results for the diamond VMC case:
- sulfur: GCC14 and Clang20
- nitrogen2: GCC14 and Clang20
- bora: Intel21
In short, the following issues need investigating (in probable priority order):
- [ ] Why do scalar.dat and stat.h5 give deterministically different quantities for DMC? (likely an issue broader than energy density)
- [ ] Why does voronoi energy density give incorrect values for the potential energy partitioning among atoms? (suspicious since energy density code has not changed for a long time, indicating code changes elsewhere are responsible, and hence possibly relating to a broader issue)
- [ ] Does the energy density estimator function in broader/realistic cases? I.e., why did Gani see zeros, and is this behavior still happening?
OK, let's discuss when @PDoakORNL is back, since he may have seen some additional problems while working on the batched version.
Note that unfortunately we don't have CDash data older than ~3 months; the database was getting too large.
I'm back now and will look into this again. I had fixed these tests in legacy, but I'm not sure whether the fix was PR'd incompletely or whether this is a regression. Hopefully, less burnt out now post-vacation, I will be able to get these straightened out. I can certainly attest that in the last couple of months I have had the legacy energy density estimator passing its tests; #5258 was my attempt to get this fix into develop. My batched code, merged into several versions of develop, was passing those same tests. However, pre-#5258, legacy had been broken for a long time simply because the HDF5 file organization had been inadvertently changed.
The current batched EDE in develop is not "deterministically" correct at all. I don't know how many sub-PRs fixing that is going to take, since I'm trying to get the batched structure factor in entirely first. So for now I would entirely ignore what's going on with batched.
Thanks Peter. I suggest that after structure factor we make a list of problems with energy density and its tests, what we know about when they were last working, etc., and then meet to work out how to go about fixing them. E.g., can we set up some extra deterministic integration or (better) unit tests that pass in a known good version of legacy?
It is worth adding a deterministic test that should pass to high precision (regardless of run length): in post-processing, sum the energy density components and verify that they match the total energy reported in the dat files.
The current integration tests do this, but they also bundle in statistical tests, which obscures the cause when looking only at the red/green indicator level.
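The proposed checksum can be sketched as follows (a self-contained toy: synthetic particle energies deposited onto a hypothetical 4x4x4 grid; the grid shape and deposition rule are illustrative assumptions, not QMCPACK's actual binning):

```python
import numpy as np

rng = np.random.default_rng(3)
npart, nside = 32, 4
pos = rng.uniform(0.0, 1.0, size=(npart, 3))  # particle positions, unit cell
e_part = rng.normal(-0.3, 0.05, npart)        # hypothetical per-particle energies

# deposit each particle's energy into its spatial cell (nearest-cell binning)
grid = np.zeros((nside, nside, nside))
idx = np.minimum((pos * nside).astype(int), nside - 1)
for (i, j, k), e in zip(idx, e_part):
    grid[i, j, k] += e

# the deterministic check: the grid must sum back to the total energy
# to floating-point precision, independent of run length
assert np.isclose(grid.sum(), e_part.sum(), atol=1e-12)
print("grid sum matches total:", grid.sum())
```

Because the deposition only rearranges terms of the same sum, this check has no statistical component, so splitting it out of the integration tests would give an unambiguous red/green signal.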