[WIP] User settable period for estimator measurements

Open PDoakORNL opened this issue 6 months ago • 4 comments

Proposed changes

Sometimes it's desirable to accumulate measurements for an estimator on a periodic schedule. This PR adds an implementation of this feature as well as an initial, crude integration test.

What type(s) of changes does this code introduce?

Delete the items that do not apply

  • New feature
  • Testing changes (e.g. new unit/integration/performance tests)
  • Other (please describe):

Does this introduce a breaking change?

  • No

What systems has this change been tested on?

sdgx2

Checklist

Update the following with an [x] where the items apply. If you're unsure about any of them, don't hesitate to ask. This is simply a reminder of what we are going to look for before merging your code.

    • [x] I have read the pull request guidance and develop docs
    • [x] This PR is up to date with the current state of 'develop'
    • [x] Code added or changed in the PR has been clang-formatted
    • [x] This PR adds tests to cover any new code, or to catch a bug that is being fixed
    • [x] Documentation has been added (if appropriate)

PDoakORNL avatar Jun 27 '25 21:06 PDoakORNL

Quick comment: what is the motivation for including the random.h5 files? These are quite large and I would prefer we not have them in the repo. Also, I think a 3 rank test should be sufficient, which would reduce the file count and total size. There is not more coverage at 16 ranks.

prckent avatar Jun 27 '25 22:06 prckent

I am ok with "estimator_measurement_period" as the input tag.

prckent avatar Jun 27 '25 22:06 prckent

I am browsing on the web and might have missed it, but we should enforce (abort) if the estimator_measurement_period is not a factor of the step count.
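For illustration only, the requested check could look like the sketch below. It is written in Python for brevity rather than QMCPACK's C++, and the function name and error messages are hypothetical, not taken from the PR.

```python
def validate_measurement_period(steps: int, period: int) -> int:
    """Return how many measurements one block would contain.

    Hypothetical helper: raises ValueError (standing in for an abort)
    when the period is not a factor of the step count.
    """
    if steps < 1:
        raise ValueError(f"steps must be positive, got {steps}")
    if period < 1:
        raise ValueError(f"estimator_measurement_period must be positive, got {period}")
    if steps % period != 0:
        raise ValueError(
            f"estimator_measurement_period ({period}) is not a factor of steps ({steps})"
        )
    return steps // period
```

With this sketch, `validate_measurement_period(6, 3)` yields 2 measurements per block, while `validate_measurement_period(5, 2)` raises instead of silently dropping a partial period.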

prckent avatar Jun 27 '25 22:06 prckent

I agree, let's reduce the number of .h5 files by just doing the 3 rank case. That reduction and a test of the check for vem versus requested step count are in now.

PDoakORNL avatar Jun 30 '25 15:06 PDoakORNL

This is dependent on #5566 and #5567. I will rebase after those are merged.

PDoakORNL avatar Jul 02 '25 17:07 PDoakORNL

Could you explain the dependency on #5567? I'm expecting WalkerLog and Estimators are orthogonal features.

ye-luo avatar Jul 02 '25 19:07 ye-luo

For the integration test they are not, since right now it is not optional to construct the whole WalkerLog bundle if you instantiate a driver.

PDoakORNL avatar Jul 02 '25 21:07 PDoakORNL

Unsurprisingly, this needs reference data that is platform aware. After the ASan-related PRs, that is next.

PDoakORNL avatar Jul 02 '25 21:07 PDoakORNL

I don't actually know that AMD will be different yet. I will check, and if it is not, fix this. As for justification: the estimator measurement period is a function of the driver, Hamiltonian reporting, and estimator consumption of Hamiltonian values all working together, so while I'd like a simpler test, at the moment I'm not sure a more stripped down test will work. Probably this test should cover a more complete set of combinations of Hamiltonian operators. Even without threading, there seems to be somewhere the GPU consumes the RNG in a different sequence, or a slight numerical difference flips accept/reject moves, and so a larger threshold doesn't suffice. I feel like a minimum of 2 measurement periods is required to test this feature.

PDoakORNL avatar Jul 07 '25 14:07 PDoakORNL

Creating per-vendor validated test reference numbers is really not designed. Even within products from the same vendor, there is no guarantee of numerical reproducibility.

ye-luo avatar Jul 07 '25 17:07 ye-luo

What is the purpose of the random.h5? Which introduced test needs it? Would like to avoid it.

ye-luo avatar Jul 07 '25 19:07 ye-luo

I will address @ye-luo's review, but there is also a remaining testing issue when I run the tests on mi200. Some differences seem like just numerical differences; some look like different moves are getting accepted.

rank0 agreement with the v100 ref looks like this:

operator: ElecIon walker:1 { -2.113806938, 1.162369259, } walker:4 { $${\large{\textbf{\textsf{\color{red}0.2739808123,}}}}$$ $${\large{\textbf{\textsf{\color{green}0.2739808125,}}}}$$ 0.297405327, } operator: ElecElec walker:1 { $${\large{\textbf{\textsf{\color{red}-0.4554943055, -0.5220104233, -0.4162087997, -0.3012724656, -0.7265812973, -0.2061104311, -0.352159659, -0.303722474,}}}}$$ $${\large{\textbf{\textsf{\color{green}-0.452849358, -0.5225818804, -0.4123574584, -0.2032591192, -0.7265812977, -0.1055201197, -0.3435397006, -0.2946488828,}}}}$$ } walker:4 { $${\large{\textbf{\textsf{\color{red}-0.7089605803, -0.4262367314, -0.6490208787, -0.6542469194, -0.4826099956, -0.4702179075, -0.4796862475, -0.4959751363,}}}}$$ $${\large{\textbf{\textsf{\color{green}-0.7089605808, -0.4225666512, -0.6490204947, -0.6546076673, -0.4783006471, -0.4683478746, -0.4794638002, -0.4950603349,}}}}$$ } operator: Kinetic walker:1 { 0.8488909918, -0.3518460699, 1.485935504, 1.02163444, -0.4283702127, 2.328110564, 1.094923655, 1.599094051, } walker:4 { 1.168140642, -0.3450686406, 0.9081663949, 1.01031281, 0.03781160037, 1.282250633, 0.2397257557, 0.6348753282, } operator: ElecIon walker:7 { $${\large{\textbf{\textsf{\color{red}-0.4937755333, 0.8642971625,}}}}$$ $${\large{\textbf{\textsf{\color{green}-0.4937755328, 0.8642971624,}}}}$$ } walker:10 { $${\large{\textbf{\textsf{\color{red}0.410152643,}}}}$$ $${\large{\textbf{\textsf{\color{green}0.4101526431,}}}}$$ 0.5417114567, } operator: ElecElec walker:7 { $${\large{\textbf{\textsf{\color{red}-0.6851683994, -0.1192263963, -0.7255631355, -0.2530848878, -0.1310845977, -0.3451655181, -0.5339785562, -0.5378733498,}}}}$$ $${\large{\textbf{\textsf{\color{green}-0.6851683999, 2.238918156, -0.725563136, -0.1726253946, 2.231003608, -0.2698314674, -0.5373464564, -0.5359767947,}}}}$$ } walker:10 { $${\large{\textbf{\textsf{\color{red}-0.6621413962, -0.4591966547, -0.5968534321, -0.552657794, -0.5748646648, -0.5901659894, -0.5076327997, 
-0.5055386297,}}}}$$ $${\large{\textbf{\textsf{\color{green}-0.6621362271, -0.4592720129, -0.5959837014, -0.5522704854, -0.5739901699, -0.5911878801, -0.5076435201, -0.5068621576,}}}}$$ } operator: Kinetic walker:7 { 1.06805905, 1.259288999, 0.1311783929, 0.4123548623, 0.6749028478, 1.13878437, -0.164974397, 0.9726460588, } walker:10 { 1.16892002, -0.3363494321, 0.5105193478, 1.011670278, 0.8269880733, 0.6218176553, 0.1916461599, 0.5966342397, } operator: ElecIon walker:13 { -4.374560001, 1.284934002, } operator: ElecElec walker:13 { $${\large{\textbf{\textsf{\color{red}-0.0193890114, -0.3578787737, -0.02565966476, -0.628175936, -0.5702362595, -0.3978279771, -0.4441999799, -0.1979541887,}}}}$$ $${\large{\textbf{\textsf{\color{green}0.2381632958, -0.3439199791, 0.2626880494, -0.627160766, -0.5701165458, -0.3964488044, -0.4317766948, -0.1602230787,}}}}$$ } operator: Kinetic walker:13 { 10.25527995, 1.112702837, 1.376369771, 0.6718486464, 0.7424126159, 0.3418571818, 0.4061795875, 0.385270237, } operator: ElecIon walker:16 { 0.7869419396, $${\large{\textbf{\textsf{\color{red}0.7047362598,}}}}$$ $${\large{\textbf{\textsf{\color{green}0.7047362597,}}}}$$ } operator: ElecElec walker:16 { $${\large{\textbf{\textsf{\color{red}-0.5326092327, -0.05785356302, -0.5843892136, -0.6285802284, -0.2976470684, -0.3974167589, -0.03262853693, -0.2877152974,}}}}$$ $${\large{\textbf{\textsf{\color{green}-0.5308092083, 2.128985694, -0.5839132158, -0.628580229, -0.2855121493, -0.3937749721, 2.155115595, -0.2815790879,}}}}$$ } operator: Kinetic walker:16 { 1.416416505, 0.1950978741, -0.1407384392, 0.985801031, -2.066001603, -1.304635594, 4.264746301, 0.7580901924, } operator: ElecIon walker:1 { $${\large{\textbf{\textsf{\color{red}-1.611718971,}}}}$$ $${\large{\textbf{\textsf{\color{green}-1.61171897,}}}}$$ 0.4555253571, } walker:4 { 0.4913218263, 0.3021870185, } operator: ElecElec walker:1 { $${\large{\textbf{\textsf{\color{red}-0.4808385865, -0.7415403276, -0.5224141806, -0.3668030789, 
-0.640709517, -0.1954838601, -0.3567200788, -0.3981312191,}}}}$$ $${\large{\textbf{\textsf{\color{green}-0.4776821509, -0.7455179491, -0.5221759935, -0.3625746853, -0.6407090604, -0.1567079538, -0.3413932856, -0.3799626432,}}}}$$ } walker:4 { $${\large{\textbf{\textsf{\color{red}-0.4938704392, -0.3577700879, -0.3998189634, -0.5880157803, -0.5059498594, -0.3425344148, -0.4586424868, -0.2934155918,}}}}$$ $${\large{\textbf{\textsf{\color{green}-0.4924091102, -0.3102570543, -0.3969622553, -0.5882230492, -0.5056553319, -0.2923181419, -0.4509168923, -0.2828592943,}}}}$$ } operator: Kinetic walker:1 { 2.653768369, -0.08959562492, 1.403489141, 1.362340226, -0.02530945429, 1.628331503, 0.1924241299, 0.8637215875, } walker:4 { 0.3809947123, 0.815118003, 1.746560404, 0.7687271503, 0.3482937341, 0.5981378161, -0.2442899414, 0.6155016987, } operator: ElecIon walker:7 { $${\large{\textbf{\textsf{\color{red}-4.225065946, -0.7666097919,}}}}$$ $${\large{\textbf{\textsf{\color{green}-4.225065945, -0.7666097916,}}}}$$ } walker:10 { -1.825591477, $${\large{\textbf{\textsf{\color{red}0.4200889042,}}}}$$ $${\large{\textbf{\textsf{\color{green}0.4200889043,}}}}$$ } operator: ElecElec walker:7 { $${\large{\textbf{\textsf{\color{red}-0.2265670492, -0.1958123425, -0.4742585463, -0.4672445367, -0.4468122404, -0.1184359504, -0.6635593859, -0.2787230019,}}}}$$ $${\large{\textbf{\textsf{\color{green}-0.2002650749, -0.1619683266, -0.472219622, -0.4652166381, -0.4301039474, -0.07670485276, -0.666985916, -0.2750470141,}}}}$$ } walker:10 { $${\large{\textbf{\textsf{\color{red}-0.4182748215, -0.4598168708, -0.4684819884, -0.4402202532, -0.5371956742, -0.513784481, -0.5035637364, -0.5719291039,}}}}$$ $${\large{\textbf{\textsf{\color{green}-0.4110670495, -0.4546444391, -0.4613898903, -0.4307233298, -0.5367432724, -0.5136592447, -0.5059701444, -0.5737831613,}}}}$$ } operator: Kinetic walker:7 { 4.928479132, 0.8201554144, 0.8145467491, 1.118536731, 0.3533317323, 3.789499434, 1.829937408, 0.9801671226, } 
walker:10 { 0.9144898038, 1.996084903, 0.6139300904, 1.817648332, 0.4737161281, 1.496927843, 0.2345167341, 0.2701589606, } operator: ElecIon walker:13 { $${\large{\textbf{\textsf{\color{red}-6.113824264, 0.9031039157,}}}}$$ $${\large{\textbf{\textsf{\color{green}-6.113824263, 0.9031039156,}}}}$$ } operator: ElecElec walker:13 { $${\large{\textbf{\textsf{\color{red}0.05641408621, 0.3360785922, 0.03636758295, -0.6596653841, 0.2069696059, -0.4563078874, 0.5152155582, 0.3432228312,}}}}$$ $${\large{\textbf{\textsf{\color{green}0.1085480282, 1.275110254, 0.1805840703, -0.6600145938, 0.3729281802, -0.4574106533, 1.500522892, 0.5129625017,}}}}$$ } operator: Kinetic walker:13 { 4.557073697, 1.5188274, 1.012381383, 0.4056043663, 1.104250985, 0.2246223635, 4.016778419, 0.7806561487, } operator: ElecIon walker:16 { $${\large{\textbf{\textsf{\color{red}-0.7950790307, -0.09958658938,}}}}$$ $${\large{\textbf{\textsf{\color{green}-0.7950790304, -0.09958658921,}}}}$$ } operator: ElecElec walker:16 { $${\large{\textbf{\textsf{\color{red}-0.5819086047, -0.6337144502, -0.3210667056, -0.3297344508, -0.6236538794, -0.2212464501, -0.7119794475, -0.7002309344,}}}}$$ $${\large{\textbf{\textsf{\color{green}-0.5818697522, -0.6376920718, -0.2721945795, -0.3093248028, -0.6234920858, -0.1543889267, -0.7154069896, -0.7024244044,}}}}$$ } operator: Kinetic walker:16 { 2.220251415, 1.283187776, 1.483814298, -0.08128895007, -0.1329285456, 1.391006299, 0.4933960366, 0.6240248522, } operator: ElecIon walker:1 { $${\large{\textbf{\textsf{\color{red}-0.4491443322, -2.194014847,}}}}$$ $${\large{\textbf{\textsf{\color{green}-0.4491443321, -2.194014846,}}}}$$ } walker:4 { -3.153938942, $${\large{\textbf{\textsf{\color{red}-7.565316266,}}}}$$ $${\large{\textbf{\textsf{\color{green}-7.565316265,}}}}$$ } operator: ElecElec walker:1 { $${\large{\textbf{\textsf{\color{red}-0.4817780866, -0.6747114292, -0.5762117079, -0.02249752954, -0.5052530745, -0.02892080713, -0.4230629699, -0.5421297883,}}}}$$ 
$${\large{\textbf{\textsf{\color{green}-0.477584006, -0.6786890508, -0.5742029928, 0.7012160282, -0.5022033621, 0.695843168, -0.4224314274, -0.5421383416,}}}}$$ } walker:4 { $${\large{\textbf{\textsf{\color{red}-0.5913822239, -0.2645956614, -0.6167784551, -0.6105776162, -0.100240388, -0.6692213933, -0.3698159777, -0.271149389,}}}}$$ $${\large{\textbf{\textsf{\color{green}-0.5913473533, -0.2347583468, -0.6167363299, -0.6108739304, -0.0434323183, -0.6706216854, -0.3570464308, -0.251401051,}}}}$$ } operator: Kinetic walker:1 { 2.979063528, -0.252384102, 0.8981294147, 2.083801905, -0.01062726717, 2.306985897, 1.764580184, 0.6587003395, } walker:4 { 5.734966851, 1.340090674, 0.6782811171, 0.2157491523, 11.68847484, 0.03206637096, 1.218968638, 1.275765734, } operator: ElecIon walker:7 { $${\large{\textbf{\textsf{\color{red}-3.157829301, -1.205983631,}}}}$$ $${\large{\textbf{\textsf{\color{green}-3.1578293, -1.20598363,}}}}$$ } walker:10 { -2.537942765, -2.451287802, } operator: ElecElec walker:7 { $${\large{\textbf{\textsf{\color{red}-0.4472415343, -0.2965967422, -0.5591666269, -0.3183908414, -0.5758818876, -0.1489233925, -0.4286864994, -0.6112975221,}}}}$$ $${\large{\textbf{\textsf{\color{green}-0.4254699718, -0.2556939462, -0.5590903808, -0.3035338545, -0.5758073543, -0.08370472561, -0.4169558991, -0.6134625703,}}}}$$ } walker:10 { $${\large{\textbf{\textsf{\color{red}-0.4706000904, -0.4629155312, -0.5677997478, -0.1414307832, -0.3677184575, -0.1338115233, -0.1278022677, -0.3388597858,}}}}$$ $${\large{\textbf{\textsf{\color{green}-0.4702140018, -0.4613721966, -0.5671334473, -0.01606136124, -0.3621592367, -0.00749421976, -0.007455528749, -0.3303617173,}}}}$$ } operator: Kinetic walker:7 { 3.838915072, 0.1799582864, 0.9262231914, 1.600479667, 2.781314652, 3.789465668, 0.3284701514, -0.11637011, } walker:10 { 1.039525881, 6.485131487, -0.0345605923, 2.043562463, 1.905531295, 0.5849435819, 3.776779791, -0.4726857077, } operator: ElecIon walker:13 { 
$${\large{\textbf{\textsf{\color{red}-5.076392981, 0.003410119782,}}}}$$ $${\large{\textbf{\textsf{\color{green}-5.07639298, 0.003410120048,}}}}$$ } operator: ElecElec walker:13 { $${\large{\textbf{\textsf{\color{red}-0.03304962176, -0.54588068, -0.4576507199, -0.1466769527, -0.2560095961, -0.2575927669, -0.3453651445, -0.6261756349,}}}}$$ $${\large{\textbf{\textsf{\color{green}0.08408196409, -0.5486145988, -0.455435572, -0.04270354246, -0.1127498415, -0.1145509007, -0.327540178, -0.6278729803,}}}}$$ } operator: Kinetic walker:13 { 9.971156692, 0.3490936456, 0.6976175636, 2.265817391, 0.5911124437, 1.098059564, 2.115186082, 0.4973898555, } operator: ElecIon walker:16 { -3.73880079, $${\large{\textbf{\textsf{\color{red}-0.6362311446,}}}}$$ $${\large{\textbf{\textsf{\color{green}-0.6362311444,}}}}$$ } operator: ElecElec walker:16 { $${\large{\textbf{\textsf{\color{red}-0.2143340733, -0.4904386797, -0.4521406379, -0.02034751606, -0.08360034465, -0.2267129624, -0.4255339721, -0.2741247048,}}}}$$ $${\large{\textbf{\textsf{\color{green}-0.1992260245, -0.4827940598, -0.4486311435, 0.07178845753, -0.05568024486, -0.2147349711, -0.4172425, -0.2029137505,}}}}$$ } operator: Kinetic walker:16 { 8.52362587, 1.060722389, 0.9434807095, 0.570261192, 0.3135366348, 1.568226881, 1.71608573, 0.1539185491, } operator: ElecIon walker:1 { $${\large{\textbf{\textsf{\color{red}0.7882111429, -3.785808419,}}}}$$ $${\large{\textbf{\textsf{\color{green}0.7882111426, -3.785808418,}}}}$$ } walker:4 { $${\large{\textbf{\textsf{\color{red}-0.07003005234, -5.447123731,}}}}$$ $${\large{\textbf{\textsf{\color{green}-0.0700300521, -5.44712373,}}}}$$ } operator: ElecElec walker:1 { $${\large{\textbf{\textsf{\color{red}-0.5074087302, -0.4187784896, -0.1305306507, -0.2802728965, -0.6134192419, -0.116613579, -0.3212253621, -0.3196259711,}}}}$$ $${\large{\textbf{\textsf{\color{green}-0.5058844673, -0.3330559939, -0.03968678395, -0.2671048158, -0.6130124226, -0.01939312295, -0.3061062052, 
-0.2320035666,}}}}$$ } walker:4 { $${\large{\textbf{\textsf{\color{red}-0.2148876054, -0.1668156791, -0.2740446661, -0.02168598547, -0.2889636659, -0.6234911465, -0.1515233824, -0.5058982221,}}}}$$ $${\large{\textbf{\textsf{\color{green}-0.1795003121, -0.1427757353, -0.2040020352, 0.01921948818, -0.2702290316, -0.6248930622, -0.07608760539, -0.5070697142,}}}}$$ } operator: Kinetic walker:1 { 1.51922084, -0.3610777291, 1.376972806, 4.898207613, 0.4389668808, 1.955994139, 1.861251984, -0.03894454746, } walker:4 { 2.799120385, 1.32958584, 0.765856286, 5.305450359, 0.8936968638, 1.351616348, 0.9800761602, 0.5668658149, } operator: ElecIon walker:7 { $${\large{\textbf{\textsf{\color{red}-0.574540053,}}}}$$ $${\large{\textbf{\textsf{\color{green}-0.5745400529,}}}}$$ -1.297789347, } walker:10 { -1.633106685, -1.738241051, } operator: ElecElec walker:7 { $${\large{\textbf{\textsf{\color{red}-0.6092717807, -0.3572180582, -0.529096525, -0.4838366953, -0.3161063286, -0.5512093459, -0.6381923892, -0.5064554539,}}}}$$ $${\large{\textbf{\textsf{\color{green}-0.6092388531, -0.3147956821, -0.5285543341, -0.4827635191, -0.2681252944, -0.5515679247, -0.641471963, -0.5070180305,}}}}$$ } walker:10 { $${\large{\textbf{\textsf{\color{red}-0.3130116032, -0.2280427472, -0.4042660033, -0.5729169376, -0.321940752, -0.44307527, -0.203242945, -0.3704391026,}}}}$$ $${\large{\textbf{\textsf{\color{green}-0.265472743, -0.1806558064, -0.397763949, -0.572550564, -0.2336423069, -0.4406921322, -0.1173278058, -0.3662981212,}}}}$$ } operator: Kinetic walker:7 { -0.4391857337, 1.236952458, 0.9369257958, 0.8907487414, 2.587801665, 4.007381588, -0.1490964657, 0.7523280883, } walker:10 { 0.3473757643, 1.858979917, 1.458943784, 0.468944545, 1.1628161, 1.199393864, 2.853080972, 1.166539433, } operator: ElecIon walker:13 { $${\large{\textbf{\textsf{\color{red}-5.715823238, 0.06638398088,}}}}$$ $${\large{\textbf{\textsf{\color{green}-5.715823237, 0.06638398083,}}}}$$ } operator: ElecElec walker:13 { 
$${\large{\textbf{\textsf{\color{red}0.0616767239, -0.6311385231, -0.6118433973, -0.3238026207, -0.1800396803, -0.6931387361, -0.1960699706, -0.2140459752,}}}}$$ $${\large{\textbf{\textsf{\color{green}0.2257051729, -0.6347335782, -0.6118422514, -0.3123478638, -0.03550129243, -0.6945492871, -0.1722618797, -0.1936410002,}}}}$$ } operator: Kinetic walker:13 { 11.96459299, -0.4204859552, 0.1001664229, 1.501411346, 1.038065181, 2.450993767, 2.523064955, 1.6948542, } operator: ElecIon walker:16 { -3.05870631, $${\large{\textbf{\textsf{\color{red}0.795745204,}}}}$$ $${\large{\textbf{\textsf{\color{green}0.7957452038,}}}}$$ } operator: ElecElec walker:16 { $${\large{\textbf{\textsf{\color{red}-0.3910890036, -0.6082046394, -0.2513795155, -0.6709108779, -0.3989067361, -0.1105767193, -0.4815399511, -0.5692695336,}}}}$$ $${\large{\textbf{\textsf{\color{green}-0.3860908291, -0.6120631646, -0.05385402461, -0.6712016992, -0.3962446136, 0.09037081876, -0.4839467025, -0.5698091244,}}}}$$ } operator: Kinetic walker:16 { 8.664964285, -0.09627562384, 1.186953593, 0.4565101976, 0.3220329672, 0.8904612137, 0.7502740122, 0.1079382266, } operator: ElecIon walker:1 { 0.6636870228, $${\large{\textbf{\textsf{\color{red}-2.257778256,}}}}$$ $${\large{\textbf{\textsf{\color{green}-2.257778255,}}}}$$ } walker:4 { $${\large{\textbf{\textsf{\color{red}-0.8794778975,}}}}$$ $${\large{\textbf{\textsf{\color{green}-0.8794778972,}}}}$$ -4.543718053, } operator: ElecElec walker:1 { $${\large{\textbf{\textsf{\color{red}-0.316695301, -0.4267629038, -0.07270154425, -0.3934689222, -0.3039742001, -0.01312719371, -0.6126227383, -0.01375182832,}}}}$$ $${\large{\textbf{\textsf{\color{green}-0.2304322057, -0.4272945659, 0.4293342366, -0.3407810567, -0.2484429698, 0.106176892, -0.6159855391, 0.5028899067,}}}}$$ } walker:4 { $${\large{\textbf{\textsf{\color{red}-0.3971074417, -0.4949702583, -0.08442609171, -0.1307971062, -0.5650468863, -0.4945585794, -0.0887963336, -0.3549649261,}}}}$$ 
$${\large{\textbf{\textsf{\color{green}-0.3686747541, -0.498743805, 0.04541805897, -0.09510391613, -0.5646031757, -0.4932337872, 0.0887536628, -0.299869067,}}}}$$ } operator: Kinetic walker:1 { 0.9254703611, 0.727897967, 1.258623679, 0.3283654725, 0.619953765, 1.765044052, 0.9155644913, 1.219969051, } walker:4 { 2.733587085, 0.7474187357, 0.8231803717, 5.606174714, 0.1123183281, 0.3989132328, 1.55613925, 2.408800723, }
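As a rough way to separate the two kinds of disagreement in a listing like the one above, the sketch below labels a value pair as rounding-level or divergent. It is in Python for brevity; the function name and the tolerances are assumptions chosen for illustration, not values used by the QMCPACK tests.

```python
import math

# Assumed tolerances for illustration only; not taken from the test suite.
REL_TOL = 1e-6
ABS_TOL = 1e-9


def classify_difference(reference: float, measured: float) -> str:
    """Label a pair as 'rounding' (agrees within tolerance) or 'divergent'."""
    if math.isclose(reference, measured, rel_tol=REL_TOL, abs_tol=ABS_TOL):
        return "rounding"
    return "divergent"


# Two pairs copied from the comparison above.
print(classify_difference(0.2739808123, 0.2739808125))  # ElecIon, walker 4
print(classify_difference(-0.1192263963, 2.238918156))  # ElecElec, walker 7
```

Under these assumed tolerances the first pair counts as rounding noise and the second as a divergent value, which is the pattern one would expect if a different move had been accepted.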

PDoakORNL avatar Jul 14 '25 16:07 PDoakORNL

@PDoakORNL the testing issue you saw is not unique to GPU. Even on CPU, subtle compiler flag or library changes may cause divergent random walk trajectories. Here is what you can try:

  1. If it is a VMC run, set warmupsteps to 0.
  2. Set steps to 1 and use as few blocks as possible.
  3. Choose a smaller timestep.
  4. Use fewer threads/crowds and MPI ranks.

ye-luo avatar Jul 14 '25 17:07 ye-luo

The test is 3 ranks since I don't think testing 1 or 2 really proves much; for estimator integration, multiple ranks and multiple crowds are needed. It's only sort of single threaded, since the crowds are run serially so multiple crowds can be tested without computation time races. But the hacky way I've defined the test in CMake leaves OpenMP with threads to create races in hidden OpenMP loops, which I know there are still a few of. I feel like we need a measurement period of 3 or more and at least two periods to test this. I'm going to address your review first, since I think we should merge this before all the testing is sorted out, as this is clearly a bit tricky.

PDoakORNL avatar Jul 14 '25 17:07 PDoakORNL

Since the testing is clearly painful, I'll note again that once the main implementation is agreed upon, we could go ahead and merge that independent of more sophisticated testing. That can be sorted later. If the implementation is buggy, things will be wildly wrong. Very simple testing would be OK to get started, e.g. CPU, 1 thread, 1 MPI rank: do 2 steps and 2 blocks with the measurement period set at either one or two, and check that the resultant accumulated weights of a simple observable are either 4 or 2. @PDoakORNL your call.
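The arithmetic behind that simple test can be sketched as follows. This is Python for brevity, not QMCPACK code; the function name is hypothetical, and it assumes a single walker of unit weight, a step counter that restarts in each block, and a measurement taken on the last step of each period.

```python
def accumulated_weight(blocks: int, steps: int, period: int,
                       walkers: int = 1, weight: float = 1.0) -> float:
    """Total weight an estimator would accumulate over a run.

    Assumption: a measurement happens whenever the in-block step
    number (counted from 1) is a multiple of the period.
    """
    total = 0.0
    for _block in range(blocks):
        for step in range(1, steps + 1):
            if step % period == 0:
                total += walkers * weight
    return total


# 2 blocks of 2 steps, as in the suggestion above.
print(accumulated_weight(blocks=2, steps=2, period=1))  # 4.0
print(accumulated_weight(blocks=2, steps=2, period=2))  # 2.0
```

A buggy period implementation would typically produce a total other than 4 or 2 here, which is why such a small case is enough for a first check.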

prckent avatar Jul 14 '25 20:07 prckent

Forgot to mention: you can restrict these tests to full precision builds by skipping them in mixed precision builds.

ye-luo avatar Jul 14 '25 21:07 ye-luo

Making a PR for 4.

I may or may not have the time in the future to fight about splitting the 112 line run method of DMCBatched and shifting state values to a passed struct that surfaces actual data dependency. It introduces long access expressions but that is preferable to simple looking access that covers up this issue.

PDoakORNL avatar Jul 15 '25 15:07 PDoakORNL

Noting the implementation was merged in #5574

prckent avatar Jul 16 '25 15:07 prckent

Any chance we could use a shorter parameter name than "estimator_measurement_period" (perhaps "estimator_period")?

Very long names make the input file (and Nexus input) unwieldy for users.

The most used (and user friendly) codes, e.g. VASP, use short names.

jtkrogel avatar Jul 16 '25 18:07 jtkrogel

Since I tend to think self-documenting variable names are better, I just extended that to the parameter name. @jtkrogel "estimator_period" seems fine. Since this PR is already in, maybe an issue or a PR should be made, and objectors could discuss it there.

PDoakORNL avatar Jul 16 '25 18:07 PDoakORNL

Since the feature is in and the testing has surfaced some more issues regarding determinacy of the code with respect to threads and ranks, I'm going to close this PR.

PDoakORNL avatar Jul 16 '25 18:07 PDoakORNL

No problem. I'll open an issue for further discussion.

jtkrogel avatar Jul 16 '25 19:07 jtkrogel