qmcpack Fast, efficient implementation of SOC

Please review the developer documentation on the wiki of this project that contains help and requirements.

Proposed changes

This PR fixes a longstanding problem with SOC. Traditionally, due to the nature of the TrialWaveFunction which doesn't really exploit the fact that we only really ever use Jastrows * Determinants (or multidets), I had to do the spin integration numerically. This was quite inefficient, since we were essentially doing 8x the work since the default number of spin quadrature points was 8.

The spin integration can be done exactly if you can exploit the fact that the wavefunction is a exp(J) * Det(spinors), which can be further simplified into the ratio jastrows * inverse transpose slater matrix * spinors. Since the spinors are phi_up * exp(is) + phi_dn * exp(-is) you can pull the spin integration inside and do those terms exactly. That removes the need for the spin quadrature all together.

This is done by utilizing the TWFFastForceWrapper which gives convenient access to the appropriate quantities. A simple test problem with 16 walkers, 1000 blocks, 1 thread on my laptop shows the expected speedup.

Traditional slow evaluation: 103.8466 s New implementation: 12.3576 s speedup: 8.4

Right now, this only works with legacy I think. Need to look at mw_ APIs in the future. Had to add some new features to the TWFFastDerivWrapper and the SpinorSet, but everything is unit tested.

What type(s) of changes does this code introduce?

Delete the items that do not apply

New feature

Does this introduce a breaking change?

No

What systems has this change been tested on?

M1 mac

Checklist

Update the following with a yes where the items apply. If you're unsure about any of them, don't hesitate to ask. This is simply a reminder of what we are going to look for before merging your code.

Yes. This PR is up to date with current the current state of 'develop'
Yes. Code added or changed in the PR has been clang-formatted
Yes. This PR adds tests to cover any new code, or to catch a bug that is being fixed
Yes. Documentation has been added (if appropriate)

Feb 20 '24 01:02 camelto2

Is it possible to move this optimization to NLPP batched code path where NLPP ratios of dets (without Jastrow) are batched? I mean specialize https://github.com/QMCPACK/qmcpack/blob/a82117fe3bc030e90ea564f2fae407c76e0ef3b9/src/QMCWaveFunctions/SPOSet.h#L205

Feb 20 '24 02:02 ye-luo

Is it possible to move this optimization to NLPP batched code path where NLPP ratios of dets (without Jastrow) are batched? I mean specialize

https://github.com/QMCPACK/qmcpack/blob/a82117fe3bc030e90ea564f2fae407c76e0ef3b9/src/QMCWaveFunctions/SPOSet.h#L205

evaluateDetRatios is used to give you psi(q)/psi(r). But I need it in a slightly different form in order to remove the spin integration and get what I want. Will send you some notes to take a look

Feb 20 '24 16:02 camelto2

Test this please

Feb 20 '24 18:02 prckent

FYI, the more modern compilers are complaining e.g.

src/QMCHamiltonians/tests/test_SOECPotential.cpp:258:54: error: non-const lvalue reference to type 'Real' (aka 'float') cannot bind to a value of unrelated type 'Return_t' (aka 'double')
  testing::TestSOECPotential::evalFast(so_ecp, elec, value)***

Feb 20 '24 19:02 prckent

FYI, the more modern compilers are complaining e.g.

src/QMCHamiltonians/tests/test_SOECPotential.cpp:258:54: error: non-const lvalue reference to type 'Real' (aka 'float') cannot bind to a value of unrelated type 'Return_t' (aka 'double')
  testing::TestSOECPotential::evalFast(so_ecp, elec, value)***

This should be fixed now. There was a type mismatch in my unit test

Feb 20 '24 20:02 camelto2

Not sure I understand why linux (Clang14-NoMPI-UBSan-Real) is failing. Looks like test_parser, which is something I didn't touch

Feb 20 '24 22:02 camelto2

Test this please

Feb 21 '24 03:02 ye-luo

What's the status of this?

Feb 27 '24 14:02 rcclay

A huge improvement and good to get in. Need to identify what is needed for batched code execution in followup work.

Feb 27 '24 16:02 prckent

A huge improvement and good to get in. Need to identify what is needed for batched code execution in followup work.

I have been working on the batched implementation. Should have it submitted in the next few days

Feb 27 '24 17:02 camelto2

(Hadn't noticed the "Unchanged files with check annotations" beta feature before. Would be good to work the matrix of options in subsequent PRs to get coverage on these features since we historically have trouble with this sort of thing.)

Feb 28 '24 15:02 prckent

(Hadn't noticed the "Unchanged files with check annotations" beta feature before. Would be good to work the matrix of options in subsequent PRs to get coverage on these features since we historically have trouble with this sort of thing.)

Where is this feature?

Feb 28 '24 17:02 ye-luo

Where is this feature?

Click your "More static member functions." commit and scroll down.

Feb 28 '24 18:02 prckent

The saving comes from the removal of numerical integration over spinors. Excellent! However, replying on TWFFastDerivWrapper requires SO being aware of the fine structure of TWF in the SO class. It seems cumbersome to me. Also it doesn't work with multdets and need a separate code path. I experimented not relying on TWFFastDerivWrapper https://github.com/camelto2/qmcpack/compare/fast_soc...ye-luo:qmcpack:soc-simplified?expand=1 which allows further expansion to multidet without loss of generality. I will work with @camelto2 offline to sort things out before finalizing this PR.

If speeding up the SOECP evaluation was all we needed or intended to do with this, I agree 100% with Ye's proposed solution. However, next step for this and NonLocalECP is to speed up the derivative evaluation w.r.t. orbital rotation parameters--which is where doing it this way pulls ahead of the proposed alternative method.

Feb 29 '24 23:02 rcclay

The saving comes from the removal of numerical integration over spinors. Excellent! However, replying on TWFFastDerivWrapper requires SO being aware of the fine structure of TWF in the SO class. It seems cumbersome to me. Also it doesn't work with multdets and need a separate code path. I experimented not relying on TWFFastDerivWrapper https://github.com/camelto2/qmcpack/compare/fast_soc...ye-luo:qmcpack:soc-simplified?expand=1 which allows further expansion to multidet without loss of generality. I will work with @camelto2 offline to sort things out before finalizing this PR.

If speeding up the SOECP evaluation was all we needed or intended to do with this, I agree 100% with Ye's proposed solution. However, next step for this and NonLocalECP is to speed up the derivative evaluation w.r.t. orbital rotation parameters--which is where doing it this way pulls ahead of the proposed alternative method.

Parameter derivative calculation and SO evaluation (part of Hamitonian) are in separate driver stages and code paths. They don't necessarily need to overlap in terms of algorithms. I do feel calculating parameter derivatives have a different pattern and probably need this formalism but I don't think much code can be reused from this PR. We will see. For SO evaluation, we do want an implementation that is portable to GPU otherwise users got a huge performance hit when turning on SO.

Mar 01 '24 00:03 ye-luo

@ye-luo Fair point. In that case, I don't see any problems with the solution you've laid out.

Mar 01 '24 02:03 rcclay

Just checking on the status of this. I assume the rework is being done with the limited time available. Any gotchas?

Mar 26 '24 16:03 prckent

Just no time. I reworked it using Ye's approach on my end, and got the unit test passing for the SOECPotential. Just need to finish adapting the unit test for the changes to the SpinorSet. Will hopefully get to it today

Mar 26 '24 17:03 camelto2

Sorry for taking so long, this should be up to date with Ye's recommendations

Apr 04 '24 16:04 camelto2

There is a great deal of per walker data stashed in SpinorSet and the force wrapper. In fact none of this really looks like it took into account the batched design. These should really have been crowd scope memory resources i.e setup through resource sets (which will also allow use to track memory use much more easily). Evaluations should be over these whole blocks and not broken up by walker unless necessary. Certainly the end goal should be a single kernel launch to deal with the force calculations over a block of walkers.

The way this is setup right now assuming a Wrapper object per walker and a SpinsorSet object per walker that a bunch of workspace or additional state can be stashed in really isn't great. Extracting pointers from each walker and its inheritance pile is not a good design and was just a hack to avoid refactoring too much at once.

Apr 04 '24 18:04 PDoakORNL

Thanks Cody. I am hopeful that this gets merged after the CI passes. The real build needs fixing and there are sanitizer failures.

Peter: I think we should see where we end up with e.g. real world memory usage and performance for observables before tackling design issues here.

Apr 05 '24 14:04 prckent

Test this please

Apr 10 '24 03:04 ye-luo

Test this please

Apr 10 '24 04:04 ye-luo

Test this please

Apr 11 '24 23:04 ye-luo