Fast, efficient implementation of SOC
Please review the developer documentation on the wiki of this project that contains help and requirements.
Proposed changes
This PR fixes a longstanding problem with SOC. Traditionally, due to the nature of the TrialWaveFunction which doesn't really exploit the fact that we only really ever use Jastrows * Determinants (or multidets), I had to do the spin integration numerically. This was quite inefficient, since we were essentially doing 8x the work since the default number of spin quadrature points was 8.
The spin integration can be done exactly if you can exploit the fact that the wavefunction is a exp(J) * Det(spinors), which can be further simplified into the ratio jastrows * inverse transpose slater matrix * spinors. Since the spinors are phi_up * exp(is) + phi_dn * exp(-is) you can pull the spin integration inside and do those terms exactly. That removes the need for the spin quadrature all together.
This is done by utilizing the TWFFastForceWrapper which gives convenient access to the appropriate quantities. A simple test problem with 16 walkers, 1000 blocks, 1 thread on my laptop shows the expected speedup.
Traditional slow evaluation: 103.8466 s New implementation: 12.3576 s speedup: 8.4
Right now, this only works with legacy I think. Need to look at mw_ APIs in the future. Had to add some new features to the TWFFastDerivWrapper and the SpinorSet, but everything is unit tested.
What type(s) of changes does this code introduce?
Delete the items that do not apply
- New feature
Does this introduce a breaking change?
- No
What systems has this change been tested on?
M1 mac
Checklist
Update the following with a yes where the items apply. If you're unsure about any of them, don't hesitate to ask. This is simply a reminder of what we are going to look for before merging your code.
- Yes. This PR is up to date with current the current state of 'develop'
- Yes. Code added or changed in the PR has been clang-formatted
- Yes. This PR adds tests to cover any new code, or to catch a bug that is being fixed
- Yes. Documentation has been added (if appropriate)
Is it possible to move this optimization to NLPP batched code path where NLPP ratios of dets (without Jastrow) are batched? I mean specialize https://github.com/QMCPACK/qmcpack/blob/a82117fe3bc030e90ea564f2fae407c76e0ef3b9/src/QMCWaveFunctions/SPOSet.h#L205
Is it possible to move this optimization to NLPP batched code path where NLPP ratios of dets (without Jastrow) are batched? I mean specialize
https://github.com/QMCPACK/qmcpack/blob/a82117fe3bc030e90ea564f2fae407c76e0ef3b9/src/QMCWaveFunctions/SPOSet.h#L205
evaluateDetRatios is used to give you psi(q)/psi(r). But I need it in a slightly different form in order to remove the spin integration and get what I want. Will send you some notes to take a look
Test this please
FYI, the more modern compilers are complaining e.g.
src/QMCHamiltonians/tests/test_SOECPotential.cpp:258:54: error: non-const lvalue reference to type 'Real' (aka 'float') cannot bind to a value of unrelated type 'Return_t' (aka 'double')
testing::TestSOECPotential::evalFast(so_ecp, elec, value)***
FYI, the more modern compilers are complaining e.g.
src/QMCHamiltonians/tests/test_SOECPotential.cpp:258:54: error: non-const lvalue reference to type 'Real' (aka 'float') cannot bind to a value of unrelated type 'Return_t' (aka 'double') testing::TestSOECPotential::evalFast(so_ecp, elec, value)***
This should be fixed now. There was a type mismatch in my unit test
Not sure I understand why linux (Clang14-NoMPI-UBSan-Real) is failing. Looks like test_parser, which is something I didn't touch
Test this please
What's the status of this?
A huge improvement and good to get in. Need to identify what is needed for batched code execution in followup work.
A huge improvement and good to get in. Need to identify what is needed for batched code execution in followup work.
I have been working on the batched implementation. Should have it submitted in the next few days
(Hadn't noticed the "Unchanged files with check annotations" beta feature before. Would be good to work the matrix of options in subsequent PRs to get coverage on these features since we historically have trouble with this sort of thing.)
(Hadn't noticed the "Unchanged files with check annotations" beta feature before. Would be good to work the matrix of options in subsequent PRs to get coverage on these features since we historically have trouble with this sort of thing.)
Where is this feature?
Where is this feature?
Click your "More static member functions." commit and scroll down.
The saving comes from the removal of numerical integration over spinors. Excellent! However, replying on TWFFastDerivWrapper requires SO being aware of the fine structure of TWF in the SO class. It seems cumbersome to me. Also it doesn't work with multdets and need a separate code path. I experimented not relying on TWFFastDerivWrapper https://github.com/camelto2/qmcpack/compare/fast_soc...ye-luo:qmcpack:soc-simplified?expand=1 which allows further expansion to multidet without loss of generality. I will work with @camelto2 offline to sort things out before finalizing this PR.
If speeding up the SOECP evaluation was all we needed or intended to do with this, I agree 100% with Ye's proposed solution. However, next step for this and NonLocalECP is to speed up the derivative evaluation w.r.t. orbital rotation parameters--which is where doing it this way pulls ahead of the proposed alternative method.
The saving comes from the removal of numerical integration over spinors. Excellent! However, replying on TWFFastDerivWrapper requires SO being aware of the fine structure of TWF in the SO class. It seems cumbersome to me. Also it doesn't work with multdets and need a separate code path. I experimented not relying on TWFFastDerivWrapper https://github.com/camelto2/qmcpack/compare/fast_soc...ye-luo:qmcpack:soc-simplified?expand=1 which allows further expansion to multidet without loss of generality. I will work with @camelto2 offline to sort things out before finalizing this PR.
If speeding up the SOECP evaluation was all we needed or intended to do with this, I agree 100% with Ye's proposed solution. However, next step for this and NonLocalECP is to speed up the derivative evaluation w.r.t. orbital rotation parameters--which is where doing it this way pulls ahead of the proposed alternative method.
Parameter derivative calculation and SO evaluation (part of Hamitonian) are in separate driver stages and code paths. They don't necessarily need to overlap in terms of algorithms. I do feel calculating parameter derivatives have a different pattern and probably need this formalism but I don't think much code can be reused from this PR. We will see. For SO evaluation, we do want an implementation that is portable to GPU otherwise users got a huge performance hit when turning on SO.
@ye-luo Fair point. In that case, I don't see any problems with the solution you've laid out.
Just checking on the status of this. I assume the rework is being done with the limited time available. Any gotchas?
Just no time. I reworked it using Ye's approach on my end, and got the unit test passing for the SOECPotential. Just need to finish adapting the unit test for the changes to the SpinorSet. Will hopefully get to it today
Sorry for taking so long, this should be up to date with Ye's recommendations
There is a great deal of per walker data stashed in SpinorSet and the force wrapper. In fact none of this really looks like it took into account the batched design. These should really have been crowd scope memory resources i.e setup through resource sets (which will also allow use to track memory use much more easily). Evaluations should be over these whole blocks and not broken up by walker unless necessary. Certainly the end goal should be a single kernel launch to deal with the force calculations over a block of walkers.
The way this is setup right now assuming a Wrapper object per walker and a SpinsorSet object per walker that a bunch of workspace or additional state can be stashed in really isn't great. Extracting pointers from each walker and its inheritance pile is not a good design and was just a hack to avoid refactoring too much at once.
Thanks Cody. I am hopeful that this gets merged after the CI passes. The real build needs fixing and there are sanitizer failures.
Peter: I think we should see where we end up with e.g. real world memory usage and performance for observables before tackling design issues here.
Test this please
Test this please
Test this please