Dump walkers periodically for post-processing
Proposed changes
Repurpose the unused input "record_configs" to set the frequency of dumping walkers for post-processing.
Example input for a simple harmonic oscillator: tb38_msho.zip
A normal checkpoint file {prefix}.config.h5 is overwritten at every checkpoint and contains
```
block              Dataset {SCALAR}
number_of_walkers  Dataset {SCALAR}
walker_partition   Dataset {2}
walker_weights     Dataset {65}
walkers            Dataset {65, 1, 3}
```
The walkers dataset contains the snapshot at checkpoint.
With "record_configs" > 0, each addition to config.h5 is preserved. Instead of a single walkers dataset, each dump must be identified by the block at which the walkers were written:
```
block                Dataset {SCALAR}
number_of_walkers    Dataset {SCALAR}
walker_partition     Dataset {2}
walker_partition32   Dataset {2}
walker_partition64   Dataset {2}
walker_weights       Dataset {64}
walker_weights32     Dataset {63}
walker_weights64     Dataset {64}
walkers              Dataset {64, 1, 3}
walkers32            Dataset {63, 1, 3}
walkers64            Dataset {64, 1, 3}
```
The suffix 32 in the name walkers32 identifies that these walkers were dumped at block 32.
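The block-suffix naming convention can be sketched as follows. This is an illustrative helper, not code from the PR; `block_dataset_name` and `parse_block` are invented names:

```python
import re
from typing import Optional

def block_dataset_name(base: str, block: Optional[int]) -> str:
    # The plain checkpoint dataset carries no suffix; a preserved dump
    # appends the block number at which the walkers were written.
    return base if block is None else f"{base}{block}"

def parse_block(name: str, base: str) -> Optional[int]:
    # Recover the block number from a dataset name, or None for the
    # plain (most recent) checkpoint copy.
    m = re.fullmatch(re.escape(base) + r"(\d+)?", name)
    if m is None:
        raise ValueError(f"{name!r} is not a {base} dataset")
    return int(m.group(1)) if m.group(1) else None

print(block_dataset_name("walkers", 32))    # -> walkers32
print(parse_block("walker_weights64", "walker_weights"))  # -> 64
```

Note that with this scheme a reader must enumerate the file's dataset names to discover which blocks were dumped; there is no separate index dataset.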
What type(s) of changes does this code introduce?
- New feature
Does this introduce a breaking change?
- No
What systems has this change been tested on?
Intel workstation
Checklist
Update the following with a yes where the items apply. If you're unsure about any of them, don't hesitate to ask. This is simply a reminder of what we are going to look for before merging your code.
- Yes. This PR is up to date with the current state of 'develop'
- No. Code added or changed in the PR has been clang-formatted
- Yes/No. This PR adds tests to cover any new code, or to catch a bug that is being fixed
- Yes/No. Documentation has been added (if appropriate)
Hi Paul. Thanks for adding/restoring this ghost feature. As we have discussed, this functionality is useful for a variety of purposes, and it could clearly be evolved further.
For now, two questions:
- How much testing have you done that the energies and weights are correct and that, when averaged, they match what is in the scalar files?
- Does this work with vmc_batched?
The asan failure was found to be caused by the CI OS image. It has been fixed.
I also noticed that the added change only affects the legacy drivers. Out of curiosity, what is missing for you to adopt the batched driver?
With regards to batched, I want to remove WalkerConfiguration as an object and heavily refactor to eliminate the "Walker-->Particle" objects. The Walker object, and some of its data possessions that are stuffed into ParticleSet, is full of values that should be contiguous vectors at crowd or even rank scope, but they are instead stored here as an AOS with respect to walkers and mapped into a byte buffer. As it stands, all of this "dumping walkers" functionality is predicated on that layout; pointer arithmetic is done, etc. The less dependency there is on this, the better. I think we would be better off writing an "estimator" that writes the per-walker values out. The current approach exposes legacy walker implementation details that we don't want to carry forward either; to the extent that some still exist, it is because refactoring ParticleSet has been blocked for several years now.
If you want to dump walkers, it should be a crowd- or population-level operation so it can be synchronized properly and done efficiently. If you want to use parallel HDF5, the appropriate level would be MCPopulation. Ideally it would be done from a separate I/O thread while real work is being done by the main crowd threads.
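The separate-I/O-thread idea can be sketched generically: crowd threads enqueue completed walker snapshots, and a dedicated writer thread drains the queue so compute threads never block on disk. This is a language-agnostic illustration in Python, not QMCPACK code; the names (`snapshot_queue`, `io_writer`) are invented, and the actual write is replaced by a stand-in:

```python
import queue
import threading

snapshot_queue: "queue.Queue" = queue.Queue()
written = []

def io_writer():
    # Dedicated I/O thread: drain snapshots while crowd threads keep working.
    while True:
        item = snapshot_queue.get()
        if item is None:  # sentinel: no more snapshots
            break
        block, walkers = item
        written.append((block, len(walkers)))  # stand-in for an HDF5 write

writer = threading.Thread(target=io_writer)
writer.start()

# Crowd threads would enqueue (block, walker_data) and return immediately.
for block in (32, 64):
    snapshot_queue.put((block, [[0.0, 0.0, 0.0]] * 64))
snapshot_queue.put(None)  # signal shutdown
writer.join()
print(written)  # [(32, 64), (64, 64)]
```

The design point is that synchronization happens once at the population level (the enqueue), rather than each walker independently touching the file.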
@prckent apologies for the delayed response. To answer your questions:
How much testing have you done that the energies and weights are correct and that, when averaged, they match what is in the scalar files?
I did not test the energies and weights. The same dump function used for a normal checkpoint is called, so I just assumed it would be correct.
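One way to do the sanity check prckent suggests is to form the weighted average of per-walker local energies from a dumped block and compare it against the corresponding value in the scalar file. A minimal pure-Python sketch; the arrays here are made-up stand-ins for data that would be read from config.h5:

```python
def weighted_mean(values, weights):
    # Weighted block average: sum(w_i * E_i) / sum(w_i).
    wsum = sum(weights)
    if wsum == 0:
        raise ValueError("total walker weight is zero")
    return sum(w * v for w, v in zip(weights, values)) / wsum

# Stand-in per-walker local energies and weights for one dumped block.
energies = [-0.52, -0.48, -0.50, -0.51]
weights = [1.0, 0.9, 1.1, 1.0]
print(weighted_mean(energies, weights))  # -0.503
```

Agreement with the scalar-file block average (within roundoff) would confirm that the dumped weights correspond to the same step as the dumped configurations.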
Does this work with vmc_batched?
No, I did not add this feature to the batched drivers.
The current implementation breaks down in an MPI run. I think this is due to problems with parallel hdf5 file writing.
@ye-luo I added some more details of this feature in the PR description. Does this clear up the meaning of identify_block?
I will also add some comments in the source code.
@ye-luo @PDoakORNL I would love some help on getting this feature to work. It's not clear to me exactly what needs to be done to make parallel HDF5 writing correct.
Hi Paul - Is #5019 working for you? If so, we can close this PR.
@prckent please keep this PR alive. It is needed for different uses like replay.
@prckent yes, #5019 does what I need. It outputs a bit too much information, but it seems straightforward to modify, so I can go from there. I don't mind if you decide to close this PR.
@ye-luo would it be easier to add a "replay" flag to the new WalkerLog class? I think it just needs to output walker coordinates and weights at every step, without all the other debugging content.
WalkerLog has a very different design logic and use cases. For example, it writes one file per MPI rank instead of using scalable I/O. Its data are not organized by step, and thus extra work is required to recover a true replay.
I think this is no longer needed and can be closed.
I think a more refined version can be implemented when a replay feature or correlated sampling is properly implemented.