alchemlyb
WIP: bootstrapping submodule, parameter for estimators
This PR addresses #80.
We have added an `alchemlyb.preprocessing.bootstrapping` submodule. The functions exposed by this module should work on both `dHdl` and `u_nk` standard forms, with a similar usage feel to the `alchemlyb.preprocessing.subsampling` functions. The following are needed to complete this PR:
- [ ] tests
- [ ] usage example(s)
- [ ] preprocessing documentation for the bootstrapping submodule
- [ ] addition to base estimator class
- [ ] at least one block bootstrapping implementation
In addition to these functions, we want to add a `bootstrapping` keyword (or set of keywords) to each estimator's constructor, which changes how uncertainties are estimated internally by the estimator at the expense of more compute. This will be the approach we push users toward for simplicity, but we still want to expose the building blocks by way of the `bootstrapping` module functions to allow for scaling out if needed (e.g. with dask).
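As a rough sketch of what such a constructor keyword could look like, here is a toy stand-in. The `ToyEstimator` class, the `bootstrap` parameter name, and the attribute names are all hypothetical illustrations, not the PR's actual API:

```python
import numpy as np

class ToyEstimator:
    """Hypothetical sketch of the proposed estimator keyword.

    The class name, the ``bootstrap`` parameter, and the attribute names
    are illustrative only; none of this is the PR's final API.
    """

    def __init__(self, bootstrap=None, seed=0):
        # bootstrap=None -> analytical standard error (current behavior)
        # bootstrap=N    -> standard error from N bootstrap resamples
        self.bootstrap = bootstrap
        self.rng = np.random.default_rng(seed)

    def fit(self, data):
        data = np.asarray(data, dtype=float)
        self.delta_f_ = data.mean()  # stand-in for a free energy estimate
        if self.bootstrap is None:
            self.d_delta_f_ = data.std(ddof=1) / np.sqrt(len(data))
        else:
            # refit on resampled data; the spread of the refits is the error
            reps = [self.rng.choice(data, size=len(data), replace=True).mean()
                    for _ in range(self.bootstrap)]
            self.d_delta_f_ = np.std(reps, ddof=1)
        return self

est = ToyEstimator(bootstrap=200).fit(np.random.default_rng(1).normal(0.0, 1.0, 500))
```

The design point is simply that switching `bootstrap` from `None` to an integer trades compute time for a resampling-based uncertainty, without changing the rest of the user's workflow.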
As for block bootstrapping, we want to include an approach that bootstraps whole blocks of a timeseries instead of individual samples from that timeseries. Blocks should be defined based on the correlation time determined from the timeseries, or alternatively specified by the user. There are at least a few ways to do this, so this may require more debate and development. The advantage of block bootstrapping is that the resulting bootstrapped samples maintain correlation within their blocks, which may be desirable in some cases.
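A minimal sketch of the fixed-length block variant described above, assuming the block length has already been derived from the correlation time (the `block_bootstrap` helper is illustrative, not part of the PR):

```python
import numpy as np

def block_bootstrap(timeseries, block_length, rng=None):
    """Resample whole contiguous blocks with replacement (illustrative).

    ``block_length`` would normally be derived from the correlation time
    of the timeseries; here the caller supplies it directly.
    """
    rng = np.random.default_rng(rng)
    data = np.asarray(timeseries)
    n_blocks = len(data) // block_length
    # split into contiguous, non-overlapping blocks; drop the remainder
    blocks = data[:n_blocks * block_length].reshape(n_blocks, block_length)
    # draw whole blocks with replacement, preserving correlation within each
    chosen = rng.integers(0, n_blocks, size=n_blocks)
    return blocks[chosen].ravel()

resampled = block_bootstrap(np.arange(100), block_length=10, rng=42)
```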
Codecov Report
Merging #94 into master will decrease coverage by 2.05%. The diff coverage is 0%.
```diff
@@            Coverage Diff             @@
##           master      #94      +/-   ##
==========================================
- Coverage   97.27%   95.21%   -2.06%
==========================================
  Files          12       13       +1
  Lines         696      711      +15
  Branches      141      143       +2
==========================================
  Hits          677      677
- Misses          5       20      +15
  Partials       14       14
```
| Impacted Files | Coverage Δ |
|---|---|
| src/alchemlyb/preprocessing/bootstrapping.py | 0% <0%> (ø) |
Continue to review full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ed6cb72...e8db134.
Added a gist on how this works currently. Comments welcome!
I already like the gist and the dask example. However, shouldn't you decorrelate your data first and then bootstrap the decorrelated data?
For the "blocks" approach: do I have to think of a trajectory of blocks where each block is the slice information needed to get the original data, something like `slices = [[start0, stop0, step0], [start1, stop1, step1], ..., [startN_1, stopN_1, stepN_1]]`, so that `data[slices]` produces N blocks? Then you bootstrap the `indices = arange(len(slices))` of `slices` to reassemble a subset of the original data as `data[slices[bootstrapped_indices]]`?
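In NumPy terms, that reassembly idea might look like the following (purely illustrative; the `slices` list here uses `(start, stop, step)` tuples as described above, and the block boundaries are arbitrary):

```python
import numpy as np

data = np.arange(20)

# each entry is the (start, stop, step) slice information for one block
slices = [(0, 5, 1), (5, 10, 1), (10, 15, 1), (15, 20, 1)]

rng = np.random.default_rng(0)
# bootstrap the block indices rather than individual samples
bootstrapped_indices = rng.integers(0, len(slices), size=len(slices))

# reassemble a bootstrapped dataset from whole blocks of the original data
resampled = np.concatenate(
    [data[slice(*slices[i])] for i in bootstrapped_indices]
)
```

Each block survives intact in the resampled array; only the order and multiplicity of blocks change.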
> I already like the gist and the dask example. However, shouldn't you decorrelate your data first and then bootstrap the decorrelated data?
For vanilla bootstrap, yes.
For block bootstrap, no -- though to determine the length of the blocks, you do need to know the correlation length.
If I read the gist correctly, then "vanilla bootstrap" is performed there, so if the gist is turned into documentation then data decorrelation should probably be included so that copy & paste will at least produce conservative estimates.
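A hedged sketch of that "decorrelate first, then bootstrap" order, using a plain subsampling stride `g` in place of the value that alchemlyb's `statistical_inefficiency` preprocessing would determine from the data:

```python
import numpy as np

def decorrelate_then_bootstrap(series, g, n_boot=100, rng=None):
    """Subsample every ``g``-th point, then bootstrap the decorrelated data.

    ``g`` stands in for the statistical inefficiency that alchemlyb's
    ``statistical_inefficiency`` preprocessing would measure from the data.
    """
    rng = np.random.default_rng(rng)
    decorrelated = np.asarray(series)[::int(np.ceil(g))]
    n = len(decorrelated)
    # each replicate is a vanilla bootstrap of the decorrelated series
    return np.array([
        rng.choice(decorrelated, size=n, replace=True).mean()
        for _ in range(n_boot)
    ])

means = decorrelate_then_bootstrap(np.sin(np.arange(1000) * 0.01), g=10, rng=3)
```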
From the description in the issue above I still don't understand how block bootstraps will be used. A simple example might be helpful.
@orbeckst thanks! My intent is for the gist to become our actual example for bootstrap usage in the docs, so I am working on adding use of `statistical_inefficiency` in the loop for the "Bootstrapping + estimation" section onward.
However, it's looking like this isn't straightforward with the current implementation limitations of our subsamplers. I think it may be due to a combination of bitrot by way of changes in how `pandas` behaves with multi-index resetting and my rather limiting assumptions when I wrote the subsamplers way back when. I'm going to proceed with getting this right by making the subsampling functions work properly with full DataFrames like what we get from `alchemtest.gmx.load_expanded_ensemble_case_1`; we need this since the `pandas`-fu required to work around it is proving too painful. This will be a separate PR.
I believe you have the right idea on how block bootstrapping works. From @mrshirts, there are a few variations on it, however. Here is a reference I am working from.
The gist has been updated; it requires components of #98, which can be played with on this branch. This addresses @orbeckst's comments on decorrelation for when this notebook becomes the first example doc for bootstrapping.
I think pymbar 4 now contains bootstrapping natively (did I understand this correctly, @mrshirts ?). Is this PR still needed when we move to using pymbar 4.x #207 ?
GREAT question. If you are bootstrapping a single free energy calculation, or a single expectation, pymbar 4 handles it all, and faster than alchemlyb could by wrapping around MBAR.
If you were doing something like a heat capacity calculation with multiple-temperature reweighting, or where you do some manipulation of the results of multiple free energy calculations, then you really should bootstrap on the outside of that. Entropy + enthalpy are done internally. But if for whatever reason you were calculating k_1 <O_1> - k_2 <O_2>, where <O_1> and <O_2> are expectations of different observables from the same simulation and k_1 and k_2 are constants, then you can't bootstrap them separately and add the uncertainties together; you should bootstrap the entire calculation.
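That point can be made concrete with a toy sketch: draw one resample of the shared configurations per replicate and evaluate the full expression k_1 &lt;O_1&gt; - k_2 &lt;O_2&gt; on it, rather than bootstrapping each expectation separately (all names and numbers below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# O1 and O2 are measured on the *same* configurations, so they correlate
O1 = rng.normal(1.0, 0.5, size=2000)
O2 = O1 + rng.normal(0.0, 0.1, size=2000)
k1, k2 = 2.0, 2.0

def bootstrap_combined(n_boot=500):
    """Bootstrap the whole expression k1*<O1> - k2*<O2> at once."""
    n = len(O1)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # one shared resample per replicate
        estimates[b] = k1 * O1[idx].mean() - k2 * O2[idx].mean()
    return estimates.std(ddof=1)

# bootstrapping O1 and O2 separately and adding their uncertainties would
# ignore the correlation between them and grossly overestimate the error
sigma_combined = bootstrap_combined()
```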
If all the functions are just computing a vector f_k from a single set of files that have been subsampled, then the internal bootstrapping should be fine.
PR #322 addressed #320 so we can now use MBAR-native bootstrap.
I feel that this is sufficient for the moment, especially as TI can get error estimates by other means.
Therefore, I am closing this PR, although of course anyone is welcome to resurrect it and take up work on it again.