
WIP: bootstrapping submodule, parameter for estimators

dotsdl opened this pull request • 9 comments

This PR addresses #80.

We have added an alchemlyb.preprocessing.bootstrapping submodule. The functions exposed by this module should work on both dHdl and u_nk standard forms, with a similar usage feel to the alchemlyb.preprocessing.subsampling functions. The following are needed to complete this PR:

  • [ ] tests
  • [ ] usage example(s)
  • [ ] preprocessing documentation for bootstrapping submodule
  • [ ] addition to base estimator class
  • [ ] at least one block bootstrapping implementation

In addition to the module functions, we want to add a bootstrapping keyword (or set of keywords) to each estimator's constructor, which changes how uncertainties are estimated internally to the estimator at the expense of more compute. This is the approach we will push users toward for simplicity, but we still want to expose the building blocks by way of the bootstrapping module functions to allow for scaling out if needed (e.g. with dask).
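Since the module API and the estimator keyword are still WIP, here is only a minimal sketch of the building-block pattern described above, using pieces that already exist (alchemtest's benzene dataset, extract_u_nk, and the MBAR estimator). The resample helper is purely illustrative and stands in for the planned alchemlyb.preprocessing.bootstrapping functions; the estimator-level keyword is not shown because its name has not been settled.

```python
import numpy as np
import pandas as pd

from alchemtest.gmx import load_benzene
from alchemlyb.parsing.gmx import extract_u_nk
from alchemlyb.estimators import MBAR

# one u_nk frame per lambda window (Coulomb leg of the benzene test dataset)
u_nks = [extract_u_nk(xvg, T=300) for xvg in load_benzene()["data"]["Coulomb"]]

rng = np.random.default_rng(42)

def resample(u_nk, rng):
    """Plain bootstrap of a single window: draw rows with replacement.

    Illustrative stand-in for the planned bootstrapping submodule;
    assumes the frame has already been decorrelated.
    """
    idx = rng.integers(0, len(u_nk), size=len(u_nk))
    return u_nk.iloc[idx]

# distribution of the end-to-end free energy over bootstrap replicates
deltas = []
for _ in range(50):
    replicate = pd.concat([resample(u, rng) for u in u_nks])
    deltas.append(MBAR().fit(replicate).delta_f_.iloc[0, -1])

print(np.mean(deltas), np.std(deltas, ddof=1))  # bootstrap mean and uncertainty
```

The dask angle mentioned above would simply map these replicate fits out to workers instead of running them in a serial loop.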

As for block bootstrapping, we want to include an approach that bootstraps whole blocks of a timeseries instead of individual samples from that timeseries. Blocks should be defined based on the correlation time determined from the timeseries, or alternatively specified by the user. There are at least a few ways to do this, so this may require more debate and development. The advantage of block bootstrapping is that the resulting bootstrapped samples maintain correlation within their blocks, which may be desirable in some cases.
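As a rough illustration of the simplest (non-overlapping) variant, here is a sketch against a toy correlated series; pymbar's statistical_inefficiency picks the block length, and moving-block or circular-block schemes would be drop-in variations. The function name is illustrative, not part of the proposed API.

```python
import numpy as np
from pymbar import timeseries  # pymbar >= 4 naming; pymbar 3 uses statisticalInefficiency

def block_bootstrap(x, rng, block_length=None):
    """One non-overlapping block-bootstrap replicate of a 1-D timeseries."""
    x = np.asarray(x)
    if block_length is None:
        # statistical inefficiency g = 1 + 2*tau; use it as the block length
        g = timeseries.statistical_inefficiency(x)
        block_length = max(1, int(np.ceil(g)))
    n_blocks = len(x) // block_length
    blocks = x[: n_blocks * block_length].reshape(n_blocks, block_length)
    picked = rng.integers(0, n_blocks, size=n_blocks)
    return blocks[picked].ravel()

rng = np.random.default_rng(0)
series = np.empty(2000)                     # toy AR(1) series with correlation time ~ 19
series[0] = 0.0
for i in range(1, len(series)):
    series[i] = 0.9 * series[i - 1] + rng.normal()

replicate = block_bootstrap(series, rng)    # correlation is preserved inside each block
```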

dotsdl avatar Jan 20 '20 23:01 dotsdl

Codecov Report

Merging #94 into master will decrease coverage by 2.05%. The diff coverage is 0%.


@@            Coverage Diff             @@
##           master      #94      +/-   ##
==========================================
- Coverage   97.27%   95.21%   -2.06%     
==========================================
  Files          12       13       +1     
  Lines         696      711      +15     
  Branches      141      143       +2     
==========================================
  Hits          677      677              
- Misses          5       20      +15     
  Partials       14       14
Impacted Files                                 Coverage Δ
src/alchemlyb/preprocessing/bootstrapping.py   0% <0%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update ed6cb72...e8db134.

codecov[bot] avatar Jan 20 '20 23:01 codecov[bot]

Added a gist on how this works currently. Comments welcome!

dotsdl avatar Jan 21 '20 07:01 dotsdl

I already like the gist and the dask example. However, shouldn't you decorrelate your data first and then bootstrap the decorrelated data?

For the "blocks" approach: Do I have to think about a trajectory of blocks where each block is the slice information to get the original data, something like slices = [[start0, stop0, step0], [start1, stop1, step1], ..., [startN_1, stopN_1, stepN_1]] so that data[slices] produces N blocks. Then you bootstrap the indices = arange(len(slices)) of slices to reassemble a subset of the original data as data[slices[bootstrapped_indices]]?

orbeckst avatar Jan 23 '20 23:01 orbeckst

> I already like the gist and the dask example. However, shouldn't you decorrelate your data first and then bootstrap the decorrelated data?

For vanilla bootstrap, yes.

For block bootstrap, no -- though to determine the length of the blocks, you do need to know the correlation length.
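In code, the vanilla recipe amounts to something like the following sketch, which runs the existing statistical_inefficiency subsampler before resampling individual rows; the dataset and the choice of decorrelation series are illustrative.

```python
import numpy as np
from alchemtest.gmx import load_benzene
from alchemlyb.parsing.gmx import extract_dHdl
from alchemlyb.preprocessing.subsampling import statistical_inefficiency

# decorrelate first ...
dHdl = extract_dHdl(load_benzene()["data"]["Coulomb"][0], T=300)
decorrelated = statistical_inefficiency(dHdl, series=dHdl.sum(axis=1))

# ... then resample individual rows with replacement
rng = np.random.default_rng(7)
idx = rng.integers(0, len(decorrelated), size=len(decorrelated))
replicate = decorrelated.iloc[idx]   # feed this (plus the other windows) to an estimator
```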

mrshirts avatar Jan 24 '20 00:01 mrshirts

If I read the gist correctly, a "vanilla bootstrap" is performed there, so if the gist is turned into documentation then data decorrelation should probably be included so that copy & paste at least produces conservative estimates.

From the description in the issue above I still don't understand how block bootstraps will be used. A simple example might be helpful.

orbeckst avatar Jan 24 '20 00:01 orbeckst

@orbeckst thanks! My intent is for the gist to become our actual example for bootstrap usage in the docs, so I am working on adding use of statistical_inefficiency in the loop for the Bootstrapping + estimation section onward.

However, it's looking like this isn't straightforward given the current implementation limitations of our subsamplers. I think it's due to a combination of bitrot from changes in how pandas handles multi-index resetting and the rather limiting assumptions I made when I wrote the subsamplers way back when. I'm going to get this right by making the subsampling functions work cleanly with full DataFrames like the ones we get from alchemtest.gmx.load_expanded_ensemble_case_1; working around it here is proving to require too much painful pandas-fu. That will be a separate PR.

I believe you have the right idea on how block bootstrapping works. According to @mrshirts, however, there are a few variations on it. Here is a reference I am working from.

dotsdl avatar Jan 24 '20 05:01 dotsdl

The gist has been updated; it requires components of #98, which can be played with on this branch. This addresses @orbeckst's comments on decorrelation for when this notebook becomes the first example doc for bootstrapping.

dotsdl avatar Jan 29 '20 06:01 dotsdl

I think pymbar 4 now contains bootstrapping natively (did I understand this correctly, @mrshirts?). Is this PR still needed when we move to using pymbar 4.x (#207)?

orbeckst avatar Jul 26 '22 17:07 orbeckst

GREAT question. If you are bootstrapping a single free energy calculation, or a single expectation, pymbar 4 handles it all, and faster than alchemlyb could by wrapping around MBAR.

If you were doing something like a heat capacity calculation with reweighting to multiple temperatures, or anything where you manipulate the results of multiple free energy calculations, then you really should bootstrap on the outside of that. Entropy + enthalpy are done internally. But if for whatever reason you were calculating k_1<O_1> - k_2<O_2>, where <O_1> and <O_2> are expectations of different observables from the same simulation and k_1 and k_2 are constants, then you can't bootstrap them separately and add the uncertainties together; you should bootstrap the entire calculation.

If all the functions are just computing a vector f_k from a single set of files that have been subsampled, then the internal bootstrapping should be fine.
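A toy numeric illustration of that last point (nothing alchemlyb-specific; the observables and constants are made up):

```python
import numpy as np

rng = np.random.default_rng(3)

# toy observables from the *same* simulation, so <O_1> and <O_2> are correlated
O1 = rng.normal(size=5000)
O2 = O1 + 0.1 * rng.normal(size=5000)
k1, k2 = 2.0, 1.5

def bootstrap_std(estimate, n_boot=500):
    """Bootstrap standard error of estimate(resampled_indices)."""
    values = [estimate(rng.integers(0, len(O1), size=len(O1))) for _ in range(n_boot)]
    return np.std(values, ddof=1)

# wrong: bootstrap each expectation separately, then add uncertainties in quadrature
separate = np.hypot(k1 * bootstrap_std(lambda i: O1[i].mean()),
                    k2 * bootstrap_std(lambda i: O2[i].mean()))

# right: bootstrap the entire expression, so the O1/O2 correlation is preserved
joint = bootstrap_std(lambda i: k1 * O1[i].mean() - k2 * O2[i].mean())

print(separate, joint)   # the separate-then-add number overstates the uncertainty here
```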

mrshirts avatar Jul 26 '22 19:07 mrshirts

PR #322 addressed #320, so we can now use MBAR-native bootstrapping.
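For reference, the native route at the pymbar level looks roughly like the sketch below (toy data; n_bootstraps and uncertainty_method="bootstrap" are the pymbar 4 names as I understand them, so check the pymbar docs if they have changed). alchemlyb's MBAR estimator exposes this machinery via #322.

```python
import numpy as np
from pymbar import MBAR   # pymbar >= 4

rng = np.random.default_rng(5)

# two harmonic states (unit width, shifted minima); the true Delta_f is 0
N = 1000
x = np.concatenate([rng.normal(0.0, 1.0, N),    # samples from state 0
                    rng.normal(1.0, 1.0, N)])   # samples from state 1
u_kn = np.vstack([0.5 * (x - 0.0) ** 2,         # reduced potential in state 0
                  0.5 * (x - 1.0) ** 2])        # reduced potential in state 1
N_k = np.array([N, N])

# bootstrap-based uncertainties, as I understand the pymbar 4 API
mbar = MBAR(u_kn, N_k, n_bootstraps=200)
results = mbar.compute_free_energy_differences(uncertainty_method="bootstrap")
print(results["Delta_f"][0, 1], results["dDelta_f"][0, 1])
```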

I feel that this is sufficient for the moment, especially as TI can get error estimates by other means.

Therefore, I am closing this PR, although of course anyone is welcome to resurrect it and take up work on it again.

orbeckst avatar Jun 06 '23 22:06 orbeckst