
how to modularize `PredictionEnsemble.bootstrap()`

Open aaronspring opened this issue 5 years ago • 5 comments

we need bootstrapping to get a measure of significance, but skill itself we want to calculate from the original (not resampled) data.

we agree that, from a computation point of view, we first resample and then calc skill on the resampled data. how do we want to implement this in our classes?

I see two ways with iterations=100:

    1. create a second PredictionEnsemble object pmb and calc the p value there
pm = climpred.classes.PerfectModelEnsemble(ds) #similar for HindcastEnsemble
pm = pm.add_control(control)
 
# create a new pm object with iteration dimension
resample_func=resample
pm = pm.generate_uninitialized()
# pm.bootstrap will handle this
pmb = pm.copy()
pmb._datasets['initialized'] = resample_func(pmb._datasets['initialized'], iterations)
pmb._datasets['uninitialized'] = resample_func(pmb._datasets['uninitialized'], iterations)
...
skill = pm.verify()
rskill = pmb.verify(reference='historical')
pvalue = func(rskill)

pro: easy to rerun pm and pmb separately; con: a second object is created

    2. resample 99 iterations and keep the original data as the first iteration (100 iterations total)
pm = climpred.classes.PerfectModelEnsemble(ds) #similar for HindcastEnsemble
pm = pm.add_control(control)
 
# create a new pm object with iteration dimension
resample_func=resample
pm = pm.generate_uninitialized()
# pm.bootstrap() will handle this
pm._datasets['initialized'] = xr.concat([pm._datasets['initialized'], resample_func(pm._datasets['initialized'], iterations - 1)], 'iteration')
pm._datasets['uninitialized'] = resample_func(pm._datasets['uninitialized'], iterations)
...
skill = pm.isel(iteration=0).verify()
rskill = pm.isel(iteration=slice(1,None)).verify(reference='historical')
pvalue = func(rskill)

pro: only one object; con: annoying to remember that the first iteration holds the original data

Comments:

  • pros and cons are very casual
  • Likely, we won't name it pm.bootstrap anyway but rather pm.resample(iterations)
  • I think this question is also independent of the refactoring, as we have to deal with this from a code design point of view anyway

Your opinions? @bradyrx @ahuang11

aaronspring avatar Jun 05 '20 19:06 aaronspring

@aaronspring, is this still relevant? Just going through Github notifications now. Learning to use them since my email inbox just gets piled up with Github stuff.

To re-iterate an exchange with Steve Yeager, which captures my goal for modularizing the bootstrapping methods:

Thanks a lot for your talk last week. I think it was extremely useful, and it's great to see the progress you and Aaron have made. I like the idea of several basic bootstrapping methods that the user can daisy-chain together. Your proposed functions below sound good, and from my perspective, a basic toolkit would include methods to:

  1. resample across initialization and member dimensions
     - option for resampling with or without replacement
     - option for including a "block size" when resampling across the initialization dimension, to keep the indices contiguous over a given block of N years
  2. compute skill scores on resampled arrays, yielding a pdf of scores for each dimension that doesn't collapse upon computation of skill (e.g., lat, lon, depth, etc)
  3. test null hypothesis given skill score pdf
  4. determine field significance given maps of score pdfs
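Item 1 of the toolkit above could be sketched roughly as follows. `block_resample_init` is a hypothetical helper, not climpred API; this sketch always draws block starts with replacement and only returns index arrays, leaving the actual selection to the caller:

```python
import numpy as np

def block_resample_init(n_init, iterations, block_size=1, rng=None):
    """Hypothetical helper (not climpred API): draw resampled init
    indices with replacement, keeping blocks of `block_size`
    consecutive inits contiguous."""
    rng = np.random.default_rng(rng)
    n_blocks = int(np.ceil(n_init / block_size))
    # draw a random start index for each block, with replacement
    starts = rng.integers(0, n_init - block_size + 1,
                          size=(iterations, n_blocks))
    # expand each block start into block_size consecutive indices
    idx = (starts[..., None] + np.arange(block_size)).reshape(iterations, -1)
    return idx[:, :n_init]  # trim in case block_size does not divide n_init

idx = block_resample_init(n_init=10, iterations=5, block_size=2, rng=0)
# idx has shape (iterations, n_init); adjacent column pairs are consecutive inits
```

Resampling with block_size > 1 preserves the autocorrelation structure along the init dimension within each block, which is the point of the "block size" option.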

Not sure if I already forwarded you the fortran code I used for Figs. 1,2 of Yeager et al. (2018)? /glade/u/home/yeager/ncl/WRAPIT_lib/DPSTATS/DP_corr_xyz_pers_gea13.hilo.OMP.pval.f90 Note that my resampling is performed in index space, so I'm not actually creating copies of the basic data array. From your message below, it sounded like you are creating a copy of the basic data array for each resampling iteration, which could be why large iterations are giving you so much trouble?
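The index-space idea can be illustrated with plain numpy on toy data (none of this is climpred internals): only integer indices are stored per iteration, and each resampled view is materialized transiently inside the loop rather than as one big (iterations, init, member) copy:

```python
import numpy as np

rng = np.random.default_rng(42)
# toy (init, member) array standing in for the data behind verify()
data = rng.normal(size=(20, 5))

iterations = 100
n_init = data.shape[0]

# index-space resampling: store only integer indices, not resampled copies
idx = rng.integers(0, n_init, size=(iterations, n_init))

# apply the indices on the fly; only one resampled view exists at a time
stat = np.array([data[idx[i]].mean() for i in range(iterations)])
```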

bradyrx avatar Sep 09 '20 14:09 bradyrx

I think this is still relevant. Currently we just have my do-it-all bootstrap function. Before breaking it down into smaller units I think we need to address my post above.

aaronspring avatar Sep 09 '20 20:09 aaronspring

I think we first need a plan for how the API should finally look.

My proposal: the goal is to get skill from hindcast.verify(...) as now, but also a p value for whether it is significant, based on Goddard.

skill = hindcast.verify(...) # no iteration
hindcast_resampled = hindcast.resample_iterations(replace=True, iterations=10, blocksize=maxlead) # return iterations also for uninitialized if present
resampled_skill = hindcast_resampled.verify()
p_or_ci = resampled_skill.get_p_value()
# shortcut
p = hindcast.resample_iterations(replace=True, iterations=10).verify(...).get_p_value()
skill.where(p).plot() # mask and plot

For the above implementation we only need to add resample_iterations() (the name resample is already taken) and get_p_value() to PredictionEnsemble, which we basically have in functional form. When I think about it now, it doesn't look too difficult.
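In functional form, get_p_value() could look roughly like this one-sided test; the helper name, signature, and fake data are assumptions for illustration, not the actual climpred implementation:

```python
import numpy as np

def get_p_value(resampled_skill, threshold=0.0):
    """Hypothetical helper: one-sided p value as the fraction of
    resampled skill values at or below `threshold` (zero skill)."""
    resampled_skill = np.asarray(resampled_skill)
    return (resampled_skill <= threshold).mean(axis=0)

# fake iteration dimension: resampled skill centered well above zero
rng = np.random.default_rng(1)
resampled = rng.normal(loc=0.3, scale=0.1, size=1000)
p = get_p_value(resampled)  # small p -> skill significantly above zero
```

With axis=0 over the iteration dimension, the same sketch would return a p value per grid point for maps of scores.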

I think we should have this clean BEFORE implementing this feature into climpred-skeleton, because these are connected but still separable challenges.

aaronspring avatar Sep 30 '20 14:09 aaronspring

all this resampling works nicely for resample_dim='member' but not for resample_dim='init', which is more important

aaronspring avatar Aug 08 '21 16:08 aaronspring

why not speed this up by doing mean(dim) after resample(verify(dim=[]))? It's a much faster way to bootstrap compared to computing verify on a new dimension or in a loop. It doesn't work for pearson_r though.

# if resample_dim == 'init' and 'init' in dim and metric != 'pearson_r'
resample_dim = 'init'
skill = he.generate_uninitialized().verify(metric=metric, comparison='m2o', dim=[], reference=references, alignment=alignment)
resampled = resample_func(skill, iterations, dim=resample_dim).mean(dim)
resampled.sst.plot(hue='iteration', col='skill')
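The reason the shortcut works for some metrics but not pearson_r: a metric that is a plain mean over per-init terms (e.g. MSE) gives the same answer whether you resample the data and recompute, or resample the precomputed per-init terms and average; pearson_r involves means and standard deviations that must be recomputed on each resample. A toy numpy check (not climpred code):

```python
import numpy as np

rng = np.random.default_rng(0)
fcst, obs = rng.normal(size=100), rng.normal(size=100)
idx = rng.integers(0, 100, size=100)  # one resampling draw along init

# MSE is a mean over per-init squared errors, so resampling the
# precomputed errors and averaging equals the MSE of resampled data
per_init_sq_err = (fcst - obs) ** 2
mse_fast = per_init_sq_err[idx].mean()
mse_slow = ((fcst[idx] - obs[idx]) ** 2).mean()
assert np.isclose(mse_fast, mse_slow)

# pearson_r is not a mean over fixed per-init terms: its means and
# standard deviations change under resampling
r_slow = np.corrcoef(fcst[idx], obs[idx])[0, 1]
per_init_prod = (fcst - fcst.mean()) * (obs - obs.mean())
r_fast = per_init_prod[idx].mean() / (fcst.std() * obs.std())
# r_fast and r_slow generally differ
```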

[image: plot of resampled skill, hue='iteration', col='skill']

aaronspring avatar Feb 16 '22 15:02 aaronspring