
Bootstrap prediction interval

Open david-waterworth opened this issue 4 years ago • 9 comments

I'm trying to bootstrap the prediction interval for an sklearn style non-parametric estimator (specifically catboost regression). I want the prediction interval for the sum of n predictions (the data has time order). I'm using the method from Davison and Hinkley "Bootstrap Methods and their Applications" algorithm 6.4 i.e.

fit the model to X,y with residuals e
repeat R times
  X* = X, e*=sample(e), y*=F(X*) + e*
  fit y*=F'(X*)
  repeat M times
     e+=sample(e,n)
     Yhat-Y+ = sum(F(X*)-(F'(X*)+e+))

As far as I can see, there's no convenience wrapper for a prediction interval similar to bs.conf_int()?

Also, like in https://github.com/bashtage/arch/issues/165, I only want the first n samples for e+ - I assume I still have to use the workaround from that issue?

My main question though is how to implement the two loops? I don't think creating two separate CircularBlockBootstrap() instances is correct, since the random state of the inner bootstrap would be reset to whatever the default seed is. I think what I should do is create one bootstrap instance and use it as follows:

from arch.bootstrap import CircularBlockBootstrap

bs = CircularBlockBootstrap(block_size, e)
R = 10
M = 10

# one draw for e* plus M draws for e+ in each of the R outer replications
it = bs.bootstrap(R * (M + 1))

for r in range(R):
    pos, _ = next(it)
    estar = pos[0]              # e*: build y* = F(X) + e* and refit

    for m in range(M):
        pos, _ = next(it)
        eplus = pos[0][:n]      # e+: only the first n (forecast horizon) shocks

My issue though is when I sample estar and eplus: because the data is blocked, I need to ensure the block boundaries are the same. Specifically, if the block size is one week and X and X+ don't both start on the same hour of the week, then sampling from the same CircularBlockBootstrap() object will result in misalignment of the eplus samples.

Edit: As is usually the case, taking the time to write out the problem gave me enough time to think it through. I compute the offset from the start of e to the same time of week as the start of X+, and then simply sample as follows:

    eplus = pos[0][offset:n+offset]  

david-waterworth avatar Nov 09 '19 04:11 david-waterworth

If e has seasonality in it, then CBB and X* = X, e*=sample(e), y*=F(X*) + e* won't build a time series with the same properties as the original data. In particular, it won't have the seasonality in its residuals. So this might be a problem.

I don't think the second (inside) set of shocks need to be from the same bootstrap as the original one, so you could use an independent bootstrap here.

If you have seasonality in your residuals, then you need to make sure they are always aligned in both the series generation and the forecast errors. Conceptually you could think of a seasonal time series in seasonal time. Suppose you had 2400 obs with a 24-hour seasonality so that you have 100 full periods. You can think of this as 100 observations of 24 hours. You could then CBB the 100 observations, use this to get the starting index of each seasonal period, and then sample from the original data. For example, if a CBB gave you

4,5,6,7,1,2,3

Then you would use obs

4*24, 4*24+1,...4*24+23,5*24,...,5*24+23,6*24,...,6*24+23,7*24,...,7*24+23,1*24,1*24+1,...,1*24+23,...
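
In code, that index expansion might look something like this (untested sketch, using the toy numbers above):

import numpy as np

period = 24                                    # hours per seasonal period
period_idx = np.array([4, 5, 6, 7, 1, 2, 3])   # day indices returned by the CBB

# expand each day index into the 24 hourly observation indices it covers
hour_idx = (period_idx[:, None] * period + np.arange(period)).ravel()

# e[hour_idx] then gives residuals resampled one day at a time, in seasonal alignment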

bashtage avatar Nov 09 '19 09:11 bashtage

If you don't have seasonality, then I think you could do something like

  1. Create your bootstrap for e
  2. Sample t + m * n shocks, where t is the original sample size, m is the number of internal bootstraps and n is the forecast horizon
  3. Compute the value to return from the first t shocks (which regenerate an original-like sample) and the m forecasts of n observations each

This would turn it into a "standard" problem. Essentially you would create all of the required shocks

all_e = [estar;eplus[0], eplus[1],...,eplus[m-1]]

in a single go from the bootstrap sampler.

This would then be run R times to produce the confidence interval(s).
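
A rough, untested sketch of that layout (e is the residual series; block_size, n, M and R are placeholders for the block length, forecast horizon, number of internal bootstraps and outer replications; the index array is wrapped so a single draw yields t + M*n shocks):

import numpy as np
from arch.bootstrap import CircularBlockBootstrap

t = len(e)
need = t + M * n                        # shocks required per outer replication

# bootstrap the residual *indices*, wrapped so one draw is long enough
idx = np.arange(need) % t
bs = CircularBlockBootstrap(block_size, idx)

for pos, _ in bs.bootstrap(R):
    shocks = e[pos[0]]                  # all shocks for this replication in one go
    estar = shocks[:t]                  # regenerate the original-like sample
    eplus = shocks[t:].reshape(M, n)    # M blocks of n forecast-horizon shocks
    # ... refit on (X, F(X) + estar) and accumulate the M forecast errors ...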

bashtage avatar Nov 09 '19 10:11 bashtage

Thanks, I realised I made an incorrect assumption about a property of the CBB. My data does have daily seasonality. I had assumed the CBB essentially did a block random shuffle, a bit like sklearn.model_selection.GroupKFold.

Hence, if each block represents 24 hours, the start of each block would always be 0 hours.

If I understand you correctly, I think this is equivalent to what you suggest? I convert the vector of length 24*365 (i.e. 1 year of hourly observations) to a 365x24 matrix (one row per day); that way the CBB will behave like the GroupKFold, i.e.

# simulate a series of length 16 with 4-sample seasonality, one row per period
import numpy as np
from arch.bootstrap import CircularBlockBootstrap

bs = CircularBlockBootstrap(1, np.array([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]))
for pos, kw in bs.bootstrap(3):
    print(pos[0])

[with a block size of one is this basically the same as IIDBootstrap?]

david-waterworth avatar Nov 10 '19 00:11 david-waterworth

This is correct - sampling the data as vectors (one row per seasonal period) will capture the seasonality. The only other change is that you need to adjust the bandwidth so that it is in seasonal periods, rather than in observation periods.

bashtage avatar Nov 10 '19 00:11 bashtage

Thanks, I'm not sure what you mean by adjusting the bandwidth - do you mean the first argument to CircularBlockBootstrap (which I have set to 1)?

I will have to ensure that I align the series generation and the forecast errors - I believe I can do this by rotating the vector I'm sampling from before reshaping it into matrix form, so that the first element has the same time of day as the first element in the corresponding x* and x+ matrices.
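
Something like this, I think (untested; offset here is the number of hours between the start of e and the desired starting hour of day, and it assumes len(e) is a multiple of 24):

import numpy as np

# rotate so the residual series starts at the matching time of day,
# then reshape to one row per day for the block bootstrap
e_aligned = np.roll(e, -offset)
e_days = e_aligned.reshape(-1, 24)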

david-waterworth avatar Nov 10 '19 01:11 david-waterworth

Yes, the BW is 1 in your example. A BW of 1 is an IID bootstrap. In your problem, the BW should reflect any data dependence across seasonal periods. If days are effectively independent, then you can use 1 (or IIDBootstrap on the array). If there is some heteroskedasticity or other dependence across days, then you should use a larger number.
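
For example (illustrative only; e_days is the reshaped day-by-hour residual matrix and a block of 7 days is an arbitrary choice):

from arch.bootstrap import CircularBlockBootstrap

# a block of 7 rows (days) keeps week-long runs of days together,
# preserving any dependence across adjacent days
bs = CircularBlockBootstrap(7, e_days)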

Yes, you will need to align them in your function that outputs this quantity: Yhat-Y+ = sum(F(X*)-(F'(X*)+e+))

bashtage avatar Nov 10 '19 08:11 bashtage

If you have the time, it would be great to have a writeup in a notebook of the key steps. You can use fake data (simulated as part of the notebook) if it makes things easier.

bashtage avatar Nov 10 '19 08:11 bashtage

Thanks, yes that makes sense - there is a small multi-day dependency.

In the end I think that even when there's misalignment between x and x+ I still only need a single bootstrap - I just have to select n aligned samples using sample[offset:offset+n]

I've attached a simple example - I've hardcoded the lengths to keep things simple.

prediction-interval.zip

david-waterworth avatar Nov 10 '19 22:11 david-waterworth

Also, for the e* bootstrap, if len(e) isn't a multiple of the seasonality I think it makes (some) sense to pad the tail with the head (i.e. wrap it around), and then when sampling e* you have to select len(x) values from the bootstrap (since it's now longer than the original length)

import numpy as np

# wrap the head onto the tail so the length is a multiple of 24
# (assumes len(e) % 24 != 0), then reshape to one row per day
e2 = np.pad(e, (0, 24 - len(e) % 24), 'wrap')
e2 = np.reshape(e2, (len(e2) // 24, 24))
print(e2.shape)
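
Then, when drawing samples, something along these lines (untested; R is a placeholder for the number of replications and block size 1 is just illustrative):

from arch.bootstrap import CircularBlockBootstrap

bs = CircularBlockBootstrap(1, e2)
for pos, _ in bs.bootstrap(R):
    # flatten the resampled days back to hours and keep only the original length
    estar = pos[0].ravel()[:len(e)]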

It seems like there are multiple cases where it would be useful to be able to specify the length of the sample?

david-waterworth avatar Nov 10 '19 22:11 david-waterworth