
IndexError When lags is greater than number of steps skforecast==0.4.3

Open JoaquinAmatRodrigo opened this issue 2 years ago • 8 comments

Another beginner question - what are the conditions for refit = True?

I get the error below:

d:\programy\miniconda3\lib\site-packages\skforecast\ForecasterAutoreg\ForecasterAutoreg.py in _recursive_predict(self, steps, last_window, exog)
    405
    406         for i in range(steps):
--> 407             X = last_window[-self.lags].reshape(1, -1)
    408             if exog is not None:
    409                 X = np.column_stack((X, exog[i, ].reshape(1, -1)))

IndexError: index -6 is out of bounds for axis 0 with size 4

In case it matters, this is the input data:

data.shape (50,)
data_train.shape (37,)
data_test.shape (13,)
steps = 13
initial lags: lags = int(data_train.shape[0]*0.4) = 14

The whole grid search looks like this:

forecaster_rf = ForecasterAutoreg(
                    regressor = XGBRegressor(verbosity=1),
                    lags = lags
             )
param_grid = {
            'gamma': [0.5, 1, 1.5, 2, 5],
            'subsample': [0.6, 0.8, 1.0],
            'colsample_bytree': [0.6, 0.8, 1.0],
            'max_depth': np.arange(2, 22, 2)
            }

lags_grid = [6, 12, lags, [1, 3, 6, 12, lags]]

The lags grids below throw an error too:

lags_grid = np.arange(1, 3, 1)
lags_grid = [1]

metric = mean_squared_log_error

results_grid = grid_search_forecaster(
                        forecaster         = forecaster_rf,
                        y                  = data_train,
                        param_grid         = param_grid,
                        steps              = steps,
                        metric             = metric,
                        refit              = True,
                        initial_train_size = int(len(data_train)*0.5),
                        return_best        = True,
                        verbose            = True
                   )

Originally posted by @spike8888 in https://github.com/JoaquinAmatRodrigo/skforecast/issues/137#issuecomment-1108727110

JoaquinAmatRodrigo • Apr 27 '22 08:04

Hi!

Has anyone had the time to look at this problem?

spike8888 • May 05 '22 11:05

Hi @spike8888, this error is probably due to a bug in the piece of code that stores the values of the last window. We are trying to identify and solve it.

JoaquinAmatRodrigo • May 06 '22 10:05

Hi @spike8888,

The error occurs when max_lag is greater than the number of observations used for training. In your example:

max_lag = 12
initial_train_size = 18

Therefore, the number of observations used in fit is 18 - 12 = 6.

Since last_window only stored as many observations as were used in fit, 6 in this case, the function raises an error because it needs the last 12 values to predict step n+1.
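
The failure can be reproduced outside skforecast with plain NumPy. The snippet below is only a sketch: the lag values, the window sizes, and the helper name select_lagged_values are assumptions chosen for illustration, and the helper merely imitates the indexing on line 407 of the traceback rather than the library's actual code.

```python
import numpy as np

# Assumed values for illustration: lags [1, 3, 6, 12, 14] and a stored
# last_window that holds only 4 observations.
lags = np.array([1, 3, 6, 12, 14])
short_window = np.arange(4, dtype=float)


def select_lagged_values(last_window, lags):
    """Imitate the traceback line: pick position -lag for every lag."""
    return last_window[-lags].reshape(1, -1)


# A window shorter than max(lags) cannot be indexed at -max(lags).
try:
    select_lagged_values(short_window, lags)
    error_message = None
except IndexError as exc:
    error_message = str(exc)
print("short window:", error_message)

# A window that keeps the last max(lags) observations works.
full_window = np.arange(18, dtype=float)[-lags.max():]
X = select_lagged_values(full_window, lags)
print("full window:", X.shape)
```

The point of the sketch is that the window handed to the prediction step has to hold at least max(lags) values, regardless of how many rows were used in fit.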

We fixed it in version 0.5.0. We are still developing this version, but you can install it from GitHub by running this in the shell:

pip install git+https://github.com/JoaquinAmatRodrigo/[email protected] 

Please note that some features in this release, like bayesian_search_forecaster, are still under development. Still, whatever you do with the previous versions should work in the new one.

JavierEscobarOrtiz • May 09 '22 19:05

Thank you very much for the answer. I will check it out soon.

spike8888 • May 19 '22 20:05

I checked it out. The error is gone, but it seems there is a stopping rule in the code, which is somewhat dangerous because in my case the grid search stopped after 2 models had been calculated. Please consider displaying a warning that, given the mix of lags and steps, not all combinations will be calculated. Also, is there a function that can return max_lag based on the data?

spike8888 • May 24 '22 11:05

Hello @spike8888, could you show an example of your grid_search? I didn't understand your problem.

Regarding max_lag, the training matrix will have a length equal to len(y) - max_lag. So, in an extreme case, if your series y has 50 data points and you use max_lag = 48, you will only have 2 rows to train your model.
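
As a rough sketch of that arithmetic, the snippet below counts the training rows left for each entry of a lags grid. The helper names n_training_rows, max_lag_of and largest_usable_lag are hypothetical and not part of skforecast; the grid and initial_train_size are taken from the numbers posted earlier in this thread.

```python
def max_lag_of(lags):
    """Largest lag of a grid entry: an int means lags 1..lags, a list is used as given."""
    if isinstance(lags, int):
        return lags
    return max(lags)


def n_training_rows(n_obs, max_lag):
    """Rows of the training matrix: len(y) - max_lag, never below zero."""
    return max(n_obs - max_lag, 0)


def largest_usable_lag(n_obs, min_rows):
    """Hypothetical helper: biggest max_lag that still leaves min_rows training rows."""
    return max(n_obs - min_rows, 0)


initial_train_size = 18
lags_grid = [6, 12, 14, [1, 3, 6, 12, 14]]

# Map every grid entry to the number of rows available in the first fit.
rows_per_entry = {
    str(entry): n_training_rows(initial_train_size, max_lag_of(entry))
    for entry in lags_grid
}
for entry, rows in rows_per_entry.items():
    print(f"lags={entry}: {rows} training rows")

print("extreme case:", n_training_rows(50, 48))
print("largest max_lag leaving 10 rows:", largest_usable_lag(initial_train_size, 10))
```

A check like this, run before the grid search, shows which lag settings leave very few rows for the first fit.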

JavierEscobarOrtiz • May 25 '22 11:05

It seems I do not understand the whole concept of lags. Are they used to predict the next step (the next value I want to predict)? If so, why do we use the whole history for training, which is much longer than the lags?

spike8888 • Jun 11 '22 22:06

Hello @spike8888,

You can find a good explanation of lags and the training matrix in the documentation, or even by googling it.

To summarize, an autoregressive model is trained on the past behavior of the series itself. If you use, for example, lags=3, it will take the 3 values before each point as predictors to train the model. The function create_train_X_y can help you understand this:

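# Imports
# ==============================================================================
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from skforecast.ForecasterAutoreg import ForecasterAutoreg

# Create a forecaster with lags=3
# ==============================================================================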
forecaster = ForecasterAutoreg(
                    regressor = RandomForestRegressor(random_state=123),
                    lags      = 3
             )

# Create a series with 10 points
# ==============================================================================
y = pd.Series(np.arange(10))

display(forecaster.create_train_X_y(y=y)[1])

Then we can print the training matrix.

X:

forecaster.create_train_X_y(y=y)[0]

   lag_1  lag_2  lag_3
3      2      1      0
4      3      2      1
5      4      3      2
6      5      4      3
7      6      5      4
8      7      6      5
9      8      7      6

y:

forecaster.create_train_X_y(y=y)[1]

   y
3  3
4  4
5  5
6  6
7  7
8  8
9  9
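
To make the construction explicit, here is a minimal pure-Python sketch of how such a lag matrix can be built by hand. The function make_lag_matrix is a hypothetical helper written for this explanation, not skforecast's implementation; it only reproduces the same numbers for the 10-point series.

```python
def make_lag_matrix(y, lags):
    """Build predictors and targets for an autoregressive model.

    Row t holds y[t - lag] for every lag, and the target is y[t].
    The first max(lags) points have no complete history, so they are skipped.
    """
    max_lag = max(lags)
    X = [[y[t - lag] for lag in lags] for t in range(max_lag, len(y))]
    target = [y[t] for t in range(max_lag, len(y))]
    return X, target


y = list(range(10))    # same series as above: 0, 1, ..., 9
lags = [1, 2, 3]       # equivalent to lags=3

X, target = make_lag_matrix(y, lags)
for row, value in zip(X, target):
    print(row, "->", value)
```

Every row uses only the 3 previous values as predictors, but each point of the history (after the first 3) contributes one row. That is why the whole history is used for training even though a single prediction only needs the last max_lag values.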

JavierEscobarOrtiz • Jun 13 '22 08:06

Fixed it in version 0.5.0.

JoaquinAmatRodrigo • Sep 24 '22 09:09