
Bayesian Optimization

Open CalenDario13 opened this issue 2 years ago • 10 comments

I am trying to tune the model using scikit-optimize, but a number of errors come up. I think it would be a good idea to implement Bayesian search in this library too.

CalenDario13 avatar Apr 07 '22 09:04 CalenDario13

Hello @CalenDario13!

Bayesian search is not currently available in skforecast. We intend to implement it soon, in version 0.5.

Thank you for your comments!

JavierEscobarOrtiz avatar Apr 07 '22 10:04 JavierEscobarOrtiz

Hi @CalenDario13. Would you recommend scikit-optimize as the way to go when applying Bayesian optimization to tune sklearn models?

JoaquinAmatRodrigo avatar Apr 07 '22 10:04 JoaquinAmatRodrigo

@JoaquinAmatRodrigo Most of the time I use Optuna because it is more flexible, mainly because it is often hard to find a solution that fits well with less-known libraries. scikit-optimize, however, is very well integrated with scikit-learn, and since skforecast's purpose is to bring time series forecasting to scikit-learn, making scikit-optimize usable would be a plus (and a big advantage over other competitors), especially if it also allowed tuning skforecast parameters such as steps and lags.

I am not an expert in building and maintaining libraries, but I wish to learn more. So if you think you need help with this, I would be glad to help.
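For reference, here is a rough sketch of what such an integration could look like by wrapping skforecast's backtesting in a scikit-optimize objective. This is only an illustration, not an existing skforecast feature: it assumes the 0.4.x ForecasterAutoreg / backtesting_forecaster API used later in this thread, and y and initial_train_size are placeholders the user would define.

# Sketch: Bayesian optimization of a ForecasterAutoreg with scikit-optimize.
# y (a pandas Series) and initial_train_size are assumed to be defined.
from sklearn.ensemble import RandomForestRegressor
from skforecast.ForecasterAutoreg import ForecasterAutoreg
from skforecast.model_selection import backtesting_forecaster
from skopt import gp_minimize
from skopt.space import Integer
from skopt.utils import use_named_args

search_space = [
    Integer(50, 200, name='n_estimators'),
    Integer(1, 32, name='max_depth'),
    Integer(10, 20, name='lags'),
]

@use_named_args(search_space)
def objective(n_estimators, max_depth, lags):
    # Rebuild the forecaster on each call so that lags can be tuned too
    forecaster = ForecasterAutoreg(
        regressor=RandomForestRegressor(
            n_estimators=n_estimators, max_depth=max_depth, random_state=123
        ),
        lags=lags
    )
    metric, _ = backtesting_forecaster(
        forecaster=forecaster,
        y=y,
        initial_train_size=initial_train_size,
        steps=10,
        metric='mean_squared_error',
        refit=False,
        verbose=False
    )
    return float(metric)

result = gp_minimize(objective, search_space, n_calls=15, random_state=123)
print(result.x, result.fun)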

CalenDario13 avatar Apr 07 '22 14:04 CalenDario13

Very useful input, @CalenDario13, thanks a lot! We are now planning the new features to include in release 0.5, and this one seems to fit well. I will get back to you when I start coding.

JoaquinAmatRodrigo avatar Apr 13 '22 07:04 JoaquinAmatRodrigo

@CalenDario13 do you have any example of using Optuna with skforecast, grid_search_forecaster in particular?

spike8888 avatar Apr 23 '22 12:04 spike8888

Hi @spike8888, @CalenDario13,

Here is an example of using Optuna with skforecast 0.4.3. For the search, I use backtesting_forecaster as validation (the same validation used in grid_search_forecaster).

# Libraries
# ==============================================================================
import pandas as pd

# Download data
# ==============================================================================
url = ('https://raw.githubusercontent.com/JoaquinAmatRodrigo/skforecast/master/data/h2o.csv')
data = pd.read_csv(url, sep=',', header=0, names=['y', 'datetime'])

# Data preprocessing
# ==============================================================================
data['datetime'] = pd.to_datetime(data['datetime'], format='%Y/%m/%d')
data = data.set_index('datetime')
data = data.asfreq('MS')
data = data[['y']]
data = data.sort_index()

# Train-validation dates
# ==============================================================================
end_train = '2002-01-01 23:59:00'

print(f"Train dates      : {data.index.min()} --- {data.loc[:end_train].index.max()}  (n={len(data.loc[:end_train])})")
print(f"Validation dates : {data.loc[end_train:].index.min()} --- {data.index.max()}  (n={len(data.loc[end_train:])})")

Train dates      : 1991-07-01 00:00:00 --- 2002-01-01 00:00:00  (n=127)
Validation dates : 2002-02-01 00:00:00 --- 2008-06-01 00:00:00  (n=77)

Here is the objective function using backtesting_forecaster for a RandomForestRegressor:

# Libraries
# ==============================================================================
import optuna
from sklearn.ensemble import RandomForestRegressor
from skforecast.ForecasterAutoreg import ForecasterAutoreg
from skforecast.model_selection import backtesting_forecaster

forecaster = ForecasterAutoreg(
                regressor = RandomForestRegressor(random_state=123),
                lags      = 15
             )

y          = data['y']
initial_train_size = len(data.loc[:end_train])
fixed_train_size   = False
steps      = 10
metric     = 'mean_squared_error'
refit      = True
verbose    = False

def objective(trial,
              forecaster = forecaster,
              y          = y,
              initial_train_size = initial_train_size,
              fixed_train_size   = fixed_train_size,
              steps      = steps,
              metric     = metric,
              refit      = refit,
              verbose    = verbose):
    
    n_estimators = trial.suggest_int('n_estimators', 50, 200)
    max_depth = trial.suggest_float('max_depth', 1, 32, log=True)
    lags = trial.suggest_int('lags', 10, 20)
    
    forecaster = ForecasterAutoreg(
                regressor = RandomForestRegressor(random_state=123,
                                                  n_estimators=n_estimators,
                                                  max_depth=max_depth),
                lags      = lags
             )
    
    metric, predictions_backtest = backtesting_forecaster(
                                    forecaster = forecaster,
                                    y          = y,
                                    initial_train_size = initial_train_size,
                                    fixed_train_size   = fixed_train_size,
                                    steps      = steps,
                                    metric     = metric,
                                    refit      = refit,
                                    verbose    = verbose
                                   )
    return abs(metric)

And then, the study:

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=15)

trial = study.best_trial

print('Accuracy: {}'.format(trial.value))
print("Best hyperparameters: {}".format(trial.params))

... [I 2022-04-24 11:23:01,545] Trial 14 finished with value: 0.008557239946036514 and parameters: {'n_estimators': 142, 'max_depth': 26.51220993361588, 'lags': 16}. Best is trial 13 with value: 0.008054432438600578.

Accuracy: 0.008054432438600578
Best hyperparameters: {'n_estimators': 112, 'max_depth': 4.25542734578409, 'lags': 17}

trial

FrozenTrial(number=13, values=[0.008054432438600578], datetime_start=datetime.datetime(2022, 4, 24, 11, 22, 57, 247950), datetime_complete=datetime.datetime(2022, 4, 24, 11, 22, 59, 44261), params={'n_estimators': 112, 'max_depth': 4.25542734578409, 'lags': 17}, distributions={'n_estimators': IntUniformDistribution(high=200, low=50, step=1), 'max_depth': LogUniformDistribution(high=32.0, low=1.0), 'lags': IntUniformDistribution(high=20, low=10, step=1)}, user_attrs={}, system_attrs={}, intermediate_values={}, trial_id=13, state=TrialState.COMPLETE, value=None)
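If you want to reuse the best configuration found by the study, one option (just a sketch, reusing the objects defined above) is to rebuild and fit a forecaster from trial.params. Note that scikit-learn expects an integer max_depth, so the sampled float is cast here:

# Rebuild the forecaster with the best hyperparameters found by Optuna
best = trial.params

forecaster = ForecasterAutoreg(
                regressor = RandomForestRegressor(
                                random_state = 123,
                                n_estimators = best['n_estimators'],
                                max_depth    = int(round(best['max_depth']))  # sklearn expects an int
                            ),
                lags      = best['lags']
             )

# Refit on the full series before forecasting future values
forecaster.fit(y=data['y'])
predictions = forecaster.predict(steps=10)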

We are working on a better implementation for Skforecast 0.5.0!

JavierEscobarOrtiz avatar Apr 24 '22 09:04 JavierEscobarOrtiz

Thank you so much. You made my day!

spike8888 avatar Apr 24 '22 14:04 spike8888

Another beginner question - what are the conditions for refit = True?

I get the error below:

d:\programy\miniconda3\lib\site-packages\skforecast\ForecasterAutoreg\ForecasterAutoreg.py in _recursive_predict(self, steps, last_window, exog)
    405
    406         for i in range(steps):
--> 407             X = last_window[-self.lags].reshape(1, -1)
    408             if exog is not None:
    409                 X = np.column_stack((X, exog[i, ].reshape(1, -1)))

IndexError: index -6 is out of bounds for axis 0 with size 4

In case it matters on the input side, I have the following data:

data.shape       (50,)
data_train.shape (37,)
data_test.shape  (13,)
steps = 13
initial lags: lags = int(data_train.shape[0]*0.4) = 14

The whole grid search looks like this:

forecaster_rf = ForecasterAutoreg(
                    regressor = XGBRegressor(verbosity=1),
                    lags = lags
             )
param_grid = {
            'gamma': [0.5, 1, 1.5, 2, 5],
            'subsample': [0.6, 0.8, 1.0],
            'colsample_bytree': [0.6, 0.8, 1.0],
            'max_depth': np.arange(2, 22, 2)
            }

lags_grid = [6, 12, lags, [1, 3, 6, 12, lags]]

The lags below throw an error too:

lags_grid = np.arange(1, 3, 1)
lags_grid = [1]

metric = mean_squared_log_error

results_grid = grid_search_forecaster(
                        forecaster         = forecaster_rf,
                        y                  = data_train,
                        param_grid         = param_grid,
                        steps              = steps,
                        metric             = metric,
                        refit              = True,
                        initial_train_size = int(len(data_train)*0.5),
                        return_best        = True,
                        verbose            = True
                   )

spike8888 avatar Apr 25 '22 15:04 spike8888

Hello @spike8888,

With this info, the only error that I see is that you didn't pass the lags_grid argument to grid_search_forecaster. If I haven't misunderstood your code, this should work:

lags = int(data_train.shape[0]*0.4) 
lags_grid = [6, 12, lags, [1, 3, 6, 12, lags], np.arange(1, 3, 1), 1]

metric = mean_squared_log_error

results_grid = grid_search_forecaster(
                        forecaster         = forecaster_rf,
                        y                  = data_train,
                        param_grid         = param_grid,
                        lags_grid          = lags_grid,
                        steps              = steps,
                        metric             = metric,
                        refit              = True,
                        initial_train_size = int(len(data_train)*0.5),
                        return_best        = True,
                        verbose            = True
                   )

You can visit the documentation for grid_search_forecaster here:

https://joaquinamatrodrigo.github.io/skforecast/latest/notebooks/grid-search-forecaster.html

Regarding backtesting (the validation used in grid_search_forecaster), refit=True doesn't require any additional configuration. You can find a good explanation here:

https://joaquinamatrodrigo.github.io/skforecast/latest/notebooks/backtesting.html
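As an illustration (a sketch reusing the forecaster_rf and data_train from your snippet, with the built-in 'mean_squared_error' metric for simplicity), the only difference between the two modes is whether the forecaster is retrained at every backtesting fold:

from skforecast.model_selection import backtesting_forecaster

# refit=True: retrain the forecaster at the start of every backtesting fold
metric_refit, _ = backtesting_forecaster(
                      forecaster         = forecaster_rf,
                      y                  = data_train,
                      initial_train_size = int(len(data_train)*0.5),
                      steps              = 13,
                      metric             = 'mean_squared_error',
                      refit              = True,
                      verbose            = False
                  )

# refit=False: train once on the initial window, then only predict
metric_no_refit, _ = backtesting_forecaster(
                         forecaster         = forecaster_rf,
                         y                  = data_train,
                         initial_train_size = int(len(data_train)*0.5),
                         steps              = 13,
                         metric             = 'mean_squared_error',
                         refit              = False,
                         verbose            = False
                     )

print(metric_refit, metric_no_refit)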

JavierEscobarOrtiz avatar Apr 25 '22 18:04 JavierEscobarOrtiz

Hi @JavierEscobarOrtiz,

Thanks for reviewing my code. My mistake, there obviously was a lags_grid in my code originally.

I did some testing and this range of lags works: lags_grid = [np.arange(1, lags-4, 1)]

More than 10 lags throws an error. Is there any condition (limit) on the number of lags, lags_grid vs. steps, training length, etc.? I didn't find it in the documentation.

I also noticed that 10 lags calculates relatively quickly, while more than 10 (like the max lags in my case, 14) takes quite a long time and then throws an error.

spike8888 avatar Apr 25 '22 19:04 spike8888

Feature added in version 0.5.0.

https://joaquinamatrodrigo.github.io/skforecast/latest/user_guides/hyperparameter-tuning-and-lags-selection.html#bayesian-search
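For later readers, the feature is exposed through bayesian_search_forecaster in skforecast.model_selection. Below is an illustrative sketch based on the linked guide; the search_space format, extra arguments, and return value have changed across versions, so check the documentation of the version you are using rather than taking this signature literally.

from skforecast.model_selection import bayesian_search_forecaster

# Optuna-style search space: a function that receives a trial and returns
# the sampled hyperparameters; 'lags' can be tuned together with the regressor.
def search_space(trial):
    return {
        'n_estimators': trial.suggest_int('n_estimators', 50, 200),
        'max_depth'   : trial.suggest_int('max_depth', 2, 32),
        'lags'        : trial.suggest_int('lags', 10, 20),
    }

# Depending on the version, this returns a results DataFrame or a
# (DataFrame, best_trial) tuple.
results = bayesian_search_forecaster(
              forecaster         = forecaster,
              y                  = data['y'],
              search_space       = search_space,
              steps              = 10,
              metric             = 'mean_squared_error',
              initial_train_size = len(data.loc[:end_train]),
              refit              = True,
              n_trials           = 15,
              return_best        = True
          )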

JoaquinAmatRodrigo avatar Sep 24 '22 09:09 JoaquinAmatRodrigo