Bayesian Optimization
I am trying to tune the model using scikit-optimize, but a bunch of errors come up. I think it would be a good idea to implement Bayesian search for this library too.
Hello @CalenDario13!
Bayesian search is not currently available in skforecast. Our intention is to implement it soon, in version 0.5.
Thank you for your comments!
Hi @CalenDario13, would you recommend scikit-optimize as the way to go when applying Bayesian optimization to tune sklearn models?
@JoaquinAmatRodrigo I use Optuna most of the time because it is more flexible, but mainly because it is often hard to find a solution that fits well with less-known libraries. I think scikit-optimize is very well integrated with scikit-learn, and since skforecast's purpose is to bring scikit-learn to time series, making scikit-optimize usable would be a plus (and a huge advantage over competitors), including the option to also tune skforecast parameters such as steps and lags.
I am not an expert in building and maintaining libraries, but I wish to learn more. So, if you think you need help with that, I may be able to help you.
Very useful input @CalenDario13, thanks a lot! We are now planning the new features that will be included in release 0.5, and this one seems to fit well. I will get back to you when I start coding.
@CalenDario13 do you have any example of using Optuna with skforecast, grid_search_forecaster in particular?
Hi @spike8888, @CalenDario13,
Here is an example of using Optuna with skforecast 0.4.3. For the search, I use backtesting_forecaster as validation (the same validation that is used in grid_search_forecaster).
# Download data
# ==============================================================================
import pandas as pd

url = 'https://raw.githubusercontent.com/JoaquinAmatRodrigo/skforecast/master/data/h2o.csv'
data = pd.read_csv(url, sep=',', header=0, names=['y', 'datetime'])
# Data preprocessing
# ==============================================================================
data['datetime'] = pd.to_datetime(data['datetime'], format='%Y/%m/%d')
data = data.set_index('datetime')
data = data.asfreq('MS')
data = data[['y']]
data = data.sort_index()
# Train-validation dates
# ==============================================================================
end_train = '2002-01-01 23:59:00'
print(f"Train dates : {data.index.min()} --- {data.loc[:end_train].index.max()} (n={len(data.loc[:end_train])})")
print(f"Validation dates : {data.loc[end_train:].index.min()} --- {data.index.max()} (n={len(data.loc[end_train:])})")
Train dates      : 1991-07-01 00:00:00 --- 2002-01-01 00:00:00 (n=127)
Validation dates : 2002-02-01 00:00:00 --- 2008-06-01 00:00:00 (n=77)
Here is the objective function using backtesting_forecaster for a RandomForestRegressor:
import optuna
from sklearn.ensemble import RandomForestRegressor
from skforecast.ForecasterAutoreg import ForecasterAutoreg
from skforecast.model_selection import backtesting_forecaster

forecaster = ForecasterAutoreg(
                 regressor = RandomForestRegressor(random_state=123),
                 lags      = 15
             )

y = data['y']
initial_train_size = len(data.loc[:end_train])
fixed_train_size   = False
steps              = 10
metric             = 'mean_squared_error'
refit              = True
verbose            = False

def objective(trial,
              forecaster         = forecaster,
              y                  = y,
              initial_train_size = initial_train_size,
              fixed_train_size   = fixed_train_size,
              steps              = steps,
              metric             = metric,
              refit              = refit,
              verbose            = verbose):

    n_estimators = trial.suggest_int('n_estimators', 50, 200)
    max_depth    = trial.suggest_float('max_depth', 1, 32, log=True)
    lags         = trial.suggest_int('lags', 10, 20)

    forecaster = ForecasterAutoreg(
                     regressor = RandomForestRegressor(
                                     random_state = 123,
                                     n_estimators = n_estimators,
                                     max_depth    = max_depth
                                 ),
                     lags      = lags
                 )

    metric, predictions_backtest = backtesting_forecaster(
                                       forecaster         = forecaster,
                                       y                  = y,
                                       initial_train_size = initial_train_size,
                                       fixed_train_size   = fixed_train_size,
                                       steps              = steps,
                                       metric             = metric,
                                       refit              = refit,
                                       verbose            = verbose
                                   )

    return abs(metric)
And then, the study:
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=15)
trial = study.best_trial
print('Accuracy: {}'.format(trial.value))
print("Best hyperparameters: {}".format(trial.params))
... [I 2022-04-24 11:23:01,545] Trial 14 finished with value: 0.008557239946036514 and parameters: {'n_estimators': 142, 'max_depth': 26.51220993361588, 'lags': 16}. Best is trial 13 with value: 0.008054432438600578.
Accuracy: 0.008054432438600578
Best hyperparameters: {'n_estimators': 112, 'max_depth': 4.25542734578409, 'lags': 17}
trial
FrozenTrial(number=13, values=[0.008054432438600578], datetime_start=datetime.datetime(2022, 4, 24, 11, 22, 57, 247950), datetime_complete=datetime.datetime(2022, 4, 24, 11, 22, 59, 44261), params={'n_estimators': 112, 'max_depth': 4.25542734578409, 'lags': 17}, distributions={'n_estimators': IntUniformDistribution(high=200, low=50, step=1), 'max_depth': LogUniformDistribution(high=32.0, low=1.0), 'lags': IntUniformDistribution(high=20, low=10, step=1)}, user_attrs={}, system_attrs={}, intermediate_values={}, trial_id=13, state=TrialState.COMPLETE, value=None)
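A natural follow-up (my own sketch, not part of skforecast or the code above) is to split the best trial's parameters into the forecaster's lags and the regressor's hyperparameters before refitting a final model. Note that max_depth is sampled as a float above, so it may need casting to an int for regressors that require one:

```python
# Hypothetical helper: separate Optuna's best params into the skforecast
# lags argument and the regressor's own hyperparameters.
def split_best_params(best_params):
    params = dict(best_params)
    lags = params.pop('lags')                      # skforecast parameter
    if 'max_depth' in params:                      # sampled as a float above
        params['max_depth'] = int(round(params['max_depth']))
    return lags, params

# Best parameters reported by the study above
best = {'n_estimators': 112, 'max_depth': 4.25542734578409, 'lags': 17}
lags, regressor_params = split_best_params(best)
print(lags)               # 17
print(regressor_params)   # {'n_estimators': 112, 'max_depth': 4}
```

The separated pieces can then be passed to `ForecasterAutoreg(regressor=RandomForestRegressor(**regressor_params, random_state=123), lags=lags)` and fit on the full series.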
We are working on a better implementation for Skforecast 0.5.0!
Thank you so much. You made my day!
Another beginner question - what are the conditions for refit = True?
I get the error below:
d:\programy\miniconda3\lib\site-packages\skforecast\ForecasterAutoreg\ForecasterAutoreg.py in _recursive_predict(self, steps, last_window, exog)
    405
    406     for i in range(steps):
--> 407         X = last_window[-self.lags].reshape(1, -1)
    408         if exog is not None:
    409             X = np.column_stack((X, exog[i, ].reshape(1, -1)))
IndexError: index -6 is out of bounds for axis 0 with size 4
In case the input side matters, I have the following data:
data.shape: (50,)
data_train.shape: (37,)
data_test.shape: (13,)
steps = 13
initial lags: lags = int(data_train.shape[0]*0.4) = 14
The whole grid search looks like this:
import numpy as np
from xgboost import XGBRegressor

forecaster_rf = ForecasterAutoreg(
                    regressor = XGBRegressor(verbosity=1),
                    lags      = lags
                )

param_grid = {
    'gamma': [0.5, 1, 1.5, 2, 5],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'max_depth': np.arange(2, 22, 2)
}

lags_grid = [6, 12, lags, [1, 3, 6, 12, lags]]
The lags grids below throw an error too:
lags_grid = np.arange(1, 3, 1)
lags_grid = [1]
from sklearn.metrics import mean_squared_log_error

metric = mean_squared_log_error

results_grid = grid_search_forecaster(
                   forecaster         = forecaster_rf,
                   y                  = data_train,
                   param_grid         = param_grid,
                   steps              = steps,
                   metric             = metric,
                   refit              = True,
                   initial_train_size = int(len(data_train)*0.5),
                   return_best        = True,
                   verbose            = True
               )
Hello @spike8888,
With this info, the only error I see is that you didn't pass the argument lags_grid to grid_search_forecaster. If I didn't misunderstand your code, this should work:
lags = int(data_train.shape[0]*0.4)
lags_grid = [6, 12, lags, [1, 3, 6, 12, lags], np.arange(1, 3, 1), 1]
metric = mean_squared_log_error

results_grid = grid_search_forecaster(
                   forecaster         = forecaster_rf,
                   y                  = data_train,
                   param_grid         = param_grid,
                   lags_grid          = lags_grid,
                   steps              = steps,
                   metric             = metric,
                   refit              = True,
                   initial_train_size = int(len(data_train)*0.5),
                   return_best        = True,
                   verbose            = True
               )
You can visit the documentation for grid_search_forecaster here:
https://joaquinamatrodrigo.github.io/skforecast/latest/notebooks/grid-search-forecaster.html
Regarding backtesting (the validation used in grid_search_forecaster), refit=True doesn't require any additional configuration. You can find a good explanation here:
https://joaquinamatrodrigo.github.io/skforecast/latest/notebooks/backtesting.html
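As a rough illustration of the idea (my own sketch of the fold layout, not the actual skforecast internals): with refit=True, each fold's model is retrained before predicting the next steps observations, on an expanding window by default, or on a sliding window of constant length when fixed_train_size=True.

```python
# Sketch of backtesting fold boundaries (illustrative, not skforecast code).
def backtest_folds(n, initial_train_size, steps, fixed_train_size=False):
    """Yield (train_start, train_end, test_end) index bounds per fold."""
    folds = []
    train_end = initial_train_size
    while train_end < n:
        test_end = min(train_end + steps, n)
        # expanding window keeps train_start at 0; sliding window keeps
        # the training length constant at initial_train_size
        train_start = train_end - initial_train_size if fixed_train_size else 0
        folds.append((train_start, train_end, test_end))
        train_end = test_end
    return folds

# 50 observations, 20 used for the initial fit, forecasting 10 steps per fold
print(backtest_folds(50, 20, 10))
# expanding window: [(0, 20, 30), (0, 30, 40), (0, 40, 50)]
print(backtest_folds(50, 20, 10, fixed_train_size=True))
# sliding window:   [(0, 20, 30), (10, 30, 40), (20, 40, 50)]
```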
Hi @JavierEscobarOrtiz,
thanks for reviewing my code. My mistake, there obviously was a lags_grid in my code originally.
I did some testing, and this range of lags works: lags_grid = [np.arange(1, lags-4, 1)]
More than 10 lags throws an error. Is there any condition (limit) on the number of lags, lags_grid vs. steps, training length, etc.? I didn't find it in the documentation.
One additional thing I noticed: 10 lags calculate relatively quickly, while more than 10 (like the max lags in my case, 14) take quite a long time and then throw an error.
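For what it's worth, the IndexError quoted earlier can be reproduced in isolation. In the quoted _recursive_predict line, self.lags is an array of lag offsets used to fancy-index last_window, so the message means the window available at prediction time held only 4 observations while a lag of 6 was requested. A minimal reconstruction (my own sketch, only the indexing expression comes from the traceback):

```python
import numpy as np

# The traceback's line fancy-indexes the window with negative lag offsets.
lags = np.array([1, 2, 3, 6])                   # a lag of 6 is requested
last_window = np.array([10., 11., 12., 13.])    # but only 4 values are available

try:
    X = last_window[-lags].reshape(1, -1)       # same expression as line 407
except IndexError as e:
    print(e)  # index -6 is out of bounds for axis 0 with size 4
```

So, whatever the exact cause in this grid search, the window handed to the predictor was shorter than the largest lag; checking that every candidate lag fits inside the smallest training/backtesting window before launching a long search may save time.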
Feature added in version 0.5.0.
https://joaquinamatrodrigo.github.io/skforecast/latest/user_guides/hyperparameter-tuning-and-lags-selection.html#bayesian-search