
Grid search forecaster error:

Open tavlox opened this issue 1 year ago • 3 comments

I am trying to implement grid search to tune the hyperparameters of a RandomForestRegressor, but I get the error below and I do not know exactly where the problem is. I also tried `.iloc` instead of `.loc`, and the error still appears.

KeyError: "None of [Int64Index([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,\n ...\n 666, 667, 668, 669, 670, 671, 672, 673, 674, 675],\n dtype='int64', length=43011)] are in the [index]"

My code:

```python
from data_preparation import Preparation
from missing_timestamps import remove_duplicates
import pandas as pd
from skforecast.ForecasterAutoreg import ForecasterAutoreg
from skforecast.model_selection import grid_search_forecaster
from sklearn.ensemble import RandomForestRegressor
from skforecast.utils import save_forecaster
from skforecast.utils import load_forecaster

# marvin
data = Preparation(r'/home/ieftimska/operato-meteo-1/data/MAS_processed/ELES-MAS-5001.csv.gz', "AMBIENT_TEMPERATURE")
#data = Preparation(r'/home/iva/Desktop/operato-meteo-1/data/MAS_processed/ELES-MAS-5001.csv.gz', "AMBIENT_TEMPERATURE")
train, test = data.split()
train_processed = remove_duplicates(train)
#train_processed_ = train_processed["AMBIENT_TEMPERATURE"].copy().squeeze()
test_processed = remove_duplicates(test)
#test_processed_ = test_processed["AMBIENT_TEMPERATURE"].copy().squeeze()
whole_data = pd.concat([train_processed, test_processed])
whole_data = whole_data.rename(columns={"AMBIENT_TEMPERATURE": "y"})
whole_data.index = whole_data.index.rename("datetime")

forecaster = ForecasterAutoreg(
    regressor=RandomForestRegressor(random_state=123, n_jobs=-1, max_depth=10, n_estimators=100),
    lags=865
)
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [5, 10, 15]
}

# Lags used as predictors
lags_grid = [i for i in range(1, 865)]

results_grid = grid_search_forecaster(
    forecaster=forecaster,
    y=whole_data.loc[:, "y"],
    param_grid=param_grid,
    lags_grid=lags_grid,
    steps=864,
    refit=False,
    metric='mean_squared_error',
    initial_train_size=len(whole_data.loc[:"2022"]),
    fixed_train_size=False,
    return_best=True,
    n_jobs='auto',
    verbose=False,
    show_progress=True
)
results_grid.to_csv("results_grid_search.csv")
```

tavlox · Sep 18 '23 10:09

This is the data, in case someone wants to reproduce the result: ELES-MAS-5001.csv.gz

tavlox · Sep 18 '23 10:09

Hello @tavlox

The problem is probably in `len(whole_data.loc[:"2022"])`. If you are using `.iloc`, you should use an int to access position 2022, not `"2022"`. With `.loc` it depends on your index: if it is a datetime index, you should probably specify something like `"01-01-2022"`.
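To illustrate the difference between label-based and position-based slicing, here is a minimal sketch. It uses a small synthetic frame in place of `whole_data`; the 10-minute frequency, the date range, and the names `train_by_label` and `train_by_position` are assumptions made for the example, not taken from the original data.

```python
import pandas as pd

# Hypothetical stand-in for whole_data: 24 observations at an assumed
# 10-minute frequency, straddling the change from 2022 to 2023.
index = pd.date_range("2022-12-31 22:00", periods=24, freq="10min", name="datetime")
whole_data = pd.DataFrame({"y": range(24)}, index=index)

# .loc slices by label. On a sorted DatetimeIndex, a partial date string
# such as "2022" selects every row up to the end of that year.
train_by_label = whole_data.loc[:"2022"]

# .iloc slices by integer position, so it needs an int, not a string.
# The end position is excluded.
train_by_position = whole_data.iloc[:12]

# Either slice can then be measured to obtain an integer train size.
initial_train_size = len(train_by_label)
print(initial_train_size, len(train_by_position))
```

In this synthetic case both slices contain the same rows. Whether that also holds for the real data depends on the index being a sorted datetime index, which is worth checking before the length is passed as `initial_train_size`.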

JavierEscobarOrtiz · Sep 18 '23 10:09

The same issue still appears, even when I use, for example, a separate train set without `.loc`, i.e. `len(train_set)`.

tavlox · Sep 18 '23 12:09