Good First Issue: Allow `initial_train_size` in `backtesting_forecaster` to accept date values

Open JavierEscobarOrtiz opened this issue 5 months ago • 1 comment

This issue is intended to be solved at Hacktoberfest: https://www.meetup.com/es-ES/pydata-madrid/events/303470661/

Use branch 0.14.x as base.

Summary

Currently, the initial_train_size parameter in the backtesting_forecaster function only accepts an integer value. This integer defines how many observations to use as the initial training set. We would like to extend this functionality so that initial_train_size can also accept a date (e.g., '2020-01-01'). If a date is provided, the function should calculate the appropriate number of observations corresponding to the time window between the start of the data and the given date.

Task

  1. Create an auxiliary function, _preprocess_initial_train_size(y: pd.Series, initial_train_size) in the utils module:
  • initial_train_size can be an integer or any datetime format that pandas allows to be passed to a pd.DatetimeIndex (e.g., string, pandas timestamp...).
  • If y does not have a pd.DatetimeIndex and initial_train_size is not an integer, raise a TypeError with the message: "If y does not have a pd.DatetimeIndex, initial_train_size must be an integer."
  • If the series y has a pd.DatetimeIndex and initial_train_size is a date, return, as an integer, the number of observations between the start of the data and the given date. The given date must be included in the window.
  • If the input initial_train_size is an integer, return the same integer.
  • Create unit tests using pytest in the utils.tests folder.

# Expected behavior
# ==============================================================================
import pandas as pd

y = pd.Series([1, 2, 3, 4, 5], index=pd.date_range('2020-01-01', periods=5, freq='D'))
_preprocess_initial_train_size(y, '2020-01-02') # expected output: 2

y = pd.Series([1, 2, 3, 4, 5], index=pd.date_range('2020-01-01', periods=5, freq='D'))
_preprocess_initial_train_size(y, 2) # expected output: 2

y = pd.Series([1, 2, 3, 4, 5], index=pd.RangeIndex(start=0, stop=5, step=1))
_preprocess_initial_train_size(y, '2020-01-02') # raises TypeError

  2. Integrate this function with _backtesting_forecaster and backtesting_forecaster in the model_selection module.

Acceptance Criteria

  • [ ] The initial_train_size parameter accepts both integer and date formats.
  • [ ] The function correctly calculates the initial training size when a date is provided.
  • [ ] Existing tests continue to pass.
  • [ ] New test cases are added to verify the correct behavior for both int and date inputs.

Full Example

When initial_train_size is given as the date '2002-01-01 00:00:00', the initial training set must contain 127 observations, and the results must be identical to those obtained with initial_train_size = 127.

# Expected behavior
# ==============================================================================
data = fetch_dataset(name="h2o", kwargs_read_csv={"names": ["y", "datetime"], "header": 0})
initial_train_size = '2002-01-01 00:00:00'

forecaster = ForecasterAutoreg(
                 regressor = LGBMRegressor(random_state=123, verbose=-1),
                 lags      = 15 
             )

cv = TimeSeriesFold(
         steps                 = 10,
         initial_train_size    = initial_train_size,
         refit                 = False,
         fixed_train_size      = False,
         gap                   = 0,
         allow_incomplete_fold = True
     )

metric, predictions = backtesting_forecaster(
                          forecaster            = forecaster,
                          cv                    = cv,
                          y                     = data['y'],
                          metric                = 'mean_squared_error',
                          verbose               = True,
                          show_progress         = True  
                      )
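
The 127-observation figure can be sanity-checked without downloading the dataset. The sketch below uses a synthetic stand-in index and assumes the h2o series is monthly, starts on 1991-07-01 and has 204 observations. Those details are not stated in this issue and should be verified against the real data.

```python
import pandas as pd

# Assumption: synthetic stand-in for the h2o index
# (monthly, month-start frequency, beginning 1991-07-01, 204 observations).
index = pd.date_range('1991-07-01', periods=204, freq='MS')
y = pd.Series(range(len(index)), index=index)

cutoff = pd.Timestamp('2002-01-01 00:00:00')

# Number of observations from the start of the data up to and including the date.
n_train = int((y.index <= cutoff).sum())
print(n_train)  # 127

# Selecting the training set by date or by the equivalent integer gives the same data.
train_by_date = y.loc[:cutoff]
train_by_int = y.iloc[:n_train]
print(train_by_date.equals(train_by_int))  # True
```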

JavierEscobarOrtiz, Sep 26 '24 09:09