skforecast
Good First Issue: Allow `initial_train_size` in `backtesting_forecaster` to accept date values
THIS ISSUE IS INTENDED TO BE SOLVED AT THE HACKTOBERFEST https://www.meetup.com/es-ES/pydata-madrid/events/303470661/
Use branch `0.14.x` as base.
Summary
Currently, the `initial_train_size` parameter in the `backtesting_forecaster` function only accepts an integer value. This integer defines how many observations to use as the initial training set. We would like to extend this functionality so that `initial_train_size` can also accept a date (e.g., `'2020-01-01'`). If a date is provided, the function should calculate the appropriate number of observations corresponding to the time window between the start of the data and the given date.
Task
- Create an auxiliary function, `_preprocess_initial_train_size(y: pd.Series, initial_train_size)`, in the `utils` module:
  - `initial_train_size` can be an integer or any datetime format that pandas allows to be passed to a `pd.DatetimeIndex` (e.g., string, pandas timestamp...).
  - If `y` does not have a `pd.DatetimeIndex` and `initial_train_size` is not an integer, raise a `TypeError` with the message: "If `y` does not have a pd.DatetimeIndex, `initial_train_size` must be an integer."
  - If the series `y` has a `pd.DatetimeIndex`, this function will return the length of the time window between the start of the data and the given date as an integer value. The given date must be included in the window.
  - If the input `initial_train_size` is an integer, return the same integer.
  - Create unit tests using pytest in the `utils.tests` folder.
# Expected behavior
# ==============================================================================
import pandas as pd

y = pd.Series([1, 2, 3, 4, 5], index=pd.date_range('2020-01-01', periods=5, freq='D'))
_preprocess_initial_train_size(y, '2020-01-02') # expected output: 2
y = pd.Series([1, 2, 3, 4, 5], index=pd.date_range('2020-01-01', periods=5, freq='D'))
_preprocess_initial_train_size(y, 2) # expected output: 2
y = pd.Series([1, 2, 3, 4, 5], index=pd.RangeIndex(start=0, stop=5, step=1))
_preprocess_initial_train_size(y, '2020-01-02') # expected: raises TypeError
- Integrate this function with `_backtesting_forecaster` and `backtesting_forecaster` in the `model_selection` module.
Acceptance Criteria
- [ ] The `initial_train_size` parameter accepts both integer and date formats.
- [ ] The function correctly calculates the initial training size when a date is provided.
- [ ] Existing tests continue to pass.
- [ ] New test cases are added to verify the correct behavior for both int and date inputs.
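The new test cases could look roughly like the sketch below. The inline `_preprocess_initial_train_size` is only a stand-in so the example runs on its own; in the real test module the helper would be imported from the `utils` module once it exists. The test names and the choice of inputs are assumptions, not part of the issue.

```python
import pandas as pd
import pytest


def _preprocess_initial_train_size(y, initial_train_size):
    # Stand-in for the helper requested in this issue (hypothetical body).
    if isinstance(initial_train_size, int):
        return initial_train_size
    if not isinstance(y.index, pd.DatetimeIndex):
        raise TypeError(
            "If `y` does not have a pd.DatetimeIndex, "
            "`initial_train_size` must be an integer."
        )
    date = pd.Timestamp(initial_train_size)
    return int(y.index.searchsorted(date, side="right"))


Y_DATES = pd.Series(
    [1, 2, 3, 4, 5], index=pd.date_range("2020-01-01", periods=5, freq="D")
)
Y_RANGE = pd.Series([1, 2, 3, 4, 5], index=pd.RangeIndex(start=0, stop=5, step=1))


@pytest.mark.parametrize(
    "initial_train_size", ["2020-01-02", pd.Timestamp("2020-01-02"), 2]
)
def test_returns_number_of_observations(initial_train_size):
    # String, Timestamp and int inputs should all resolve to 2 observations.
    assert _preprocess_initial_train_size(Y_DATES, initial_train_size) == 2


def test_integer_is_returned_unchanged_without_datetime_index():
    assert _preprocess_initial_train_size(Y_RANGE, 3) == 3


def test_date_without_datetime_index_raises_type_error():
    with pytest.raises(TypeError, match="must be an integer"):
        _preprocess_initial_train_size(Y_RANGE, "2020-01-02")
```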
Full Example
The initial training set must contain 127 observations, and the results must be the same as with `initial_train_size = 127`.
# Expected behavior
# ==============================================================================
data = fetch_dataset(name="h2o", kwargs_read_csv={"names": ["y", "datetime"], "header": 0})

initial_train_size = '2002-01-01 00:00:00'

forecaster = ForecasterAutoreg(
    regressor = LGBMRegressor(random_state=123, verbose=-1),
    lags = 15
)

cv = TimeSeriesFold(
    steps = 10,
    initial_train_size = initial_train_size,
    refit = False,
    fixed_train_size = False,
    gap = 0,
    allow_incomplete_fold = True
)

metric, predictions = backtesting_forecaster(
    forecaster = forecaster,
    cv = cv,
    y = data['y'],
    metric = 'mean_squared_error',
    verbose = True,
    show_progress = True
)
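The arithmetic behind the 127 observations can be checked without downloading the dataset. The sketch below uses a synthetic stand-in and assumes a monthly series of 204 points starting on 1991-07-01; both the start date and the length are assumptions made for illustration, not values stated in this issue.

```python
import pandas as pd

# Synthetic stand-in: monthly (month-start) index, assumed to begin in July 1991.
index = pd.date_range("1991-07-01", periods=204, freq="MS")
y = pd.Series(range(204), index=index, name="y")

cutoff = pd.Timestamp("2002-01-01 00:00:00")

# Count the observations up to and including the cutoff date.
n_train = int((y.index <= cutoff).sum())
print(n_train)  # 127

# Splitting by position and splitting by date select the same training set.
same_split = y.iloc[:n_train].equals(y.loc[:cutoff])
print(same_split)  # True
```

Under these assumptions, July 1991 through December 2001 gives 126 monthly observations, and including January 2002 gives 127, so passing the date and passing `127` describe the same initial training set.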