neuralforecast
predict_insample step_size fix
Fixes https://github.com/Nixtla/neuralforecast/issues/866
Solution
- `predict_insample` always sets `step_size=1` at the beginning of the function, but stores the value requested by the user.
- with `step_size=1`, the troubling condition is always satisfied.
- at the end of the function, filter the extra `cutoff` dates of the `fcsts_df` dataframe considering the `step_size` originally requested by the user, so if we have `h=3` and `step_size=2` we'll want indices 0, 1, 2, 6, 7, 8, 12, 13, 14 (first cutoff, third cutoff, fifth cutoff, etc.), as sketched below.
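A minimal, self-contained sketch of the filtering idea (the toy sizes and the `horizon` column are assumptions for illustration, not the library code):

```python
import numpy as np
import pandas as pd

# Toy illustration: full in-sample forecasts for one series produced with
# step_size=1, then filtered down to the user's requested step_size.
h, step_size, n_windows = 3, 2, 5  # hypothetical sizes
fcsts_df = pd.DataFrame({
    'cutoff': np.repeat(np.arange(n_windows), h),       # h rows per cutoff
    'horizon': np.tile(np.arange(1, h + 1), n_windows),
})
# keep the h rows of every step_size-th cutoff
keep_rows = (np.arange(0, n_windows, step_size)[:, None] * h + np.arange(h)).ravel()
print(keep_rows.tolist())                           # [0, 1, 2, 6, 7, 8, 12, 13, 14]
print(fcsts_df.iloc[keep_rows]['cutoff'].unique())  # [0 2 4] -> every other cutoff
```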
@AzulGarza Yes, that makes sense, thanks for the feedback. I've updated the PR description and pushed a new commit that reflects the requested changes.
Hey @isaac-chung, the series may have different starts and ends (thus different cutoffs), so the current solution won't work. We have the "advantage" that there's an issue at the moment where all the series need to have the same length for `predict_insample` to work, so we can make use of that to generate the indices of the rows that we want to keep. The result of the predict is sorted by series, so if we have `h=3` and `step_size=2` we'll want indices 0, 1, 2, 6, 7, 8, 12, 13, 14 (first cutoff, third cutoff, fifth cutoff, etc.), which we can generate with the following:
```python
n_series = len(self.uids)
n_windows = fcsts_df.shape[0] // (n_series * h)
# start row of every step_size-th window, repeated h times
offsets = np.repeat(np.arange(0, n_series * n_windows, step_size), h) * h
# positions 0..h-1 within each kept window
horizons = np.tile(np.arange(h), offsets.size // h)
keep_rows = offsets + horizons
```
and then use these to slice the `fcsts_df`. Keep in mind that this has to work for both pandas and polars, so we have a function to do this:

```python
fcsts_df = ufp.take_rows(fcsts_df, keep_rows)
```
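For reference, a standalone sanity check of the index generation above, under assumed toy sizes (one series, five in-sample windows):

```python
import numpy as np

n_series, n_windows, h, step_size = 1, 5, 3, 2  # assumed toy sizes
offsets = np.repeat(np.arange(0, n_series * n_windows, step_size), h) * h
horizons = np.tile(np.arange(h), offsets.size // h)
keep_rows = offsets + horizons
print(keep_rows.tolist())  # [0, 1, 2, 6, 7, 8, 12, 13, 14]
```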
Please do this slicing after the horizontal concat https://github.com/Nixtla/neuralforecast/blob/0c2d196c08ff592a3203bae059e422cc6e63127c/neuralforecast/core.py#L921 to reduce the size of the following join.
Also please add a test to verify that the cutoffs that we get are the expected ones, e.g. step_size = 2 should skip one period in every cutoff, step_size = 3 should skip two periods, etc.
Thanks @jmoralez. For this test, do you have a preference on where it lives? e.g. a new cell in `nbs/core.ipynb`?
There's a cell with the comment `# Test predict_insample` in `nbs/core.ipynb`; please add the test in that cell, below the current ones, with the comment `# test step_size` or similar.
@jmoralez thanks again. See if this aligns with what we had discussed above.
Are you not able to run the test because of this?
Also, a better test would be checking the cutoffs explicitly, something like this:

```python
train_start = AirPassengersPanel_train['ds'].min()
first_cutoff = train_start - pd.offsets.MonthEnd()
sizes = AirPassengersPanel_train['unique_id'].value_counts().to_numpy()
for step_size in range(1, 4):
    # forecasts is assumed to be regenerated for each step_size, e.g.
    # forecasts = nf.predict_insample(step_size=step_size)
    n_expected_cutoffs = sizes[0] // step_size
    expected_cutoffs = np.array(
        [first_cutoff + step_size * i * pd.offsets.MonthEnd() for i in range(n_expected_cutoffs)]
    )
    np.testing.assert_array_equal(forecasts['cutoff'].unique(), expected_cutoffs)
```
I wasn't able to run it before as my system has more than 1 GPU. I ran `export CUDA_VISIBLE_DEVICES=0` and it seems to have been resolved.
Running the test in the suggestion fails for all step sizes. It seems like `forecasts['cutoff'].unique()` was missing a year (the last year, 1959 in the test). Does that sound expected? I am trying to understand whether `expected_cutoffs` got an extra year or `forecasts` missed a year.
Probably the expected cutoffs got an extra year. The last cutoff should be one year before the last date in the train dataset (because `h=12` and monthly freq), which makes me think we may also need to generate the indices starting from the back, because we may truncate in some situations and leave that one out.
Makes sense. I updated the test to not include the last h periods. What do you mean by "generate the indices starting from the back"?
We always want to have the last cutoff, so if `h=2`, `step_size=3` and the length is 7, we currently compute `n_windows = length // h = 7 // 2 = 3` and then we generate `offsets=[0, 0]`, `horizons=[0, 1]`, `keep_rows=[0, 1]`, so we're only keeping the first offset when we would like to keep the last instead. So we should modify the indices generation to account for this.
I'm confused as to 1) whether the last cutoff is needed in addition to the first one, and 2) what we want to keep in that example, where `n_windows = 3`, `h=2`, `step_size=3` and `length = 7`:

- would the forecasts span [0, 1], [3, 4], [6, 7]?

Also, just to make sure I follow what the method is supposed to be doing, I'm reading the diagram in `nbs/examples/PredictInsample.ipynb`, where `h=4` and `step_size=2`. It seems to contradict your earlier example:

> so if we have h=3 and step_size=2 we'll want indices 0, 1, 2, 6, 7, 8, 12, 13, 14 (first cutoff, third cutoff, fifth cutoff, etc)

Wouldn't we want [0, 1, 2], [2, 3, 4], [4, 5, 6], etc.?
> whether the last cutoff is needed in addition to the first one

If there's only one possible cutoff (because there are 3, for example, and the step size is 3), we want the last one.

> what we want to keep in that example, where n_windows = 3, h=2, step_size=3 and length = 7: would the forecasts span [0,1], [3,4], [6,7] (?)
In this case I believe the full in-sample predictions (with `step_size=1`) would look like this:

cutoff | ds
---|---
0 | 1
0 | 2
1 | 2
1 | 3
2 | 3
2 | 4
3 | 4
3 | 5
4 | 5
4 | 6
5 | 6
5 | 7
In this case, with the current index generation, I think we'd take rows [0, 1, 6, 7], corresponding to cutoffs 0 and 3, and we'd lose the last cutoff (5). I think we should instead take rows [4, 5, 10, 11], which correspond to cutoffs 2 and 5. @cchallu do you agree with this?
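To make the desired behavior concrete, here is a hedged sketch of generating the window indices from the back so the last cutoff always survives (toy sizes taken from the example above; not the final implementation):

```python
import numpy as np

# Example above: series length 7, h=2 -> 6 in-sample windows (cutoffs 0..5).
h, step_size, n_windows = 2, 3, 6
# count window starts backwards from the last window, then sort them
window_idx = np.sort(np.arange(n_windows - 1, -1, -step_size))  # [2, 5]
keep_rows = (window_idx[:, None] * h + np.arange(h)).ravel()
print(keep_rows.tolist())  # [4, 5, 10, 11] -> cutoffs 2 and 5, last one kept
```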
Yes, we always want to keep the last cutoff. This is `predict_insample`, so users do not specify the `n_windows` parameter. We should always return the last cutoff and consider the `step_size` backwards.
@isaac-chung can you please fix the failing test?
@jmoralez will do first thing when I'm back online in 2 weeks! 👍
@jmoralez I found why the test is failing. The actual cutoffs take `nf.uids` into account (`n_series = 2`). The fix is in the newest commit.
Please also add the following check to the tests (which currently fails):

```python
cutoffs_by_serie = forecasts.groupby(['unique_id', 'cutoff']).size().unstack('unique_id')
pd.testing.assert_series_equal(
    cutoffs_by_serie['Airline1'],
    cutoffs_by_serie['Airline2'],
    check_names=False,  # the two columns keep their own series names
)
```
Since the series lengths will be the same, you should be able to generate the offsets once, tile them, and add `n_windows * series_idx` or similar.
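One possible reading of that suggestion, as a hedged sketch (all sizes and names here are assumptions for illustration):

```python
import numpy as np

# Assumed toy sizes: 2 series of equal length, 6 in-sample windows each, h=2.
n_series, n_windows, h, step_size = 2, 6, 2, 3
# per-series row indices generated once, from the back (see the sketch above)
window_idx = np.sort(np.arange(n_windows - 1, -1, -step_size))  # [2, 5]
base_rows = (window_idx[:, None] * h + np.arange(h)).ravel()    # [4, 5, 10, 11]
# shift by each series' block of n_windows * h rows
keep_rows = np.concatenate([base_rows + s * n_windows * h for s in range(n_series)])
print(keep_rows.tolist())  # [4, 5, 10, 11, 16, 17, 22, 23]
```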
Apologies - I no longer have time in the next 3-4 weeks to work on this. Feel free to reassign this to someone else. Thanks for the support thus far.