neuralforecast
Issue-409 Add support for datasets that can't fit in memory
As described in this issue: https://github.com/Nixtla/neuralforecast/issues/409
We assume the dataset is split across multiple parquet files, where each parquet file holds a single timeseries stored as a pandas dataframe. This PR creates a new Dataset class whose __getitem__ method reads the parquet file corresponding to the requested index, and whose from_data_directory() method replicates the from_df() method.
I have added a test to the end of core.ipynb that checks that the forecasts produced with this distributed dataset match those produced when the dataset is passed in directly as a pandas dataframe.
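To make the idea in the description concrete, here is a minimal sketch of a map-style dataset that keeps only file paths in memory and loads one series per access. It is an illustration under stated assumptions, not the class added in this PR: the names LazyFileDataset and read_parquet_file, and the pluggable loader argument, are hypothetical.

```python
from pathlib import Path
from typing import Any, Callable, Sequence


def read_parquet_file(path: Path) -> Any:
    """Hypothetical default loader: read one series from a parquet file.

    Needs pandas plus a parquet engine; imported lazily so the class
    below has no hard dependency on them.
    """
    import pandas as pd

    return pd.read_parquet(path)


class LazyFileDataset:
    """Hypothetical dataset that stores paths and loads each series on demand."""

    def __init__(
        self,
        files: Sequence[str],
        loader: Callable[[Path], Any] = read_parquet_file,
    ) -> None:
        # Sort so the index -> file mapping is deterministic across runs.
        self.files = sorted(Path(f) for f in files)
        self.loader = loader

    def __len__(self) -> int:
        # One item per file, i.e. one item per series.
        return len(self.files)

    def __getitem__(self, idx: int) -> Any:
        # Nothing is cached here, so only the requested series is read.
        return self.loader(self.files[idx])
```

Because the loader is injectable, the same class can be exercised with any file format; with the default loader, indexing the dataset reads exactly one parquet file.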
Thanks a lot for your contribution @jasminerienecker, I left some comments.
Thanks a lot for working through the changes! I left some more enhancement ideas
Thanks a lot @jasminerienecker! I think after these changes it'll be ready
Last request: it seems that the last test in core is duplicated in the cell above, can you please remove it? Also, please write to a temporary directory so that we don't leave files behind, e.g.
with tempfile.TemporaryDirectory() as tmpdir:
    AirPassengersPanel_train.to_parquet(tmpdir, partition_cols=['unique_id'], index=False)
    data_directory = sorted([str(path) for path in Path(tmpdir).iterdir()])
    pred_df = AirPassengersPanel_train[AirPassengersPanel_train['unique_id'] == 'Airline2'].drop(columns='unique_id')
    futr_df = AirPassengersPanel_test[AirPassengersPanel_test['unique_id'] == 'Airline2'].drop(columns='unique_id')
    nf.fit(df=data_directory, use_init_models=True, id_col='id')
You can place the import tempfile in the cell with all the imports:
#| hide
import tempfile # <- place it here
import matplotlib.pyplot as plt
import pytorch_lightning as pl
import neuralforecast
from ray import tune
from neuralforecast.auto import (
    AutoMLP, AutoNBEATS,
    AutoRNN, AutoTCN, AutoDilatedRNN,
)
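As a small illustration of why the temporary directory avoids leaving files behind, the sketch below builds a placeholder layout and checks that it is gone once the with block exits. The sub-directory names mimic the hive-style unique_id=<value> folders that a partitioned parquet write produces, but the files here are empty stand-ins and the variable names are assumptions made for this example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmpdir:
    root = Path(tmpdir)
    # Placeholder layout only: one sub-directory per series.
    for name in ("unique_id=Airline2", "unique_id=Airline1"):
        (root / name).mkdir()
        # Empty stand-in, not a real parquet file.
        (root / name / "part-0.parquet").touch()

    # Same listing expression as in the suggestion; sorting fixes the series order.
    data_directory = sorted(str(path) for path in root.iterdir())
    names = [Path(p).name for p in data_directory]
    existed_inside = root.exists()

# Leaving the `with` block deletes the directory and everything in it.
exists_after = root.exists()
```

Anything that reads from data_directory therefore has to run inside the with block, which is why the fit call sits inside it in the suggestion.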
Ah I completely missed that duplication. I've adjusted the test now and I think all the other comments have been resolved - let me know if you spot anything else though. In general I've been running this locally for model training and so far it seems to be working well!
Can you check the failing test? There's a cell above that removes the trend column, so you can try adding it back or using one of the features that the other cell introduces.
The CI is stuck, I'll try closing and reopening.
@jmoralez ah right, I've moved the test to a more relevant section of the notebook as it seems by the end the train and test datasets didn't have aligned columns. It was working when I ran it locally so hopefully the tests will now pass in the PR too
Can you try pushing an empty commit? Closing didn't work haha
Hmm it still seems to be frozen...
@jmoralez Sure, here's a basic tutorial showing the expected format of the data and briefly demonstrating how to fit a model and predict using this DataLoader: https://github.com/Nixtla/neuralforecast/pull/1074 - happy to expand on some areas if it'd be useful, but I didn't want to duplicate too much from other tutorials!