Issue-409 Add support for datasets that can't fit in memory

Open jasminerienecker opened this issue 1 year ago • 4 comments

As described in this issue: https://github.com/Nixtla/neuralforecast/issues/409

We assume the dataset is split across multiple parquet files, where each parquet file corresponds to a single time series represented as a pandas dataframe. This PR creates a new Dataset class whose __getitem__ method reads the parquet file corresponding to the requested index, and whose from_data_directory() method replicates the from_df() method.
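To illustrate the lazy-loading idea described above, here is a minimal sketch of a dataset that maps each index to one file and reads it only on access. It is not the class added in this PR: LazyFileDataset, read_parquet_file, and read_json_file are hypothetical names, and the from_data_directory signature shown here is an assumption made for the example. The reader is pluggable so the demo can run on small JSON files using only the standard library; the default reader assumes pandas with a parquet engine is installed.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable, Sequence


def read_parquet_file(path: str) -> Any:
    """Default reader (assumption): load one series from a parquet file."""
    import pandas as pd  # imported lazily so the class has no hard dependency

    return pd.read_parquet(path)


class LazyFileDataset:
    """Hypothetical sketch: one file per time series, read only when indexed."""

    def __init__(
        self,
        paths: Sequence[str],
        reader: Callable[[str], Any] = read_parquet_file,
    ) -> None:
        self.paths = list(paths)
        self.reader = reader

    @classmethod
    def from_data_directory(
        cls,
        directory: str,
        reader: Callable[[str], Any] = read_parquet_file,
    ) -> "LazyFileDataset":
        # Sort so the index -> file mapping is deterministic across runs.
        paths = sorted(str(p) for p in Path(directory).iterdir())
        return cls(paths, reader)

    def __len__(self) -> int:
        return len(self.paths)

    def __getitem__(self, idx: int) -> Any:
        # Only the requested series is loaded; nothing is cached.
        return self.reader(self.paths[idx])


def read_json_file(path: str) -> Any:
    """Stand-in reader for the demo, so it runs without pandas."""
    with open(path) as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as tmpdir:
    # Write one small file per series, mimicking the one-file-per-series layout.
    for name, values in [("Airline1", [1, 2, 3]), ("Airline2", [4, 5])]:
        Path(tmpdir, f"{name}.json").write_text(json.dumps({"y": values}))

    dataset = LazyFileDataset.from_data_directory(tmpdir, reader=read_json_file)
    n_series = len(dataset)
    second_series = dataset[1]  # reads Airline2.json only at this point
```

Because only file paths are kept in the object, memory use scales with the number of series rather than their total size, at the cost of one file read per access.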

I have added a test to the end of core.ipynb that checks that the forecasts produced using this distributed dataset are the same as when the dataset is passed in directly as a pandas dataframe.

jasminerienecker avatar Jul 01 '24 03:07 jasminerienecker

Thanks a lot for your contribution @jasminerienecker, I left some comments.

jmoralez avatar Jul 03 '24 18:07 jmoralez

Thanks a lot for working through the changes! I left some more enhancement ideas

jmoralez avatar Jul 04 '24 01:07 jmoralez

Thanks a lot @jasminerienecker! I think after these changes it'll be ready

jmoralez avatar Jul 04 '24 19:07 jmoralez

Last request: the last test in core appears to duplicate the one in the cell above. Can you please remove it? Also, please write to a temporary directory so that we don't leave files behind, e.g.

with tempfile.TemporaryDirectory() as tmpdir:
    AirPassengersPanel_train.to_parquet(tmpdir, partition_cols=['unique_id'], index=False)
    data_directory = sorted([str(path) for path in Path(tmpdir).iterdir()])
    
    pred_df = AirPassengersPanel_train[AirPassengersPanel_train['unique_id'] == 'Airline2'].drop(columns='unique_id')
    futr_df = AirPassengersPanel_test[AirPassengersPanel_test['unique_id'] == 'Airline2'].drop(columns='unique_id')
    
    nf.fit(df=data_directory, use_init_models=True, id_col='id')

You can place the import tempfile in the cell with all the imports:

#| hide
import tempfile # <- place it here

import matplotlib.pyplot as plt
import pytorch_lightning as pl

import neuralforecast
from ray import tune

from neuralforecast.auto import (
    AutoMLP, AutoNBEATS, 
    AutoRNN, AutoTCN, AutoDilatedRNN,
)

jmoralez avatar Jul 18 '24 17:07 jmoralez

Ah I completely missed that duplication. I've adjusted the test now and I think all the other comments have been resolved - let me know if you spot anything else though. In general I've been running this locally for model training and so far it seems to be working well!

jasminerienecker avatar Jul 18 '24 22:07 jasminerienecker

Can you check the failing test? There's a cell above that removes the trend column. You can try adding it again or using one of the features that the other cell introduces.

jmoralez avatar Jul 18 '24 23:07 jmoralez

The CI is stuck, I'll try closing and reopening.

jmoralez avatar Jul 19 '24 00:07 jmoralez

@jmoralez ah right, I've moved the test to a more relevant section of the notebook as it seems by the end the train and test datasets didn't have aligned columns. It was working when I ran it locally so hopefully the tests will now pass in the PR too

jasminerienecker avatar Jul 19 '24 00:07 jasminerienecker

Can you try pushing an empty commit? Closing didn't work haha

jmoralez avatar Jul 19 '24 00:07 jmoralez

Hmm it still seems to be frozen...

jasminerienecker avatar Jul 19 '24 00:07 jasminerienecker

@jmoralez Sure, here's a basic tutorial showing the expected format of the data and briefly demonstrating how to fit a model and predict using this DataLoader: https://github.com/Nixtla/neuralforecast/pull/1074. I'm happy to expand on some areas if it'd be useful, but I didn't want to duplicate too much from other tutorials!

jasminerienecker avatar Jul 22 '24 01:07 jasminerienecker