tsai
failed to create dataloader for large dataset
Hey, thank you for this package!
I'm having a bit of trouble: I run out of memory when I try to create a DataLoader using get_ts_dls() - my data is about 54 GB. I've read the tutorial about zarr and memory-mapped np arrays for large datasets, so that is not the problem.
I run the following block:
tfms = [None, TSClassification()]
batch_tfms = TSStandardize()
dls = get_ts_dls(X, y, splits=splits, tfms=tfms, batch_tfms=batch_tfms, bs=[64, 128], num_workers=0)
and then I get "Unable to allocate 52.4 GiB for an array with shape (2604215, 30, 180) and data type float32"
Can you please help me with creating a DataLoader for this large dataset?
There is an official tutorial for your problem: https://colab.research.google.com/github/timeseriesAI/tsai/blob/master/tutorial_nbs/11_How_to_train_big_arrays_faster_with_tsai.ipynb
In short, use zarr arrays or np.memmap instead of trying to load your whole dataset into memory.
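For example, here is a minimal sketch of keeping X on disk with np.memmap (the file name is just a placeholder and the shape below is taken from your error message):

import numpy as np

# Write the data to disk once as a memory-mapped float32 array,
# filling it in chunks so it never has to fit in RAM.
n_samples, n_vars, seq_len = 2_604_215, 30, 180  # shape from the error message
X_mm = np.memmap('X_on_disk.dat', dtype='float32', mode='w+',
                 shape=(n_samples, n_vars, seq_len))
# ... fill X_mm slice by slice here, e.g. X_mm[start:end] = chunk ...
X_mm.flush()

# Reopen it read-only and pass this X to get_ts_dls instead of an in-memory array
X = np.memmap('X_on_disk.dat', dtype='float32', mode='r',
              shape=(n_samples, n_vars, seq_len))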
Hey, thanks for the answer...
I actually do use np.memmap and do not load the whole dataset into memory. My RAM issue occurs when I run the following line:
learn.fit_one_cycle(15, 1e-2)
My DataLoader is based on a np.memmap array and I use a subsample of my whole data. It seems that in the first epoch the learner loads the data into memory (I can see my memory usage increasing only during the first epoch); after that the memory usage stays high but does not increase further.
I've read the tutorial multiple times, but I still haven't found the answer.
I've only tested it with zarr arrays on the AMEX Kaggle data, which is bigger than my memory, and that works perfectly.
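In case it helps, building such a zarr array incrementally looks roughly like this (the path, shape, chunk size and the load_chunk helper are all placeholders, not part of your code):

import zarr

# Create the zarr array on disk once and fill it chunk by chunk,
# so the full dataset never has to sit in memory.
n_samples, n_vars, seq_len = 2_604_215, 30, 180  # your full dataset shape
X_z = zarr.open('X_large.zarr', mode='w', shape=(n_samples, n_vars, seq_len),
                chunks=(4_096, n_vars, seq_len), dtype='float32')
for start in range(0, n_samples, 4_096):
    end = min(start + 4_096, n_samples)
    X_z[start:end] = load_chunk(start, end)  # load_chunk: hypothetical helper that reads one slice

# Reopen read-only before passing it to get_ts_dls
X_large_zarr = zarr.open('X_large.zarr', mode='r')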
Maybe your batch size is too high?
My code looks like this:
X_large_zarr = zarr.open('zarr.zarr', mode='r')
y_large_zarr = zarr.open('y_large.zarr', mode='r')
...
tfms = [None, TSClassification()]
batch_tfms = TSStandardize(by_sample=True)
dls = get_ts_dls(X_large_zarr, y_large_zarr, splits=splits, tfms=tfms, batch_tfms=batch_tfms, inplace=False, bs=[4_096, 4_096], num_workers=12) # 2*cpus
learn = ts_learner(dls, InceptionTimePlus, metrics=accuracy, cbs=ShowGraph())
4096 samples per batch could be too high for you. Check for yourself with lower values.
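For instance (the values are only illustrative), rebuild the dataloaders from the same arrays with a much smaller batch size and watch your memory usage:

# Same call as above, just with a smaller batch size (illustrative values);
# X_large_zarr, y_large_zarr, splits, tfms and batch_tfms are the ones defined earlier.
dls = get_ts_dls(X_large_zarr, y_large_zarr, splits=splits, tfms=tfms,
                 batch_tfms=batch_tfms, inplace=False, bs=[256, 512], num_workers=12)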
Hi @AvivAlloni, could you please confirm whether the proposed solution worked for you? Please bear this in mind:
When creating the dataloaders, there are 2 important things to remember:
* If you are dealing with a classification task and need to transform the levels, use TSClassification. It's a vectorized version of Categorize that runs much faster.
* Set inplace to False. This is required when the data doesn't fit in memory. If you don't use it, your system will crash and you will have to start again.
I'll close this issue due to the lack of response.