Extreme inefficiency for `save_to_disk` when merging datasets
Describe the bug
Hi, I tried to merge a total of 22M sequences of data, where each sequence has a maximum length of 2000. I found that merging these datasets and then calling `save_to_disk` is extremely slow because of flattening the indices. Do you have any suggestions or guidance on this? Thank you very much!
Steps to reproduce the bug
The source data is too large to share a full reproduction; a hypothetical sketch of the pattern is below.
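A minimal sketch of the reported pattern, using small synthetic data in place of the original 22M sequences (the shard contents and sizes here are placeholders, not the reporter's actual code):

```python
from datasets import Dataset, concatenate_datasets

# Stand-ins for the many shards being merged (synthetic data).
shards = [
    Dataset.from_dict({"input_ids": [[i] * 10 for i in range(1_000)]})
    for _ in range(5)
]

merged = concatenate_datasets(shards)

# If the merged dataset was shuffled (or row-selected) beforehand, it carries
# an indices mapping, and save_to_disk flattens it - the reported slow step.
merged = merged.shuffle(seed=42)
merged.save_to_disk("merged_dataset")
```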
Expected behavior
Saving the merged dataset to disk should not be dominated by flattening the indices.
Environment info
- python 3.9.0
- datasets 2.7.0
- pytorch 2.0.0
- tokenizers 0.13.1
- transformers 4.31.0
Concatenating datasets doesn't create any indices mapping, so flattening indices is not needed (unless you shuffle the dataset). Can you share the snippet of code you are using to merge your datasets and save them to disk?
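For reference, a short sketch of the distinction described above (assumed usage with toy data, not the reporter's actual code):

```python
from datasets import Dataset, concatenate_datasets

parts = [
    Dataset.from_dict({"text": [f"row {i}" for i in range(100)]})
    for _ in range(3)
]

# Pure concatenation keeps the underlying Arrow tables as-is: there is no
# indices mapping, so save_to_disk has nothing to flatten.
merged = concatenate_datasets(parts)
merged.save_to_disk("merged_fast")

# Shuffling (or .select / .train_test_split) adds an indices mapping; saving
# then rewrites the data in the new order, which is the expensive flattening.
shuffled = merged.shuffle(seed=0)
shuffled.save_to_disk("merged_shuffled_slow")
```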