
Extreme inefficiency for `save_to_disk` when merging datasets

Open · KatarinaYuan opened this issue 1 year ago · 1 comment

Describe the bug

Hi, I tried to merge a total of 22M sequences of data, where each sequence has a maximum length of 2000. I found that merging these datasets and then calling `save_to_disk` is extremely slow because of flattening the indices. Do you have any suggestions or guidance on this? Thank you very much!
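
For reference, the merge-and-save pattern in question presumably looks something like the sketch below; the paths, shard count, and any intermediate operations are assumptions, since the actual code isn't shown:

```python
from datasets import load_from_disk, concatenate_datasets

# Hypothetical shard paths and count; the real merging code was not shared.
shards = [load_from_disk(f"data/shard_{i}") for i in range(22)]
merged = concatenate_datasets(shards)

# save_to_disk only flattens indices if an indices mapping exists
# (e.g. after shuffle/select/filter); that flattening is the slow step.
merged.save_to_disk("data/merged")
```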

Steps to reproduce the bug

The source data is too large to share a reproduction.

Expected behavior

The source data is too large to demonstrate the expected behavior.

Environment info

- python 3.9.0
- datasets 2.7.0
- pytorch 2.0.0
- tokenizers 0.13.1
- transformers 4.31.0

KatarinaYuan · Dec 29 '23 00:12

Concatenating datasets doesn't create any indices mapping, so flattening the indices is not needed (unless you shuffle the dataset). Can you share the snippet of code you are using to merge your datasets and save them to disk?
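
To illustrate, here is a minimal sketch against the `datasets` API; `_indices` is a private attribute, inspected here only to show when an indices mapping exists:

```python
from datasets import Dataset, concatenate_datasets

# Two tiny in-memory datasets standing in for the real shards.
ds1 = Dataset.from_dict({"seq": ["a", "b", "c"]})
ds2 = Dataset.from_dict({"seq": ["d", "e", "f"]})

# Plain concatenation keeps the underlying Arrow tables as-is,
# so there is no indices mapping and nothing to flatten.
merged = concatenate_datasets([ds1, ds2])
print(merged._indices is None)  # True: no indices mapping

# Shuffling (like select/filter/sort) adds an indices mapping; saving such
# a dataset triggers the expensive flatten_indices() rewrite of every row.
shuffled = merged.shuffle(seed=42)
print(shuffled._indices is None)  # False: mapping present

merged.save_to_disk("merged_dataset")      # fast: writes the tables directly
shuffled.save_to_disk("shuffled_dataset")  # slow on 22M rows: flattens first
```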

lhoestq · Dec 30 '23 15:12