
Extreme inefficiency for `save_to_disk` when merging datasets

Open · KatarinaYuan opened this issue 1 year ago · 1 comment

Describe the bug

Hi, I tried to merge a total of 22M sequences of data, where each sequence has a maximum length of 2000. I found that merging these datasets and then calling `save_to_disk` is extremely slow because of flattening the indices. Do you have any suggestions or guidance on this? Thank you very much!
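
For reference, the merge-and-save pattern in question presumably looks something like the sketch below; the paths, shard count, and any intermediate operations are assumptions, since the actual code isn't shown:

```python
from datasets import load_from_disk, concatenate_datasets

# Hypothetical shard paths and count; the real merging code was not shared.
shards = [load_from_disk(f"data/shard_{i}") for i in range(22)]
merged = concatenate_datasets(shards)

# save_to_disk only flattens indices if an indices mapping exists
# (e.g. after shuffle/select/filter); that flattening is the slow step.
merged.save_to_disk("data/merged")
```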

Steps to reproduce the bug

The source data is too large to share a reproduction.

Expected behavior

The source data is too large to demonstrate the expected behavior.

Environment info

- python 3.9.0
- datasets 2.7.0
- pytorch 2.0.0
- tokenizers 0.13.1
- transformers 4.31.0

KatarinaYuan · Dec 29 '23 00:12

Concatenating datasets doesn't create any indices mapping, so flattening the indices is not needed (unless you shuffle the dataset). Can you share the snippet of code you are using to merge your datasets and save them to disk?
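
To illustrate, here is a minimal sketch against the `datasets` API; `_indices` is a private attribute, inspected here only to show when an indices mapping exists:

```python
from datasets import Dataset, concatenate_datasets

# Two tiny in-memory datasets standing in for the real shards.
ds1 = Dataset.from_dict({"seq": ["a", "b", "c"]})
ds2 = Dataset.from_dict({"seq": ["d", "e", "f"]})

# Plain concatenation keeps the underlying Arrow tables as-is,
# so there is no indices mapping and nothing to flatten.
merged = concatenate_datasets([ds1, ds2])
print(merged._indices is None)  # True: no indices mapping

# Shuffling (like select/filter/sort) adds an indices mapping; saving such
# a dataset triggers the expensive flatten_indices() rewrite of every row.
shuffled = merged.shuffle(seed=42)
print(shuffled._indices is None)  # False: mapping present

merged.save_to_disk("merged_dataset")      # fast: writes the tables directly
shuffled.save_to_disk("shuffled_dataset")  # slow on 22M rows: flattens first
```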

lhoestq · Dec 30 '23 15:12