[Streaming] Only load requested splits without resolving files for the other splits
For example, thangvip/cosmopedia_vi_math has 300 splits, and loading just one of them takes a very long time.
This is because load_dataset() resolves the files of all the splits even when only one is requested.
In dataset-viewer, each split is loaded in a separate job, so this results in 300 jobs that each resolve all 300 splits, i.e. 300 x 300 = 90k calls to /paths-info.
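To make the 90k figure concrete, here is a minimal sketch that counts resolution calls when every job resolves all splits versus only the requested one. It is not the actual datasets implementation: `resolve_splits`, `count_calls`, the fake resolver, and the file patterns are all hypothetical, and it assumes one resolution call per split stands in for one call to /paths-info.

```python
from typing import Callable, Dict, List, Optional


def resolve_splits(
    split_patterns: Dict[str, str],
    resolve: Callable[[str], List[str]],
    requested: Optional[str] = None,
) -> Dict[str, List[str]]:
    """Resolve files per split; if `requested` is given, resolve only that split."""
    names = [requested] if requested is not None else list(split_patterns)
    return {name: resolve(split_patterns[name]) for name in names}


def count_calls(num_splits: int, lazy: bool) -> int:
    """Run one job per split and count how many resolution calls are made in total."""
    calls = 0

    def fake_resolve(pattern: str) -> List[str]:
        # Hypothetical stand-in for a remote call such as /paths-info.
        nonlocal calls
        calls += 1
        return [pattern]

    # Hypothetical file patterns, one per split.
    patterns = {f"split_{i}": f"data/split_{i}-*.parquet" for i in range(num_splits)}

    # One job per split, mirroring how dataset-viewer is described above.
    for name in patterns:
        resolve_splits(patterns, fake_resolve, requested=name if lazy else None)
    return calls


eager_calls = count_calls(300, lazy=False)  # every job resolves all splits
lazy_calls = count_calls(300, lazy=True)    # every job resolves only its own split
print(eager_calls, lazy_calls)
```

Under these assumptions, resolving only the requested split makes the total number of calls grow linearly with the number of splits instead of quadratically.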
This should help fix this issue: https://github.com/huggingface/datasets/pull/6832
I'm having a similar issue when using slices:
It seems to be downloading, loading, and generating splits using the entire dataset.