datasets [Streaming] Only load requested splits without resolving files for the other splits


Open lhoestq opened this issue 9 months ago • 2 comments

For example, thangvip/cosmopedia_vi_math has 300 splits, and loading even a single split takes a very long time.

This is due to load_dataset() resolving the files of all the splits even if only one is needed.

In dataset-viewer, each split is loaded in a separate job, so this results in 300 jobs that each resolve all 300 splits, i.e. roughly 90k calls (300 × 300) to /paths-info.
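To make the 90k figure concrete, here is a toy model of the cost. It is only a sketch, not the actual `datasets` internals: `count_calls` and the split names are hypothetical, and it assumes one /paths-info call per resolved split.

```python
# Toy model of the call count described above; NOT the real `datasets` code.
# Assumption: resolving one split costs exactly one /paths-info call.
NUM_SPLITS = 300  # number of splits in the example dataset


def count_calls(requested, all_splits, resolve_all):
    """Count hypothetical file-resolution calls for a single load."""
    if resolve_all:
        # Current behavior: every split is resolved, whatever was requested.
        to_resolve = list(all_splits)
    else:
        # Desired behavior: only the requested splits are resolved.
        to_resolve = [s for s in all_splits if s in requested]
    return len(to_resolve)


# Hypothetical split names, for illustration only.
all_splits = [f"split_{i}" for i in range(NUM_SPLITS)]

# dataset-viewer runs one job per split, each requesting a single split.
eager_total = sum(count_calls({s}, all_splits, resolve_all=True) for s in all_splits)
lazy_total = sum(count_calls({s}, all_splits, resolve_all=False) for s in all_splits)

print(eager_total)  # 90000
print(lazy_total)   # 300
```

Under these assumptions, resolving only the requested split would make the total grow linearly with the number of splits instead of quadratically.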

Apr 29 '24 09:04 lhoestq

This should help fix this issue: https://github.com/huggingface/datasets/pull/6832

Apr 29 '24 09:04 lhoestq

I'm having a similar issue when using slices:

It seems to be downloading, loading, and generating splits using the entire dataset.

May 07 '24 04:05 akaashdash