Impossible to only download a test split

Open ysig opened this issue 2 years ago • 2 comments

I've spent a significant amount of time trying to locate the split object inside my custom _split_generators() function. After diving into the code, I realized that download_and_prepare is executed before split is passed to the dataset builder in as_dataset, so the requested split is not known at download time.
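To illustrate the ordering described above, here is a minimal sketch of the two-step flow, assuming the public `load_dataset_builder`, `download_and_prepare`, and `as_dataset` calls from `datasets`. The helper name `load_split_two_step` and its `builder_factory` parameter are hypothetical, added only so the order of calls is easy to see and test; this is not the library's internal code.

```python
from typing import Any, Callable, Optional


def load_split_two_step(
    path: str,
    split: str,
    builder_factory: Optional[Callable[[str], Any]] = None,
) -> Any:
    """Prepare a dataset, then select one split (hypothetical helper)."""
    if builder_factory is None:
        # Imported lazily so the sketch can be read without the library installed.
        from datasets import load_dataset_builder

        builder_factory = load_dataset_builder

    builder = builder_factory(path)
    # Step 1: download and prepare. No split is passed here, so
    # _split_generators() cannot know which split was requested.
    builder.download_and_prepare()
    # Step 2: only now is the split name handed to the builder.
    return builder.as_dataset(split=split)
```

Because the split name first appears in the second step, a custom `_split_generators()` has no way to skip the files of the other splits.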

If I'm not missing something, this seems like bad design, for the following use case:

Imagine a huge dataset that includes an evaluation test set, and you want to download only that test set and run on it to compare your method.

Is there a current workaround that can help me achieve the same result?

Thank you,

ysig avatar Dec 22 '23 16:12 ysig

The only way right now is to load with streaming=True.
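A minimal sketch of that workaround, assuming `load_dataset` accepts `split="test"` together with `streaming=True`. The helper name `load_test_split_streaming` and its `loader` parameter are hypothetical, introduced only to keep the example testable without network access.

```python
from typing import Any, Callable, Optional


def load_test_split_streaming(
    path: str,
    loader: Optional[Callable[..., Any]] = None,
    **kwargs: Any,
) -> Any:
    """Return only the test split as a streamed dataset (hypothetical helper)."""
    if loader is None:
        # Imported lazily so the sketch can be read without the library installed.
        from datasets import load_dataset

        loader = load_dataset
    # With streaming=True, examples are read on the fly while iterating,
    # instead of every split being downloaded and prepared up front.
    return loader(path, split="test", streaming=True, **kwargs)


# Example usage (requires Internet access; the repo id is a placeholder):
# ds = load_test_split_streaming("user/huge_dataset")
# for example in ds.take(5):
#     print(example)
```

The result is an iterable dataset, so it is consumed by iteration rather than random indexing.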

lhoestq avatar Dec 22 '23 20:12 lhoestq

This feature has been proposed for a long time. I'm looking forward to the implementation. On clusters streaming=True is not an option since we do not have Internet on compute nodes. See: https://github.com/huggingface/datasets/discussions/1896#discussioncomment-2359593

why-in-Shanghaitech avatar Feb 02 '24 00:02 why-in-Shanghaitech