datasets: Support downloading specific splits in `load_dataset`

This PR builds on https://github.com/huggingface/datasets/pull/6639 to support downloading only the specified splits in `load_dataset`. For this to work, a builder's `_split_generators` needs to accept the requested splits (as a list) via a `splits` argument, so that it can skip processing the non-requested ones. The builder also has to define an `_available_splits` method that lists all possible split values.
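A rough, dependency-free sketch of that contract, for illustration only. `ToyBuilder`, `SPLIT_URLS`, and `_fake_download` are hypothetical names and not part of `datasets`; the real builder would go through its download manager instead, and the exact signatures in this PR may differ.

```python
# Hypothetical stand-in for a dataset builder; it does not import `datasets`.
SPLIT_URLS = {
    "train": "data/train.tar.gz",
    "validation": "data/val.tar.gz",
    "test": "data/test.tar.gz",
}


class ToyBuilder:
    def __init__(self):
        self.downloaded = []  # records every file that was "downloaded"

    def _available_splits(self):
        # Lists every split this builder is able to produce.
        return list(SPLIT_URLS)

    def _fake_download(self, url):
        # Assumption: stands in for the real download step.
        self.downloaded.append(url)
        return f"/cache/{url}"

    def _split_generators(self, splits=None):
        # No explicit request means "all available splits".
        requested = self._available_splits() if splits is None else list(splits)
        unknown = [name for name in requested if name not in SPLIT_URLS]
        if unknown:
            raise ValueError(f"Unknown splits: {unknown}")
        # Only the files of the requested splits are fetched.
        return {name: self._fake_download(SPLIT_URLS[name]) for name in requested}


builder = ToyBuilder()
generators = builder._split_generators(splits=["validation"])
print(generators)
print(builder.downloaded)
```

The point of the sketch is that the split filter is applied before any download happens, so non-requested archives are never touched.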

Close https://github.com/huggingface/datasets/issues/4101, close https://github.com/huggingface/datasets/issues/2538 (I'm probably missing some)

Should also make it possible to address https://github.com/huggingface/datasets/issues/6793

Apr 23 '24 12:04 mariosasko

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Apr 23 '24 12:04 HuggingFaceDocBuilderDev

Friendly ping on this! This feature would be really helpful and useful to me (and likely others with limited download speed and storage space!). Thanks so much!

Aug 19 '24 14:08 BlackHC

No one is working on this atm afaik :/

Aug 19 '24 15:08 lhoestq

No worries! I've patched the ImageNet dataset in: https://huggingface.co/datasets/ILSVRC/imagenet-1k/blob/refs%2Fpr%2F20/imagenet-1k.py

Together with:

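from datasets import DownloadConfig, VerificationMode, load_dataset

dataset = load_dataset(
    "ILSVRC/imagenet-1k",
    split="validation",
    # Only the validation archive is listed here.
    data_files={"val": "data/val_images.tar.gz"},
    revision="refs/pr/20",  # the patched loading script
    trust_remote_code=True,
    download_config=DownloadConfig(resume_download=True),
    verification_mode=VerificationMode.NO_CHECKS,  # skips the verification checks
)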

It only downloads the validation set this way (NO_CHECKS is a bit annoying because I'd rather have md5 checks, but I guess I can't have everything) ^^' The patch is not perfect, but it does the job.
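If the md5 check matters, one option is to verify the downloaded archive by hand. Below is a minimal sketch using only the standard library; `md5_of_file` and `verify_md5` are hypothetical helpers (not part of `datasets`), and the expected digest would have to come from the dataset maintainers.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected_md5):
    """Raise ValueError if the file's MD5 digest differs from the expected hex string."""
    actual = md5_of_file(path)
    if actual != expected_md5.lower():
        raise ValueError(f"MD5 mismatch for {path}: expected {expected_md5}, got {actual}")
    return actual


# Demo on a throwaway file; a real check would point at the downloaded archive
# and use a digest published for that archive.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"hello world")
    expected = hashlib.md5(b"hello world").hexdigest()
    demo_digest = verify_md5(sample, expected)
```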

Aug 19 '24 15:08 BlackHC
