Support downloading specific splits in `load_dataset`

Open mariosasko opened this issue 1 year ago • 4 comments

This PR builds on https://github.com/huggingface/datasets/pull/6639 to support downloading only the specified splits in `load_dataset`. For this to work, a builder's `_split_generators` needs to accept the requested splits (as a list) via a `splits` argument so that it can skip the non-requested ones. The builder also has to define an `_available_splits` method that lists all possible split values.
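A rough sketch of the idea, assuming a simplified toy builder rather than the real `datasets.DatasetBuilder`: `ToyBuilder`, `_URLS`, and `fake_download` are hypothetical names for illustration, and the real `_split_generators` takes a download manager and returns `SplitGenerator` objects, not a plain dict.

```python
# Illustrative sketch only: a toy builder, not the real datasets.DatasetBuilder.
from typing import Callable, Dict, List, Optional

# Hypothetical mapping from split name to the archive that holds it.
_URLS: Dict[str, str] = {
    "train": "data/train.tar.gz",
    "validation": "data/val.tar.gz",
    "test": "data/test.tar.gz",
}


class ToyBuilder:
    def _available_splits(self) -> List[str]:
        # Every split this builder is able to produce.
        return list(_URLS)

    def _split_generators(
        self,
        download: Callable[[str], str],
        splits: Optional[List[str]] = None,
    ) -> Dict[str, str]:
        # With no explicit request, fall back to all available splits.
        requested = self._available_splits() if splits is None else splits
        unknown = sorted(set(requested) - set(self._available_splits()))
        if unknown:
            raise ValueError(f"Unknown splits: {unknown}")
        # Only the requested splits are downloaded; the rest are never touched.
        return {name: download(_URLS[name]) for name in requested}


downloaded: List[str] = []


def fake_download(url: str) -> str:
    # Stand-in for a download manager: records the URL, returns a fake local path.
    downloaded.append(url)
    return f"/cache/{url}"


generators = ToyBuilder()._split_generators(fake_download, splits=["validation"])
```

Here only the validation archive passes through `fake_download`; the train and test archives are never requested.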

Close https://github.com/huggingface/datasets/issues/4101, close https://github.com/huggingface/datasets/issues/2538 (I'm probably missing some)

Should also make it possible to address https://github.com/huggingface/datasets/issues/6793

mariosasko, Apr 23 '24 12:04

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Friendly ping on this! This feature would be really helpful and useful to me (and likely others with limited download speed and storage space!). Thanks so much!

BlackHC, Aug 19 '24 14:08

No one is working on this atm afaik :/

lhoestq, Aug 19 '24 15:08

No worries! I've patched the ImageNet dataset in: https://huggingface.co/datasets/ILSVRC/imagenet-1k/blob/refs%2Fpr%2F20/imagenet-1k.py

Together with:

from datasets import DownloadConfig, VerificationMode, load_dataset

dataset = load_dataset(
    "ILSVRC/imagenet-1k",
    split="validation",
    data_files={"val": "data/val_images.tar.gz"},
    revision="refs/pr/20",  # the patched loading script
    trust_remote_code=True,
    download_config=DownloadConfig(resume_download=True),
    verification_mode=VerificationMode.NO_CHECKS,  # skips checksum verification
)

This way it only downloads the validation set (`NO_CHECKS` is a bit annoying because I'd rather have MD5 checks, but I guess I can't have everything) ^^' The patch is not perfect, but it does the job.

BlackHC, Aug 19 '24 15:08