Support downloading specific splits in `load_dataset`
This PR builds on https://github.com/huggingface/datasets/pull/6639 to support downloading only the specified splits in `load_dataset`. For this to work, a builder's `_split_generators` needs to accept the requested splits (as a list) via a `splits` argument, so that it can avoid processing the non-requested ones. The builder also has to define an `_available_splits` method that lists all possible split values.
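To make the contract described above more concrete, here is a minimal sketch in plain Python. It does not use the real `datasets` classes: `ToyBuilder`, `RecordingDownloadManager`, the `_URLS` mapping, and the archive paths are all hypothetical stand-ins. Only the names `_split_generators`, `splits`, and `_available_splits` come from this PR's description, and the actual signatures and return types in the PR may differ.

```python
from typing import Dict, List, Optional


class RecordingDownloadManager:
    """Toy stand-in for a download manager; it only records what was requested."""

    def __init__(self) -> None:
        self.downloaded: List[str] = []

    def download(self, url: str) -> str:
        # Pretend to download and return a fake local cache path.
        self.downloaded.append(url)
        return f"/cache/{url}"


class ToyBuilder:
    """Hypothetical builder illustrating the idea; not `datasets.DatasetBuilder`."""

    # Hypothetical mapping from split name to the archive holding that split.
    _URLS: Dict[str, str] = {
        "train": "data/train.tar.gz",
        "validation": "data/val.tar.gz",
        "test": "data/test.tar.gz",
    }

    def _available_splits(self) -> List[str]:
        # Lists every split this builder could produce.
        return list(self._URLS)

    def _split_generators(
        self, dl_manager: RecordingDownloadManager, splits: Optional[List[str]] = None
    ) -> Dict[str, str]:
        # With no explicit request, fall back to every available split.
        requested = self._available_splits() if splits is None else splits
        unknown = [name for name in requested if name not in self._URLS]
        if unknown:
            raise ValueError(f"Unknown splits {unknown}; available: {self._available_splits()}")
        # Only the requested archives are ever passed to the download manager.
        return {name: dl_manager.download(self._URLS[name]) for name in requested}


dl_manager = RecordingDownloadManager()
paths = ToyBuilder()._split_generators(dl_manager, splits=["validation"])
print(dl_manager.downloaded)
```

The point of the sketch is that filtering happens before any download call, so the archives of non-requested splits are never touched.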
Close https://github.com/huggingface/datasets/issues/4101, close https://github.com/huggingface/datasets/issues/2538 (I'm probably missing some)
Should also make it possible to address https://github.com/huggingface/datasets/issues/6793
Friendly ping on this! This feature would be really helpful and useful to me (and likely others with limited download speed and storage space!). Thanks so much!
No one is working on this atm afaik :/
No worries! I've patched the ImageNet dataset in: https://huggingface.co/datasets/ILSVRC/imagenet-1k/blob/refs%2Fpr%2F20/imagenet-1k.py
Together with:
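```python
from datasets import DownloadConfig, VerificationMode, load_dataset

dataset = load_dataset(
    "ILSVRC/imagenet-1k",
    split="validation",
    # Restrict the patched script to the validation archive only.
    data_files={"val": "data/val_images.tar.gz"},
    revision="refs/pr/20",
    trust_remote_code=True,
    download_config=DownloadConfig(resume_download=True),
    verification_mode=VerificationMode.NO_CHECKS,
)
```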
It only downloads the validation set this way (NO_CHECKS is a bit annoying because I'd rather have md5 checks, but I guess I can't have everything) ^^' The patch is not perfect, but it does the job.