Feature request: IterableDataset.push_to_hub

Feature request

It'd be great to have a lazy push to hub, similar to the lazy loading we have with IterableDataset.

Suppose you'd like to filter LAION based on certain conditions, but as LAION doesn't fit on your disk, you'd like to leverage streaming:

from datasets import load_dataset

dataset = load_dataset("laion/laion400m", streaming=True, split="train")

Then you could filter the dataset based on certain conditions:

filtered_dataset = dataset.filter(lambda example: example['HEIGHT'] > 400)

In order to persist this dataset and push it back to the hub, one currently needs to first materialize the entire filtered dataset on disk and then push:

from datasets import Dataset

Dataset.from_generator(filtered_dataset.__iter__).push_to_hub(...)

It would be great if we could instead lazily push the data to the hub (basically stream the data to the hub), without being limited by our disk size:

filtered_dataset.push_to_hub("my-filtered-dataset")

Motivation

This feature would be very useful for people that want to filter huge datasets without having to load the entire dataset or a filtered version thereof on their local disk.

Your contribution

Happy to test out a PR :)

NielsRogge avatar Mar 23 '23 09:03 NielsRogge

+1

ducha-aiki avatar May 31 '24 15:05 ducha-aiki

+1

phineas-pta avatar Jul 07 '24 07:07 phineas-pta

+1, should be possible now? :) https://huggingface.co/blog/xethub-joins-hf

Jourdelune avatar Aug 20 '24 19:08 Jourdelune

Haha, we're working hard to integrate Xet into the HF back-end; it will enable cool use cases :)

Anyway, about IterableDataset.push_to_hub: I'd be happy to provide guidance and answer questions if anyone wants to start a first simple implementation of this.

lhoestq avatar Aug 21 '24 15:08 lhoestq

+1

meg-huggingface avatar Aug 29 '24 23:08 meg-huggingface

+1

pkoperek avatar Nov 15 '24 13:11 pkoperek

+1

girivad avatar Jan 30 '25 05:01 girivad

+1

Currently running into this when filtering Common Corpus for Dutch entries.

Extra points for somehow making it resumable on error. 11 TB is a lot of data to stream on a home connection without encountering any sort of errors along the way.

Rijgersberg avatar Feb 13 '25 06:02 Rijgersberg

If it can help, IterableDataset already implements .state_dict() and .load_state_dict(), which you can use to resume a stream.
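
For example, a minimal resume sketch (the filter reuses the condition from the original request; the checkpoint interval is illustrative, and persisting the checkpoint across process restarts, e.g. as JSON, is left out):

from datasets import load_dataset

dataset = load_dataset("laion/laion400m", streaming=True, split="train")
filtered_dataset = dataset.filter(lambda example: example['HEIGHT'] > 400)

checkpoint = None
for idx, example in enumerate(filtered_dataset):
    ...  # process/upload the example
    if idx % 10_000 == 0:
        checkpoint = filtered_dataset.state_dict()  # snapshot of the stream position

# after an error, restore the last snapshot; iterating again resumes from there
if checkpoint is not None:
    filtered_dataset.load_state_dict(checkpoint)
for example in filtered_dataset:
    ...  # continue where the checkpoint left off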

lhoestq avatar Feb 13 '25 10:02 lhoestq

+1

tolgadur avatar Mar 12 '25 13:03 tolgadur

+1

andstor avatar Apr 05 '25 15:04 andstor

+1

Peter-Devine avatar May 15 '25 14:05 Peter-Devine

Just added a first implementation for IterableDataset.push_to_hub() :)

I'll do a new release soon; in the meantime, feel free to install datasets from source to try it out!
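
For anyone who wants to try it before the release, usage should mirror the snippet from the original request (a sketch; the repo id is illustrative):

from datasets import load_dataset

dataset = load_dataset("laion/laion400m", streaming=True, split="train")
filtered_dataset = dataset.filter(lambda example: example['HEIGHT'] > 400)

# streams the filtered examples to the hub as they are produced,
# without materializing the full filtered dataset on local disk
filtered_dataset.push_to_hub("my-username/my-filtered-dataset")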

lhoestq avatar Jun 06 '25 16:06 lhoestq