Quentin Lhoest


That makes sense @athewsey, thanks for the suggestion :) Maybe instead of `to_disk` we could simply have `save_to_disk`:

```python
streaming_dataset.save_to_disk("path/to/my/dataset/dir")
on_disk_dataset = load_from_disk("path/to/my/dataset/dir")
in_memory_dataset = Dataset.from_list(list(streaming_dataset.take(100)))
# ...
```

So far, `IterableDataset.filter()` and `Dataset.to_iterable_dataset()` are implemented. Still missing: `IterableDataset.push_to_hub()` - though there is a hack to write to disk and then push to the Hub using

```python
ds_on_disk = Dataset.from_generator(streaming_ds.__iter__)
...
```

In the general case you can't always reduce the amount of data to download, since you can't parse CSV or JSON data without downloading the whole files, right? ^^...

Passing `columns=[...]` to `load_dataset()` in streaming mode does work if the dataset is in Parquet format, but for other formats it's either not possible or not implemented
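
For illustration, a minimal sketch of this for a Parquet-backed dataset (the repository name and column names below are placeholders):

```python
from datasets import load_dataset

# Hypothetical Parquet-backed dataset: in streaming mode, `columns=` is
# forwarded to the Parquet reader so only the selected columns are read.
ds = load_dataset(
    "username/my_parquet_dataset",
    split="train",
    streaming=True,
    columns=["id", "text"],
)
print(next(iter(ds)))
```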

It only works with `streaming=True`. When not streaming, it downloads the full files locally before reading the data

Another option could be to use `pa.large_binary` instead of `pa.binary` in certain cases?
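
For context, a minimal sketch of the difference between the two Arrow types (illustration only, not the actual change in `datasets`): `pa.binary()` uses 32-bit offsets, so a single array is limited to roughly 2 GB of binary data, while `pa.large_binary()` uses 64-bit offsets.

```python
import pyarrow as pa

# binary() stores offsets as int32 (~2GB of data per array),
# large_binary() stores offsets as int64.
small = pa.array([b"hello", b"world"], type=pa.binary())
large = pa.array([b"hello", b"world"], type=pa.large_binary())
print(small.type)  # binary
print(large.type)  # large_binary
```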

As you prefer, just take into account that breaking changes might happen in major versions (not sure what kind for hfh, but for `datasets` it may include the VideoFrame type...

The dataset looks fine as a ZIP; maybe we could optimize the data format inference so that it doesn't have to iterate over every single zip file. We can decide on...

With datasets-server we'll store all the datasets as parquet, so you'll be able to use duckdb on every dataset, streaming the result from the remote parquet files
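
For illustration, a minimal sketch of querying a remote Parquet file with DuckDB without downloading it entirely (the URL below is a placeholder for wherever the exported Parquet files are hosted):

```python
import duckdb

# DuckDB's httpfs extension lets read_parquet() stream over HTTPS.
con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

# Placeholder URL for an exported Parquet file on the Hub.
url = "https://huggingface.co/datasets/username/dataset/resolve/main/data.parquet"
count = con.execute(f"SELECT COUNT(*) FROM read_parquet('{url}')").fetchall()
print(count)
```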

It would be nice to stream datasets from HF using Streaming, e.g. supporting [hf://](https://huggingface.co/docs/huggingface_hub/guides/hf_file_system) paths
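
For reference, a minimal sketch of what `hf://` paths look like via `huggingface_hub`'s `HfFileSystem` (the repository and file names below are placeholders):

```python
import fsspec
from huggingface_hub import HfFileSystem

# List files in a (hypothetical) dataset repository on the Hub.
fs = HfFileSystem()
print(fs.ls("datasets/username/dataset", detail=False))

# hf:// URLs also work directly with fsspec once huggingface_hub is installed.
with fsspec.open("hf://datasets/username/dataset/data/train.csv", "r") as f:
    print(f.readline())
```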