Quentin Lhoest


That makes sense @athewsey, thanks for the suggestion :) Maybe instead of `to_disk` we could simply have `save_to_disk`:

```python
streaming_dataset.save_to_disk("path/to/my/dataset/dir")
on_disk_dataset = load_from_disk("path/to/my/dataset/dir")
in_memory_dataset = Dataset.from_list(list(streaming_dataset.take(100)))
# ...
```

So far, `IterableDataset.filter()` and `Dataset.to_iterable_dataset()` are implemented. Still missing: `IterableDataset.push_to_hub()` - though there is a hack to write to disk and then push to the Hub using

```python
ds_on_disk = Dataset.from_generator(streaming_ds.__iter__)
...
```

In the general case you can't always reduce the amount of data to download, since you can't parse CSV or JSON data without downloading the whole files, right? ^^...

Passing `columns=[...]` to `load_dataset()` in streaming mode does work if the dataset is in Parquet format, but for other formats it's either not possible or not implemented
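
For illustration, a minimal sketch of this for a Parquet-backed dataset (the repository name and column names below are placeholders):

```python
from datasets import load_dataset

# Hypothetical Parquet-backed dataset: in streaming mode, `columns=` is
# forwarded to the Parquet reader so only the selected columns are read.
ds = load_dataset(
    "username/my_parquet_dataset",
    split="train",
    streaming=True,
    columns=["id", "text"],
)
print(next(iter(ds)))
```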

It only works with `streaming=True`. When not streaming, it downloads the full files locally before reading the data

Another option could be to use `pa.large_binary` instead of `pa.binary` in certain cases?
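
For context, a minimal sketch of the difference between the two Arrow types (illustration only, not the actual change in `datasets`): `pa.binary()` uses 32-bit offsets, so a single array is limited to roughly 2 GB of binary data, while `pa.large_binary()` uses 64-bit offsets.

```python
import pyarrow as pa

# binary() stores offsets as int32 (~2GB of data per array),
# large_binary() stores offsets as int64.
small = pa.array([b"hello", b"world"], type=pa.binary())
large = pa.array([b"hello", b"world"], type=pa.large_binary())
print(small.type)  # binary
print(large.type)  # large_binary
```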

As you prefer, just take into account that breaking changes might happen in major versions (not sure what kind for hfh, but for `datasets` it may include the VideoFrame type...

The dataset looks fine as a ZIP; maybe we could optimize the data format inference so that it doesn't have to iterate over every single zip file. We can decide on...

With datasets-server we'll store all the datasets as parquet, so you'll be able to use duckdb on every dataset, streaming the result from the remote parquet files
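
For illustration, a minimal sketch of querying a remote Parquet file with DuckDB without downloading it entirely (the URL below is a placeholder for wherever the exported Parquet files are hosted):

```python
import duckdb

# DuckDB's httpfs extension lets read_parquet() stream over HTTPS.
con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

# Placeholder URL for an exported Parquet file on the Hub.
url = "https://huggingface.co/datasets/username/dataset/resolve/main/data.parquet"
count = con.execute(f"SELECT COUNT(*) FROM read_parquet('{url}')").fetchall()
print(count)
```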

It would be nice to stream datasets from HF using Streaming, e.g. supporting [hf://](https://huggingface.co/docs/huggingface_hub/guides/hf_file_system) paths
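
For reference, a minimal sketch of what `hf://` paths look like via `huggingface_hub`'s `HfFileSystem` (the repository and file names below are placeholders):

```python
import fsspec
from huggingface_hub import HfFileSystem

# List files in a (hypothetical) dataset repository on the Hub.
fs = HfFileSystem()
print(fs.ls("datasets/username/dataset", detail=False))

# hf:// URLs also work directly with fsspec once huggingface_hub is installed.
with fsspec.open("hf://datasets/username/dataset/data/train.csv", "r") as f:
    print(f.readline())
```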