Quentin Lhoest
I'm closing this one for now, but feel free to reopen if you encounter other memory issues with audio datasets
Hi ! Note that it does `block_size *= 2` until `block_size > len(batch)`, so it doesn't loop indefinitely. What do you mean by "get stuck indefinitely" then ? Is this...
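For context, here is a minimal sketch of that retry loop as described in the comment (the function name, starting block size, and error handling are illustrative, not the library's actual code):

```python
import io

import pyarrow as pa
import pyarrow.json as paj


def read_json_batch(batch: bytes, block_size: int = 64 << 10) -> pa.Table:
    """Double block_size on parse errors until the read succeeds,
    or until block_size exceeds the batch length, so it can't loop forever."""
    while True:
        try:
            return paj.read_json(
                io.BytesIO(batch),
                read_options=paj.ReadOptions(block_size=block_size),
            )
        except pa.ArrowInvalid:
            if block_size > len(batch):
                # the whole batch already fits in one block and still fails: give up
                raise
            block_size *= 2
```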
I think your problem is unrelated to this issue @memray. Indeed, this issue discusses a bug when doing `load_dataset`, while your case has to do with the dataloader...
Added some docs for Dask and took your comments into account. cc @philschmid if you also want to take a look :)
Just noticed that it would be more convenient to pass the output dir to `download_and_prepare` directly, to bypass the caching logic which prepares the dataset at `////`. And this way...
Alright I did the last change I wanted to do, here is the final API:

```python
builder = load_dataset_builder(...)
builder.download_and_prepare("s3://...", storage_options={"token": ...})
```

and it creates the arrow files directly...
Totally agree with your comment on the meaning of "loading", I'll update the docs.
I took your comments into account and reverted all the changes related to `cache_dir` to keep the support for remote `cache_dir` for beam datasets. I also updated the wording in...
Hi ! According to https://huggingface.co/docs/datasets/use_with_pytorch#stream-data you can use the PyTorch DataLoader with `num_workers>0` to distribute the shards across your workers (it uses `torch.utils.data.get_worker_info()` to get the worker ID and select...
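For reference, a minimal sketch of that setup (the dataset name and batch size are placeholders; `.with_format("torch")` assumes a recent `datasets` version):

```python
from datasets import load_dataset
from torch.utils.data import DataLoader

# Stream a sharded dataset; "c4"/"en" is just an example, use your own dataset
ds = load_dataset("c4", "en", split="train", streaming=True).with_format("torch")

# With num_workers > 0, each worker calls torch.utils.data.get_worker_info()
# internally and only reads the shards assigned to it
dataloader = DataLoader(ds, batch_size=32, num_workers=4)

for batch in dataloader:
    ...  # training step
```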
Hi ! As mentioned on the [forum](https://discuss.huggingface.co/t/how-to-wrap-a-generator-with-hf-dataset/18464), the simplest for now would be to define a [dataset script](https://huggingface.co/docs/datasets/dataset_script) which can contain your generator. But we can also explore adding something...
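To illustrate, a minimal dataset script sketch that wraps a generator (the class name, feature schema, and example data are placeholders):

```python
import datasets


class MyGeneratorDataset(datasets.GeneratorBasedBuilder):
    """Minimal dataset script wrapping a Python generator."""

    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features({"text": datasets.Value("string")}),
        )

    def _split_generators(self, dl_manager):
        # nothing to download: the data comes from the generator below
        return [datasets.SplitGenerator(name=datasets.Split.TRAIN)]

    def _generate_examples(self):
        # replace this loop with your own generator logic
        for idx, text in enumerate(["first example", "second example"]):
            yield idx, {"text": text}
```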