Quentin Lhoest

416 comments by Quentin Lhoest

I'm closing this one for now, but feel free to reopen if you encounter other memory issues with audio datasets.

Hi! Note that it does `block_size *= 2` until `block_size > len(batch)`, so it doesn't loop indefinitely. What do you mean by "get stuck indefinitely" then? Is this...
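
For context, here is a self-contained sketch of that retry logic (not the library's actual code; `read_with_retries` and `try_read` are made-up names): the block size is doubled after each failure, and retries stop once `block_size > len(batch)`, so the loop always terminates.

```python
def try_read(batch: bytes, block_size: int) -> bytes:
    # Stand-in parser that only succeeds once the block covers the whole batch.
    if block_size < len(batch):
        raise ValueError("block too small")
    return batch


def read_with_retries(batch: bytes, block_size: int = 16) -> bytes:
    # Double the block size on failure; stop retrying once block_size > len(batch).
    while True:
        try:
            return try_read(batch, block_size)
        except ValueError:
            if block_size > len(batch):
                raise  # the block already covers the batch, so growing won't help
            block_size *= 2
```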

I couldn't. I think your problem is unrelated to this issue, @memray. Indeed, this issue discusses a bug when doing `load_dataset`, while your case has to do with the dataloader...

Added some docs for Dask and took your comments into account. cc @philschmid if you also want to take a look :)

Just noticed that it would be more convenient to pass the output dir to `download_and_prepare` directly, to bypass the caching logic which prepares the dataset at `////`. And this way...
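
As a rough sketch of that idea (the dataset name and output path below are placeholders, not from the PR):

```python
from datasets import load_dataset_builder

# Prepare the dataset directly into a chosen directory instead of the default
# cache location; "imdb" and "./imdb-prepared" are only example values.
builder = load_dataset_builder("imdb")
builder.download_and_prepare(output_dir="./imdb-prepared")
```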

Alright I did the last change I wanted to do, here is the final API:

```python
builder = load_dataset_builder(...)
builder.download_and_prepare("s3://...", storage_options={"token": ...})
```

and it creates the arrow files directly...

Totally agree with your comment on the meaning of "loading"; I'll update the docs.

I took your comments into account and reverted all the changes related to `cache_dir` to keep the support for remote `cache_dir` for beam datasets. I also updated the wording in...

Hi! According to https://huggingface.co/docs/datasets/use_with_pytorch#stream-data you can use the PyTorch DataLoader with `num_workers>0` to distribute the shards across your workers (it uses `torch.utils.data.get_worker_info()` to get the worker ID and select...
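
Following the linked docs, a minimal sketch of that setup (the dataset name, batch size, and worker count are just example values):

```python
from datasets import load_dataset
from torch.utils.data import DataLoader

# Stream the dataset and let each DataLoader worker read its own subset of shards.
ds = load_dataset("c4", "en", split="train", streaming=True)
ds = ds.with_format("torch")
dataloader = DataLoader(ds, batch_size=32, num_workers=4)

for batch in dataloader:
    ...  # training step goes here
    break
```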

Hi! As mentioned on the [forum](https://discuss.huggingface.co/t/how-to-wrap-a-generator-with-hf-dataset/18464), the simplest approach for now would be to define a [dataset script](https://huggingface.co/docs/datasets/dataset_script) which can contain your generator. But we can also explore adding something...
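
For reference, a minimal sketch of such a dataset script wrapping a generator (the builder name, feature schema, and `my_generator` are placeholders):

```python
import datasets


def my_generator():
    # Placeholder for the user's own generator of examples.
    for i in range(10):
        yield {"text": f"example {i}"}


class MyDataset(datasets.GeneratorBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features({"text": datasets.Value("string")})
        )

    def _split_generators(self, dl_manager):
        return [datasets.SplitGenerator(name=datasets.Split.TRAIN)]

    def _generate_examples(self):
        # Yield (key, example) pairs as expected by the datasets library.
        for idx, example in enumerate(my_generator()):
            yield idx, example
```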