Quentin Lhoest
I'm closing this one for now, but feel free to reopen if you encounter other memory issues with audio datasets
Hi ! Note that it does `block_size *= 2` until `block_size > len(batch)`, so it doesn't loop indefinitely. What do you mean by "get stuck indefinitely" then ? Is this...
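For context, here is a minimal sketch of that retry loop as described in the comment (the function name, starting block size, and error handling are illustrative, not the library's actual code):

```python
import io

import pyarrow as pa
import pyarrow.json as paj


def read_json_batch(batch: bytes, block_size: int = 64 << 10) -> pa.Table:
    """Double block_size on parse errors until the read succeeds,
    or until block_size exceeds the batch length, so it can't loop forever."""
    while True:
        try:
            return paj.read_json(
                io.BytesIO(batch),
                read_options=paj.ReadOptions(block_size=block_size),
            )
        except pa.ArrowInvalid:
            if block_size > len(batch):
                # the whole batch already fits in one block and still fails: give up
                raise
            block_size *= 2
```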
I think your problem is unrelated to this issue @memray. Indeed, this issue discusses a bug when doing `load_dataset`, while your case has to do with the dataloader...
Added some docs for Dask and took your comments into account. cc @philschmid if you also want to take a look :)
Just noticed that it would be more convenient to pass the output dir to `download_and_prepare` directly, to bypass the caching logic which prepares the dataset at `////`. And this way...
Alright I did the last change I wanted to do, here is the final API:

```python
builder = load_dataset_builder(...)
builder.download_and_prepare("s3://...", storage_options={"token": ...})
```

and it creates the arrow files directly...
Totally agree with your comment on the meaning of "loading", I'll update the docs.
I took your comments into account and reverted all the changes related to `cache_dir` to keep the support for remote `cache_dir` for beam datasets. I also updated the wording in...
Hi ! According to https://huggingface.co/docs/datasets/use_with_pytorch#stream-data you can use the PyTorch DataLoader with `num_workers>0` to distribute the shards across your workers (it uses `torch.utils.data.get_worker_info()` to get the worker ID and select...
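For reference, a minimal sketch of that setup (the dataset name and batch size are placeholders; `.with_format("torch")` assumes a recent `datasets` version):

```python
from datasets import load_dataset
from torch.utils.data import DataLoader

# Stream a sharded dataset; "c4"/"en" is just an example, use your own dataset
ds = load_dataset("c4", "en", split="train", streaming=True).with_format("torch")

# With num_workers > 0, each worker calls torch.utils.data.get_worker_info()
# internally and only reads the shards assigned to it
dataloader = DataLoader(ds, batch_size=32, num_workers=4)

for batch in dataloader:
    ...  # training step
```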
Hi ! As mentioned on the [forum](https://discuss.huggingface.co/t/how-to-wrap-a-generator-with-hf-dataset/18464), the simplest for now would be to define a [dataset script](https://huggingface.co/docs/datasets/dataset_script) which can contain your generator. But we can also explore adding something...
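To illustrate, a minimal dataset script sketch that wraps a generator (the class name, feature schema, and example data are placeholders):

```python
import datasets


class MyGeneratorDataset(datasets.GeneratorBasedBuilder):
    """Minimal dataset script wrapping a Python generator."""

    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features({"text": datasets.Value("string")}),
        )

    def _split_generators(self, dl_manager):
        # nothing to download: the data comes from the generator below
        return [datasets.SplitGenerator(name=datasets.Split.TRAIN)]

    def _generate_examples(self):
        # replace this loop with your own generator logic
        for idx, text in enumerate(["first example", "second example"]):
            yield idx, {"text": text}
```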