Quentin Lhoest

Results 453 comments of Quentin Lhoest

Since https://github.com/huggingface/datasets/pull/3736, the Common Voice dataset gives access to the local audio files again, as before

Hi ! `ds.sort()` does sort the full dataset, not just one column:

```python
from datasets import Dataset

ds = Dataset.from_dict({"foo": [3, 2, 1], "bar": ["c", "b", "a"]})
print(ds.sort("foo").to_pandas())
#    foo bar
# 0    1   a
# 1    2   b
# 2    3   c
```

That's unexpected. Can you share the code you used to get this?

Oh you're right. Calling `load_dataset` on the modified script without having the files that come with it is not ideal. I agree it should be `git clone` instead - and...

Hi ! Could it be because you need to free the memory used by `tarfile` by emptying the tar `members` by any chance ?

```python
yield key, {"audio": {"path": audio_name,...
```

I also ran out of memory when loading `mozilla-foundation/common_voice_8_0`, which also uses `tarfile` via `dl_manager.iter_archive`. Some data files seem to stay in memory somewhere I don't have...

For video datasets I think you can just define the maximum number of videos that can stay in memory by adding this class attribute to your dataset builder:

```py
DEFAULT_WRITER_BATCH_SIZE...
```
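To illustrate the mechanism with a toy stand-in (this is not the `datasets` Arrow writer itself): the writer buffers examples and flushes them to disk once the buffer reaches the batch size, so only that many decoded examples are ever held in RAM at once.

```python
class BatchedWriter:
    """Toy stand-in for a batched dataset writer: buffers examples and
    flushes every `batch_size` items, so at most `batch_size` decoded
    examples (e.g. videos) are in memory at any time."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.buffer = []
        self.flushed = []  # stands in for data written to disk
        self.max_buffered = 0

    def write(self, example):
        self.buffer.append(example)
        self.max_buffered = max(self.max_buffered, len(self.buffer))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        self.flushed.extend(self.buffer)
        self.buffer = []
```

A smaller batch size trades write throughput for a lower peak memory footprint, which is why it matters for large items like videos.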

> I'll add that I'm encountering the same issue with
> `load_dataset('wikipedia', 'ceb', runner='DirectRunner', split='train')`.
> Same for 'es' in place of 'ceb'.

This is because the Apache Beam `DirectRunner`...

> Fair enough, but this line of code crashed an AWS instance with 1024GB of RAM!

What, wikipedia is not even bigger than 20GB. cc @albertvillanova

I found the issue with Common Voice 8 and opened a PR to fix it: https://huggingface.co/datasets/mozilla-foundation/common_voice_8_0/discussions/2

Basically the `metadata` dict that contains the transcripts per audio file was continuously getting...
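A minimal sketch of that kind of fix (the names are hypothetical, not the actual Common Voice script): pop each transcript out of the shared `metadata` dict once its example has been yielded, so the dict shrinks as generation proceeds instead of keeping every transcript in memory for the whole pass.

```python
def generate_examples(audio_names, metadata):
    # `metadata` maps audio file name -> transcript. Popping each entry
    # after use frees it, instead of keeping the whole dict alive for
    # the entire generation pass.
    for key, audio_name in enumerate(audio_names):
        transcript = metadata.pop(audio_name)
        yield key, {"path": audio_name, "sentence": transcript}
```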