
`load_dataset` consumes too much memory for audio + tar archives

Open JFCeron opened this issue 2 years ago • 17 comments

Description

load_dataset consumes more and more memory until the process is killed, even though the loading script is built around a generator. I'm adding a loading script for a new dataset made up of ~15s audio clips stored in a tar file. I tried setting DEFAULT_WRITER_BATCH_SIZE = 1 as per the discussion in #741, but the problem persists.

Steps to reproduce the bug

Here's my implementation of _generate_examples:

class MyDatasetBuilder(datasets.GeneratorBasedBuilder):
    DEFAULT_WRITER_BATCH_SIZE = 1
    ...

    def _split_generators(self, dl_manager):
        archive_path = dl_manager.download(_DL_URLS[self.config.name])
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={
                    "audio_tarfile_path": archive_path["audio_tarfile"]
                },
            ),
        ]
    
    def _generate_examples(self, audio_tarfile_path):
        key = 0
        with tarfile.open(audio_tarfile_path, mode="r|") as audio_tarfile:
            for audio_tarinfo in audio_tarfile:
                audio_name = audio_tarinfo.name
                audio_file_obj = audio_tarfile.extractfile(audio_tarinfo)
                yield key, {"audio": {"path": audio_name, "bytes": audio_file_obj.read()}}
                key += 1

I then try to load via ds = load_dataset('./datasets/my_new_dataset', writer_batch_size=1), and memory usage grows until all 8GB of my machine are taken and the process is killed (Killed). I also tried an untarred version of the dataset, iterating with os.walk, but the same thing happened.

I wrote a standalone script to confirm that one can safely iterate over such a generator; it runs just fine, with memory staying under 500MB at all times.

import tarfile

def generate_examples():
    audio_tarfile = tarfile.open("audios.tar", mode="r|")
    key = 0
    for audio_tarinfo in audio_tarfile:
        audio_name = audio_tarinfo.name
        audio_file_obj = audio_tarfile.extractfile(audio_tarinfo)
        yield key, {"audio": {"path": audio_name, "bytes": audio_file_obj.read()}}
        key += 1

if __name__ == "__main__":
    examples = generate_examples()
    for example in examples:
        pass

Expected results

Memory consumption should be similar to that of the non-huggingface script above.

Actual results

The process is killed after consuming too much memory.

Environment info

  • datasets version: 2.0.1.dev0
  • Platform: Linux-4.19.0-20-cloud-amd64-x86_64-with-debian-10.12
  • Python version: 3.7.12
  • PyArrow version: 6.0.1
  • Pandas version: 1.3.5

JFCeron avatar Mar 29 '22 21:03 JFCeron

Hi! Could it be that you need to free the memory used by tarfile by emptying the list of tar members, by any chance?

        yield key, {"audio": {"path": audio_name, "bytes": audio_file_obj.read()}}
        audio_tarfile.members = []  # free memory
        key += 1

and then you can set DEFAULT_WRITER_BATCH_SIZE to whatever value makes the most sense for your dataset.
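
For reference, here is how that change would slot into the _generate_examples above (a sketch, keeping the same names as your snippet):

    def _generate_examples(self, audio_tarfile_path):
        key = 0
        with tarfile.open(audio_tarfile_path, mode="r|") as audio_tarfile:
            for audio_tarinfo in audio_tarfile:
                audio_name = audio_tarinfo.name
                audio_file_obj = audio_tarfile.extractfile(audio_tarinfo)
                yield key, {"audio": {"path": audio_name, "bytes": audio_file_obj.read()}}
                audio_tarfile.members = []  # drop the cached TarInfo objects so the list doesn't grow with every file
                key += 1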

Let me know if the issue persists (which could happen, given that you managed to run your generator without RAM issues and using os.walk didn't solve the issue)

lhoestq avatar Mar 30 '22 14:03 lhoestq

Thanks for your reply! Tried it but the issue persists.

JFCeron avatar Mar 30 '22 15:03 JFCeron

I also run out of memory when loading mozilla-foundation/common_voice_8_0, which also reads a TAR archive via dl_manager.iter_archive. There seem to be some data files that stay in memory somewhere.

I don't have the issue with other compression formats like gzipped files.

lhoestq avatar Apr 07 '22 16:04 lhoestq

I'm facing a similar memory leak issue when loading cv8, as you said @lhoestq:

load_dataset("mozilla-foundation/common_voice_8_0", "en", use_auth_token=True, writer_batch_size=1)

This issue is happening on a 32GB RAM machine.

Any updates on how to fix this?

jonatasgrosman avatar Apr 24 '22 16:04 jonatasgrosman

I've run a memory profiler to see where the leak comes from:

[screenshot: memory profiler output]

... it seems that it's related to the tarfile lib's buffer reader, but I don't know why it's only happening in the huggingface script.

jonatasgrosman avatar Apr 25 '22 14:04 jonatasgrosman

I have the same problem when loading video into numpy.

yield id,{ 
    "video": imageio.v3.imread(video_path),
    "label": int(label)
}

Since video files are heavy, it can only process a dozen samples before OOM.

lkhphuc avatar May 13 '22 04:05 lkhphuc

For video datasets I think you can just define the max number of videos that can stay in memory by adding this class attribute to your dataset builder:

DEFAULT_WRITER_BATCH_SIZE = 8  # only 8 videos at a time in memory before flushing the dataset writer
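
i.e. something along these lines (a sketch reusing the GeneratorBasedBuilder pattern from the original post; the class name is made up):

class MyVideoDatasetBuilder(datasets.GeneratorBasedBuilder):
    # flush examples to the Arrow writer every 8 videos, so only ~8 decoded arrays are held in RAM at a time
    DEFAULT_WRITER_BATCH_SIZE = 8
    ...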

lhoestq avatar May 13 '22 16:05 lhoestq

The same thing happens for me with load_dataset("mozilla-foundation/common_voice_8_0", "en", use_auth_token=True, writer_batch_size=1) on Azure ML. It seems to fill up tmp and not release that memory until OOM.

ghost avatar Jun 02 '22 10:06 ghost

I'll add that I'm encountering the same issue with load_dataset('wikipedia', 'ceb', runner='DirectRunner', split='train'). Same for 'es' in place of 'ceb'.

dan-the-meme-man avatar Jun 23 '22 02:06 dan-the-meme-man

> I'll add that I'm encountering the same issue with load_dataset('wikipedia', 'ceb', runner='DirectRunner', split='train'). Same for 'es' in place of 'ceb'.

This is because the Apache Beam DirectRunner runs with the full data in memory unfortunately. Optimizing the DirectRunner is not in the scope of the datasets library, but rather in the Apache Beam project I believe. If you have memory issues with the DirectRunner, please consider switching to a machine with more RAM, or to distributed processing runtimes like Spark, Flink or DataFlow. There is a bit of documentation here: https://huggingface.co/docs/datasets/beam
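
For example, something along these lines should hand the processing to Dataflow instead of the in-memory DirectRunner (an untested sketch; the GCP project, region and bucket are placeholders):

from apache_beam.options.pipeline_options import PipelineOptions
from datasets import load_dataset

# placeholder Dataflow settings: replace with your own GCP project, region and GCS bucket
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp/",
)
ds = load_dataset("wikipedia", "ceb", split="train", beam_runner="DataflowRunner", beam_options=options)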

lhoestq avatar Jun 23 '22 13:06 lhoestq

> I'll add that I'm encountering the same issue with load_dataset('wikipedia', 'ceb', runner='DirectRunner', split='train'). Same for 'es' in place of 'ceb'.
>
> This is because the Apache Beam DirectRunner runs with the full data in memory unfortunately. Optimizing the DirectRunner is not in the scope of the datasets library, but rather in the Apache Beam project I believe. If you have memory issues with the DirectRunner, please consider switching to a machine with more RAM, or to distributed processing runtimes like Spark, Flink or DataFlow. There is a bit of documentation here: https://huggingface.co/docs/datasets/beam

Fair enough, but this line of code crashed an AWS instance with 1024GB of RAM! I have also tried with Runner='Flink' on an environment with 51GB of RAM, which also failed.

Apache Beam has tons of open tickets already - is it worth submitting one to them over this?

dan-the-meme-man avatar Jun 23 '22 21:06 dan-the-meme-man

> Fair enough, but this line of code crashed an AWS instance with 1024GB of RAM!

What, wikipedia is not even bigger than 20GB

cc @albertvillanova

lhoestq avatar Jun 28 '22 17:06 lhoestq

> Fair enough, but this line of code crashed an AWS instance with 1024GB of RAM!
>
> What, wikipedia is not even bigger than 20GB
>
> cc @albertvillanova

Luckily, on Colab you can watch the call stack at the bottom of the screen - much of the time and space complexity seems to come from _parse_and_clean_wikicode() rather than the actual download process. As far as I can tell, the script is loading the full dataset and then cleaning it all at once, which is consuming a lot of memory.

dan-the-meme-man avatar Jun 30 '22 03:06 dan-the-meme-man

I think we are mixing many different bugs in this Issue page:

  • TAR archive with audio files
  • video file
  • distributed parsing of Wikipedia using Apache Beam

@dan-the-meme-man may I ask you to open a separate Issue for your problem? Then I will address it. It is important to fix it because we are currently working on a Datasets enhancement to be able to provide all Wikipedias already preprocessed.

On the other hand, I think we could keep this Issue page for the original problem: TAR archive with audio files. That is not fixed yet either.

albertvillanova avatar Jun 30 '22 09:06 albertvillanova

Is there an update on the TAR archive issue with audio files? Happy to lend a hand in fixing this :)

sanchit-gandhi avatar Jul 21 '22 06:07 sanchit-gandhi

I found the issue with Common Voice 8 and opened a PR to fix it: https://huggingface.co/datasets/mozilla-foundation/common_voice_8_0/discussions/2

Basically the metadata dict that contains the transcripts per audio file was continuously getting filled with bytes from f.read() because of this code:

result = metadata[path]
result["audio"] = {"path": path, "bytes": f.read()}

Copying the entry with result = dict(metadata[path]) fixes it: the bytes are no longer added to the metadata dict.
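
In other words, a sketch of the before/after inside the generator loop (same variable names as above):

# before: `result` aliases the entry stored in `metadata`, so the audio bytes
# stay referenced by the metadata dict for the rest of the run
result = metadata[path]
result["audio"] = {"path": path, "bytes": f.read()}

# after: copy the entry first, so the bytes are only referenced by the yielded example
result = dict(metadata[path])
result["audio"] = {"path": path, "bytes": f.read()}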

I also opened PRs to the other CV datasets

lhoestq avatar Jul 28 '22 14:07 lhoestq

Amazing, that's a great find! Thanks @lhoestq!

sanchit-gandhi avatar Jul 28 '22 15:07 sanchit-gandhi

I'm closing this one for now, but feel free to reopen if you encounter other memory issues with audio datasets

lhoestq avatar Aug 16 '22 10:08 lhoestq