`load_dataset` consumes too much memory for audio + tar archives
Description
`load_dataset` consumes more and more memory until it's killed, even though it's made with a generator. I'm adding a loading script for a new dataset, made up of ~15s audio clips coming from a tar file. I tried setting `DEFAULT_WRITER_BATCH_SIZE = 1` as per the discussion in #741, but the problem persists.
Steps to reproduce the bug
Here's my implementation of `_generate_examples`:
```python
class MyDatasetBuilder(datasets.GeneratorBasedBuilder):
    DEFAULT_WRITER_BATCH_SIZE = 1
    ...
    def _split_generators(self, dl_manager):
        archive_path = dl_manager.download(_DL_URLS[self.config.name])
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={
                    "audio_tarfile_path": archive_path["audio_tarfile"]
                },
            ),
        ]

    def _generate_examples(self, audio_tarfile_path):
        key = 0
        with tarfile.open(audio_tarfile_path, mode="r|") as audio_tarfile:
            for audio_tarinfo in audio_tarfile:
                audio_name = audio_tarinfo.name
                audio_file_obj = audio_tarfile.extractfile(audio_tarinfo)
                yield key, {"audio": {"path": audio_name, "bytes": audio_file_obj.read()}}
                key += 1
```
I then try to load via `ds = load_dataset('./datasets/my_new_dataset', writer_batch_size=1)`, and memory usage grows until all 8GB of my machine are taken and the process is killed (`Killed`). I also tried an untarred version of this using `os.walk`, but the same happened.
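(For reference, a minimal sketch of what that untarred `os.walk` variant could look like; this is a reconstruction, not the author's exact code:)

```python
import os

def _generate_examples(self, audio_dir_path):
    # reconstruction of the untarred variant: walk the extracted directory
    # and yield one example per audio file
    key = 0
    for root, _, files in os.walk(audio_dir_path):
        for name in sorted(files):
            path = os.path.join(root, name)
            with open(path, "rb") as f:
                yield key, {"audio": {"path": name, "bytes": f.read()}}
            key += 1
```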
I created a script to confirm that one can safely go through such a generator, which runs just fine with memory <500MB at all times.
```python
import tarfile

def generate_examples():
    audio_tarfile = tarfile.open("audios.tar", mode="r|")
    key = 0
    for audio_tarinfo in audio_tarfile:
        audio_name = audio_tarinfo.name
        audio_file_obj = audio_tarfile.extractfile(audio_tarinfo)
        yield key, {"audio": {"path": audio_name, "bytes": audio_file_obj.read()}}
        key += 1

if __name__ == "__main__":
    examples = generate_examples()
    for example in examples:
        pass
```
Expected results
Memory consumption should be similar to the non-huggingface script.
Actual results
Process is killed after consuming too much memory.
Environment info
- `datasets` version: 2.0.1.dev0
- Platform: Linux-4.19.0-20-cloud-amd64-x86_64-with-debian-10.12
- Python version: 3.7.12
- PyArrow version: 6.0.1
- Pandas version: 1.3.5
Hi! Could it be because you need to free the memory used by `tarfile` by emptying the tar `members` by any chance?

```python
yield key, {"audio": {"path": audio_name, "bytes": audio_file_obj.read()}}
audio_tarfile.members = []  # free memory
key += 1
```

And then you can set `DEFAULT_WRITER_BATCH_SIZE` to whatever value makes more sense for your dataset.

Let me know if the issue persists (which could happen, given that you managed to run your generator without RAM issues and using `os.walk` didn't solve the issue).
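(Putting that suggestion back into the generator from the original post, the adjusted method would look roughly like this:)

```python
def _generate_examples(self, audio_tarfile_path):
    key = 0
    with tarfile.open(audio_tarfile_path, mode="r|") as audio_tarfile:
        for audio_tarinfo in audio_tarfile:
            audio_name = audio_tarinfo.name
            audio_file_obj = audio_tarfile.extractfile(audio_tarinfo)
            yield key, {"audio": {"path": audio_name, "bytes": audio_file_obj.read()}}
            # tarfile keeps every TarInfo it has seen in .members;
            # clearing the list after each example keeps it from growing
            audio_tarfile.members = []
            key += 1
```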
Thanks for your reply! Tried it but the issue persists.
I also run out of memory when loading `mozilla-foundation/common_voice_8_0`, which also uses `tarfile` via `dl_manager.iter_archive`. There seem to be some data files that stay in memory somewhere. I don't have the issue with other compression formats like gzipped files.
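(For context, `dl_manager.iter_archive` streams `(path, file_object)` pairs out of an archive; a rough sketch of how the builder above could use it, adapted from the original snippets and therefore partly an assumption:)

```python
def _split_generators(self, dl_manager):
    archive_path = dl_manager.download(_DL_URLS[self.config.name])
    return [
        datasets.SplitGenerator(
            name=datasets.Split.TRAIN,
            gen_kwargs={
                # iter_archive lazily yields (path_inside_archive, file_object) pairs
                "audio_files": dl_manager.iter_archive(archive_path["audio_tarfile"]),
            },
        ),
    ]

def _generate_examples(self, audio_files):
    for key, (path, f) in enumerate(audio_files):
        yield key, {"audio": {"path": path, "bytes": f.read()}}
```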
I'm facing a similar memory leak issue when loading cv8, as you said @lhoestq:

```python
load_dataset("mozilla-foundation/common_voice_8_0", "en", use_auth_token=True, writer_batch_size=1)
```

This issue is happening on a 32GB RAM machine. Any updates on how to fix this?
I've run a memory profiler to see where the leak comes from:

... it seems that it's related to the tarfile lib buffer reader. But I don't know why it's only happening with the huggingface script.
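(A minimal way to collect a similar allocation trace with only the standard library; this is an illustration, not the profiler actually used above:)

```python
import tracemalloc
from datasets import load_dataset

tracemalloc.start(25)  # keep 25 frames per allocation for readable tracebacks
try:
    ds = load_dataset("mozilla-foundation/common_voice_8_0", "en",
                      use_auth_token=True, writer_batch_size=1)
finally:
    # print the ten call sites holding the most memory, even if loading
    # is interrupted (e.g. with Ctrl+C) before it finishes
    snapshot = tracemalloc.take_snapshot()
    for stat in snapshot.statistics("lineno")[:10]:
        print(stat)
```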
I have the same problem when loading videos into numpy:

```python
yield id, {
    "video": imageio.v3.imread(video_path),
    "label": int(label),
}
```

Since video files are heavy, it can only process a dozen samples before OOM.
For video datasets I think you can just define the max number of videos that can stay in memory by adding this class attribute to your dataset builder:

```python
DEFAULT_WRITER_BATCH_SIZE = 8  # only 8 videos at a time in memory before flushing the dataset writer
```
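(In context, that looks roughly like the following; the class name is illustrative:)

```python
import datasets

class MyVideoDataset(datasets.GeneratorBasedBuilder):
    # flush examples to the Arrow writer every 8 samples, so at most
    # 8 decoded videos are held in memory at any time
    DEFAULT_WRITER_BATCH_SIZE = 8
    ...
```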
The same thing happens for me with `load_dataset("mozilla-foundation/common_voice_8_0", "en", use_auth_token=True, writer_batch_size=1)` on Azure ML. It seems to fill up `tmp` and not release that memory until OOM.
I'll add that I'm encountering the same issue with `load_dataset('wikipedia', 'ceb', runner='DirectRunner', split='train')`. Same for `'es'` in place of `'ceb'`.
This is because the Apache Beam `DirectRunner` runs with the full data in memory, unfortunately. Optimizing the `DirectRunner` is not in the scope of the `datasets` library, but rather in the Apache Beam project I believe. If you have memory issues with the `DirectRunner`, please consider switching to a machine with more RAM, or to distributed processing runtimes like Spark, Flink or DataFlow. There is a bit of documentation here: https://huggingface.co/docs/datasets/beam
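(For illustration, the linked docs describe selecting a different Beam runner when loading such a dataset; the argument name and runner string below are assumptions taken from that page:)

```python
from datasets import load_dataset

# assumption: the Beam runner is selected via the `beam_runner` argument,
# per https://huggingface.co/docs/datasets/beam
ds = load_dataset('wikipedia', 'ceb', beam_runner='SparkRunner', split='train')
```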
Fair enough, but this line of code crashed an AWS instance with 1024GB of RAM! I have also tried with `Runner='Flink'` on an environment with 51GB of RAM, which also failed. Apache Beam has tons of open tickets already - is it worth submitting one to them over this?
What, wikipedia is not even bigger than 20GB
cc @albertvillanova
Luckily, on Colab you can watch the call stack at the bottom of the screen - much of the time and space complexity seems to come from `_parse_and_clean_wikicode()` rather than the actual download process. As far as I can tell, the script is loading the full dataset and then cleaning it all at once, which is consuming a lot of memory.
I think we are mixing many different bugs in this Issue page:
- TAR archive with audio files
- video file
- distributed parsing of Wikipedia using Apache Beam
@dan-the-meme-man may I ask you to open a separate Issue for your problem? Then I will address it. It is important to fix it because we are currently working on a Datasets enhancement to be able to provide all Wikipedias already preprocessed.
On the other hand, I think we could keep this Issue page for the original problem: TAR archive with audio files. That is not fixed yet either.
Is there an update on the TAR archive issue with audio files? Happy to lend a hand in fixing this :)
I found the issue with Common Voice 8 and opened a PR to fix it: https://huggingface.co/datasets/mozilla-foundation/common_voice_8_0/discussions/2
Basically, the `metadata` dict that contains the transcripts per audio file was continuously getting filled with bytes from `f.read()` because of this code:

```python
result = metadata[path]
result["audio"] = {"path": path, "bytes": f.read()}
```

Copying the result with `result = dict(metadata[path])` fixes it: the bytes are no longer added to `metadata`.
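(i.e., the fixed version of that snippet is simply:)

```python
result = dict(metadata[path])  # shallow copy, so the audio bytes are not written back into `metadata`
result["audio"] = {"path": path, "bytes": f.read()}
```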
I also opened PRs to the other CV datasets
Amazing, that's a great find! Thanks @lhoestq!
I'm closing this one for now, but feel free to reopen if you encounter other memory issues with audio datasets