Memory leak when streaming
Describe the bug
When I use a dataset with streaming=True, RAM usage keeps growing until it is no longer sustainable.
I understand that Hugging Face keeps some data in RAM during streaming, and that the more dataloader workers there are, the more shards are held in RAM at once. The issue is that RAM usage is not constant: after each new shard is loaded, it climbs higher and higher.
Steps to reproduce the bug
You can run this code and watch your RAM usage: after each shard of 255 examples, RAM usage increases.
```python
from datasets import load_dataset
from torch.utils.data import DataLoader

dataset = load_dataset("WaveGenAI/dataset", streaming=True)
dataloader = DataLoader(dataset["train"], num_workers=3)

for i, data in enumerate(dataloader):
    print(i, end="\r")
```
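If it helps to quantify the growth, here is a minimal sketch of the same loop that logs RSS as it goes (assuming `psutil` is installed; it only measures the main process, not the workers):

```python
# Variant of the reproduction above that logs the main process RSS
# (worker processes are not included). Assumes psutil is installed.
import psutil
from datasets import load_dataset
from torch.utils.data import DataLoader

dataset = load_dataset("WaveGenAI/dataset", streaming=True)
dataloader = DataLoader(dataset["train"], num_workers=3)
process = psutil.Process()

for i, data in enumerate(dataloader):
    if i % 255 == 0:  # roughly once per shard of 255 examples
        print(f"example {i}: RSS {process.memory_info().rss / 1e9:.2f} GB")
```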
Expected behavior
RAM usage should stay constant (only the 3 shards currently being read held in RAM).
Environment info
- `datasets` version: 3.0.1
- Platform: Linux-6.10.5-arch1-1-x86_64-with-glibc2.40
- Python version: 3.12.4
- `huggingface_hub` version: 0.26.0
- PyArrow version: 17.0.0
- Pandas version: 2.2.3
- `fsspec` version: 2024.6.1
I seem to have encountered the same problem when loading a non-streaming dataset with `load_from_disk`: it ends up using hundreds of GB of memory, even though the dataset itself is only about 50 GB.
FYI, when streaming Parquet data, only one row group per worker is loaded in memory at a time.
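For anyone who wants to check this on their own files, a small sketch (assuming PyArrow and a local copy of one shard; `shard.parquet` is a hypothetical path) that prints the row-group layout, since larger row groups mean more data held per worker:

```python
import pyarrow.parquet as pq

# Inspect how a Parquet file is split into row groups; each worker holds
# roughly one row group of decoded data in memory while streaming.
meta = pq.ParquetFile("shard.parquet").metadata  # hypothetical local shard
print("row groups:", meta.num_row_groups)
for i in range(meta.num_row_groups):
    rg = meta.row_group(i)
    print(f"  row group {i}: {rg.num_rows} rows, ~{rg.total_byte_size / 1e6:.1f} MB uncompressed")
```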
Btw, for datasets of embeddings you can likely reduce RAM usage by reading the data as torch tensors directly instead of the default Python lists:
```python
from datasets import load_dataset
from torch.utils.data import DataLoader

dataset = load_dataset("WaveGenAI/dataset", streaming=True).with_format("torch")
dataloader = DataLoader(dataset["train"], num_workers=3)
```
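With the `"torch"` format, numeric columns should come out as torch tensors rather than Python lists, which avoids creating intermediate Python list objects for every batch.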
I'm also hitting this issue.
```python
from datasets import load_dataset, interleave_datasets

# This is what's causing the leak:
batch_datasets = []
for file_path in batch_files:
    dataset = load_dataset(..., streaming=True)
    shuffled_dataset = dataset.shuffle(seed=42, buffer_size=1000)  # 1000-item buffer
    batch_datasets.append(shuffled_dataset)  # Buffer persists

interleaved_dataset = interleave_datasets(batch_datasets, seed=42)
```
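(With many files interleaved this way, each streaming dataset keeps its own 1000-example shuffle buffer alive at the same time, so the buffers pile up.)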
And nothing helps:
```python
del batch_datasets, interleaved_dataset
gc.collect()  # This doesn't work for HuggingFace internal memory structures
```
So my guess is that they wrote this part in native code (Rust?) and forgot to clean up!
Now, if I remove the interleaving and process files sequentially like this, it still leaks:
```python
# Process files one by one - no batching, no interleaving
for file_idx, file_path in enumerate(file_paths):
    dataset = load_dataset("parquet", data_files=file_path, split="train", streaming=True)
    shuffled_dataset = dataset.shuffle(seed=42, buffer_size=1000)

    for record in shuffled_dataset:
        # Process record immediately
        pass

    del dataset, shuffled_dataset
    gc.collect()
```
- File 1: 42.4% memory
- File 2: 42.5% memory
- File 3: 42.5% memory
- File 4: 48.4% memory (+6%)
- File 5: 52.7% memory (+4.3%)
- File 6: 56.7% memory (+4%)
- File 7: 59.6% memory (+2.9%)
- File 8: 62.0% memory (+2.4%)
I had to go back to sequential shuffling (no interleaving) and clean up like this:
```python
import ctypes
import gc

import pyarrow as pa

libc = ctypes.CDLL("libc.so.6")  # glibc only

dataset.cleanup_cache_files()
del dataset, shuffled_dataset
gc.collect()
pa.default_memory_pool().release_unused()
libc.malloc_trim(0)  # when available
```
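(For context: `pa.default_memory_pool().release_unused()` asks Arrow's memory pool to hand unused blocks back to the OS, and on glibc systems `malloc_trim(0)` asks the C allocator to do the same for the process heap, which it otherwise tends to keep cached. Neither helps if something inside the library is still holding a reference to the data.)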
I have also observed these memory leaks inside the huggingface library when developing bghira/captionflow, and had the same outcome of being unable to actually free anything once it happens. I've worked around it by avoiding some of the more damaging parts of the library, but in doing so I've essentially restricted the compatibility of the project.
Could it be a leak in PyArrow, which is used to stream the data from the Parquet files?
I believe it's heavily involved, yeah.
Is there any update on this? I'm seeing the same issue when using multiple streaming datasets for long runs.
See https://github.com/bghira/webshart for a Rust-based implementation (relatively bare minimum for my needs, sorry; open to PRs) that helped me work around the problem.
Thank you!
Btw, since a Dataset uses memory-mapped Arrow files, iterating over the data iterates over memory-mapped files, which loads pages into RAM progressively but discards them once they are no longer needed. Generally the OS does this automatically when memory is needed for something else. So you might see your RSS (memory) go up, but it won't OOM, since data are paged out automatically by the OS.
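If it helps to tell the two apart, here is a rough sketch (assuming `psutil` is installed; `/path/to/dataset` is a placeholder) that compares process RSS with what Arrow has actually allocated on the heap while iterating over a memory-mapped dataset:

```python
import psutil
import pyarrow as pa
from datasets import load_from_disk

ds = load_from_disk("/path/to/dataset")  # placeholder path to an on-disk dataset
process = psutil.Process()

for i, example in enumerate(ds):
    if i % 100_000 == 0:
        rss_gb = process.memory_info().rss / 1e9
        arrow_gb = pa.total_allocated_bytes() / 1e9
        # RSS counts memory-mapped Arrow pages; pa.total_allocated_bytes() only counts
        # Arrow's own heap allocations, so a large gap points at reclaimable page cache.
        print(f"{i}: RSS {rss_gb:.2f} GB, Arrow heap {arrow_gb:.2f} GB")
```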
Hi @lhoestq
It does go OOM, since most of the occupied memory is never freed.
E.g., in my case the compressed data size is about 400 GB across over 100 datasets, yet the occupied memory grows to over a TB for longer runs. I think it's most likely related to the dataloader buffer: every time it is refreshed, the memory occupied by the previous buffer is not freed.
MAX MEM: 908 Gbytes; AVG MEM: 843 Gbytes;