
Memory leak when streaming

Open Jourdelune opened this issue 1 year ago • 11 comments

Describe the bug

I'm trying to use a dataset with streaming=True; the issue is that RAM usage grows higher and higher until it is no longer sustainable.

I understand that Hugging Face keeps some data in RAM during streaming, and that the more DataLoader workers there are, the more shards are held in RAM at once. The issue is that the RAM usage is not constant: after each new shard is loaded, it keeps climbing.

Steps to reproduce the bug

You can run this code and watch your RAM usage: after each shard of 255 examples, it increases further.

from datasets import load_dataset
from torch.utils.data import DataLoader

dataset = load_dataset("WaveGenAI/dataset", streaming=True)

dataloader = DataLoader(dataset["train"], num_workers=3)

for i, data in enumerate(dataloader):
    print(i, end="\r")
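
A minimal way to watch that growth per batch (a sketch, assuming psutil is installed; the reporting interval is arbitrary):

import os

import psutil
from datasets import load_dataset
from torch.utils.data import DataLoader

process = psutil.Process(os.getpid())
dataset = load_dataset("WaveGenAI/dataset", streaming=True)
dataloader = DataLoader(dataset["train"], num_workers=3)

for i, data in enumerate(dataloader):
    if i % 1000 == 0:
        # RSS of the main process only; each worker process has its own RSS on top of this.
        print(f"batch {i}: rss = {process.memory_info().rss / 1e9:.2f} GB")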

Expected behavior

The RAM usage should stay constant (just 3 shards loaded in RAM at a time).

Environment info

  • datasets version: 3.0.1
  • Platform: Linux-6.10.5-arch1-1-x86_64-with-glibc2.40
  • Python version: 3.12.4
  • huggingface_hub version: 0.26.0
  • PyArrow version: 17.0.0
  • Pandas version: 2.2.3
  • fsspec version: 2024.6.1

Jourdelune avatar Oct 31 '24 13:10 Jourdelune

I seem to have encountered the same problem when loading non-streaming datasets with load_from_disk: memory usage reaches hundreds of GB, even though the dataset itself is only 50 GB.

enze5088 avatar Nov 07 '24 15:11 enze5088

FYI, when streaming Parquet data, only one row group per worker is loaded in memory at a time.
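
To see how much data a row group actually represents, you can inspect one of the Parquet shards (a small sketch; the file path is a placeholder, not an actual file from the dataset):

import pyarrow.parquet as pq

# Inspect one downloaded Parquet shard to estimate how much memory a single
# row group (i.e. what each streaming worker holds at a time) represents.
# "shard.parquet" is a placeholder path.
pf = pq.ParquetFile("shard.parquet")
for i in range(pf.metadata.num_row_groups):
    rg = pf.metadata.row_group(i)
    print(f"row group {i}: {rg.num_rows} rows, {rg.total_byte_size / 1e6:.1f} MB uncompressed")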

Btw, for datasets of embeddings you can surely reduce RAM usage by reading the data as torch tensors directly instead of the default Python lists:

from datasets import load_dataset
from torch.utils.data import DataLoader

dataset = load_dataset("WaveGenAI/dataset", streaming=True).with_format("torch")

dataloader = DataLoader(dataset["train"], num_workers=3)
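
Continuing the snippet above, a quick sanity check that the torch formatting is applied (a small sketch; the exact types depend on the dataset's columns):

first = next(iter(dataset["train"]))
# With .with_format("torch"), numeric columns come back as torch.Tensor
# instead of nested Python lists of floats, which carry much more per-element overhead.
print({k: type(v) for k, v in first.items()})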

lhoestq avatar Nov 18 '24 11:11 lhoestq

I'm also hitting this issue.

  # This is what's causing the leak:
  from datasets import load_dataset, interleave_datasets

  batch_datasets = []
  for file_path in batch_files:
      dataset = load_dataset(..., streaming=True)
      shuffled_dataset = dataset.shuffle(seed=42, buffer_size=1000)  # each stream keeps a 1000-item shuffle buffer
      batch_datasets.append(shuffled_dataset)  # buffer persists as long as the dataset object does

  interleaved_dataset = interleave_datasets(batch_datasets, seed=42)

And nothing helps:

  del batch_datasets, interleaved_dataset
  gc.collect()  # This doesn't work for HuggingFace internal memory structures

so my guess is that the native code underneath forgot to clean up!!!

Now, if I remove the interleaving and process files sequentially like this, it still leaks:


     # Process files one by one - no batching, no interleaving
     for file_idx, file_path in enumerate(file_paths):
         dataset = load_dataset("parquet", data_files=file_path, split="train", streaming=True)
         shuffled_dataset = dataset.shuffle(seed=42, buffer_size=1000) 
         
         for record in shuffled_dataset:
             # Process record immediately
             pass
         
         del dataset, shuffled_dataset
         gc.collect()
  • File 1: 42.4% memory
  • File 2: 42.5% memory
  • File 3: 42.5% memory
  • File 4: 48.4% memory (+6%)
  • File 5: 52.7% memory (+4.3%)
  • File 6: 56.7% memory (+4%)
  • File 7: 59.6% memory (+2.9%)
  • File 8: 62.0% memory (+2.4%)

I had to go back to sequential shuffling (NO interleaving) and clean up like this:

     dataset.cleanup_cache_files()
     del dataset, shuffled_dataset
     gc.collect()
     pa.default_memory_pool().release_unused()
     libc.malloc_trim(0)  # when available
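
For completeness, a self-contained version of that cleanup sequence might look like the sketch below. The ctypes lookup is an assumption about how `libc` was obtained; it only applies on Linux with glibc.

     import ctypes
     import gc

     import pyarrow as pa

     gc.collect()                                # drop unreachable Python objects first
     pa.default_memory_pool().release_unused()   # ask Arrow's pool to hand unused memory back to the allocator
     try:
         libc = ctypes.CDLL("libc.so.6")         # glibc only; assumption about how `libc` was loaded
         libc.malloc_trim(0)                     # return freed heap pages to the OS
     except OSError:
         pass                                    # not glibc / not Linux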

SuperSonnix71 avatar Aug 05 '25 11:08 SuperSonnix71

I have also observed these memory leaks inside the huggingface library while developing bghira/captionflow, with the same outcome: nothing can actually be freed once it occurs. I've worked around it by avoiding some of the more damaging parts of the library, but in doing so I've essentially restricted the project's compatibility.

bghira avatar Aug 29 '25 18:08 bghira

Could it be a leak from PyArrow, which is used to stream the data from the Parquet files?
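
One way to narrow that down would be to log the process RSS next to what PyArrow's default memory pool reports as allocated (a diagnostic sketch, assuming psutil is installed):

import os

import psutil
import pyarrow as pa

rss = psutil.Process(os.getpid()).memory_info().rss
pool = pa.default_memory_pool()
# If RSS keeps climbing while bytes_allocated() stays flat, the retained memory
# is more likely held by the allocator or Python objects than by Arrow buffers.
print(f"rss: {rss / 1e9:.2f} GB | "
      f"arrow allocated: {pool.bytes_allocated() / 1e9:.2f} GB | "
      f"arrow peak: {pool.max_memory() / 1e9:.2f} GB | "
      f"backend: {pool.backend_name}")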

lhoestq avatar Sep 02 '25 12:09 lhoestq

I believe it's heavily involved, yeah.

bghira avatar Sep 02 '25 12:09 bghira

Is there any update on this? I’m seeing the same issue when using multiple streaming datasets for long runs.

meetdoshi90 avatar Dec 09 '25 16:12 meetdoshi90

See https://github.com/bghira/webshart for a Rust-based implementation (relatively bare minimum for my needs, sorry - open to PRs) that helped me work around the problem.

bghira avatar Dec 09 '25 16:12 bghira

Thank you ✌🏻

meetdoshi90 avatar Dec 09 '25 17:12 meetdoshi90

Btw, since a Dataset uses memory-mapped Arrow files, iterating over the data iterates over memory-mapped files, which loads pages into RAM progressively and discards them once they are no longer needed. Generally, the OS does this automatically when the memory is needed for something else. So you might see your RSS (memory) go up, but it won't OOM, since data are paged out automatically by the OS.
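
On Linux you can check how much of that RSS is file-backed (memory-mapped Arrow data the OS can reclaim) versus anonymous memory, for example with psutil (a sketch; the `shared` field is Linux-specific):

import os
import psutil

mem = psutil.Process(os.getpid()).memory_info()
# On Linux, `shared` is dominated by file-backed pages such as memory-mapped
# Arrow files; the OS can evict those under memory pressure, unlike anonymous memory.
print(f"rss: {mem.rss / 1e9:.2f} GB, shared: {mem.shared / 1e9:.2f} GB, "
      f"anonymous (approx.): {(mem.rss - mem.shared) / 1e9:.2f} GB")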

lhoestq avatar Dec 09 '25 18:12 lhoestq

Hi @lhoestq, it does go OOM, since most of the occupied memory does not free up. E.g., in my case the compressed data size is about 400 GB across over 100 datasets, yet the occupied memory grows to over a TB for longer runs. I think it's most likely related to the dataloader buffer: every time it is refreshed, it fails to free the memory occupied by the previous buffer. MAX MEM: 908 Gbytes; AVG MEM: 843 Gbytes.
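
If the DataLoader's prefetch buffers are indeed the culprit, one thing that may be worth trying is bounding them explicitly, reusing the DataLoader setup from the original report (a sketch, not a confirmed fix; the values shown are just the PyTorch defaults made explicit):

from torch.utils.data import DataLoader

dataloader = DataLoader(
    dataset["train"],
    num_workers=3,
    prefetch_factor=2,         # batches buffered per worker (PyTorch default)
    persistent_workers=False,  # tear workers down between epochs so their buffers are released
)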

meetdoshi90 avatar Dec 09 '25 18:12 meetdoshi90