Memory leak when streaming
Describe the bug
When I use a dataset with streaming=True, RAM usage keeps growing until it is no longer sustainable.
I understand that Hugging Face keeps some data in RAM during streaming, and that the more dataloader workers there are, the more shards are held in RAM at once. The issue is that RAM usage is not constant: after each new shard is loaded, it climbs higher and higher.
Steps to reproduce the bug
You can run this code and watch your RAM usage: after each shard of 255 examples, RAM usage increases.
```python
from datasets import load_dataset
from torch.utils.data import DataLoader

dataset = load_dataset("WaveGenAI/dataset", streaming=True)
dataloader = DataLoader(dataset["train"], num_workers=3)

for i, data in enumerate(dataloader):
    print(i, end="\r")
```
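If it helps to quantify the growth, here is a minimal sketch of the same loop that logs RSS as it goes (assuming `psutil` is installed; it only measures the main process, not the workers):

```python
# Variant of the reproduction above that logs the main process RSS
# (worker processes are not included). Assumes psutil is installed.
import psutil
from datasets import load_dataset
from torch.utils.data import DataLoader

dataset = load_dataset("WaveGenAI/dataset", streaming=True)
dataloader = DataLoader(dataset["train"], num_workers=3)
process = psutil.Process()

for i, data in enumerate(dataloader):
    if i % 255 == 0:  # roughly once per shard of 255 examples
        print(f"example {i}: RSS {process.memory_info().rss / 1e9:.2f} GB")
```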
Expected behavior
RAM usage should stay constant (only the 3 shards currently being read held in RAM).
Environment info
- `datasets` version: 3.0.1
- Platform: Linux-6.10.5-arch1-1-x86_64-with-glibc2.40
- Python version: 3.12.4
- `huggingface_hub` version: 0.26.0
- PyArrow version: 17.0.0
- Pandas version: 2.2.3
- `fsspec` version: 2024.6.1
I seem to have encountered the same problem when loading a non-streaming dataset with `load_from_disk`: it ends up using hundreds of GB of memory, even though the dataset itself is only about 50 GB.
FYI, when streaming Parquet data, only one row group per worker is loaded in memory at a time.
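For anyone who wants to check this on their own files, a small sketch (assuming PyArrow and a local copy of one shard; `shard.parquet` is a hypothetical path) that prints the row-group layout, since larger row groups mean more data held per worker:

```python
import pyarrow.parquet as pq

# Inspect how a Parquet file is split into row groups; each worker holds
# roughly one row group of decoded data in memory while streaming.
meta = pq.ParquetFile("shard.parquet").metadata  # hypothetical local shard
print("row groups:", meta.num_row_groups)
for i in range(meta.num_row_groups):
    rg = meta.row_group(i)
    print(f"  row group {i}: {rg.num_rows} rows, ~{rg.total_byte_size / 1e6:.1f} MB uncompressed")
```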
Btw, for datasets of embeddings you can likely reduce RAM usage by reading the data as torch tensors directly instead of the default Python lists:
```python
from datasets import load_dataset
from torch.utils.data import DataLoader

dataset = load_dataset("WaveGenAI/dataset", streaming=True).with_format("torch")
dataloader = DataLoader(dataset["train"], num_workers=3)
```
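With the `"torch"` format, numeric columns should come out as torch tensors rather than Python lists, which avoids creating intermediate Python list objects for every batch.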
I'm also hitting this issue.
```python
from datasets import load_dataset, interleave_datasets

# This is what's causing the leak:
batch_datasets = []
for file_path in batch_files:
    dataset = load_dataset(..., streaming=True)
    shuffled_dataset = dataset.shuffle(seed=42, buffer_size=1000)  # 1000-item buffer
    batch_datasets.append(shuffled_dataset)  # Buffer persists

interleaved_dataset = interleave_datasets(batch_datasets, seed=42)
```
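(With many files interleaved this way, each streaming dataset keeps its own 1000-example shuffle buffer alive at the same time, so the buffers pile up.)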
And nothing helps:
```python
del batch_datasets, interleaved_dataset
gc.collect()  # This doesn't work for HuggingFace internal memory structures
```
So my guess is that they wrote this part in native code (Rust?) and forgot to clean up!
Now, if I remove the interleaving and process files sequentially like this, it still leaks:
```python
# Process files one by one - no batching, no interleaving
for file_idx, file_path in enumerate(file_paths):
    dataset = load_dataset("parquet", data_files=file_path, split="train", streaming=True)
    shuffled_dataset = dataset.shuffle(seed=42, buffer_size=1000)

    for record in shuffled_dataset:
        # Process record immediately
        pass

    del dataset, shuffled_dataset
    gc.collect()
```
- File 1: 42.4% memory
- File 2: 42.5% memory
- File 3: 42.5% memory
- File 4: 48.4% memory (+6%)
- File 5: 52.7% memory (+4.3%)
- File 6: 56.7% memory (+4%)
- File 7: 59.6% memory (+2.9%)
- File 8: 62.0% memory (+2.4%)
I had to go back to sequential shuffling (no interleaving) and clean up like this:
```python
import ctypes
import gc

import pyarrow as pa

libc = ctypes.CDLL("libc.so.6")  # glibc only

dataset.cleanup_cache_files()
del dataset, shuffled_dataset
gc.collect()
pa.default_memory_pool().release_unused()
libc.malloc_trim(0)  # when available
```
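(For context: `pa.default_memory_pool().release_unused()` asks Arrow's memory pool to hand unused blocks back to the OS, and on glibc systems `malloc_trim(0)` asks the C allocator to do the same for the process heap, which it otherwise tends to keep cached. Neither helps if something inside the library is still holding a reference to the data.)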
I have also observed these memory leaks inside the huggingface library when developing bghira/captionflow, and had the same outcome of being unable to actually free anything once it happens. I've worked around it by avoiding some of the more damaging parts of the library, but in doing so I've essentially restricted the compatibility of the project.
Could it be a leak in PyArrow, which is used to stream the data from the Parquet files?
I believe it's heavily involved, yeah.
Is there any update on this? I'm seeing the same issue when using multiple streaming datasets for long runs.
See https://github.com/bghira/webshart for a Rust-based implementation (relatively bare minimum for my needs, sorry; open to PRs) that helped me work around the problem.
Thank you!
Btw, since a Dataset uses memory-mapped Arrow files, iterating over the data iterates over memory-mapped files, which loads pages into RAM progressively but discards them once they are no longer needed. Generally the OS does this automatically when memory is needed for something else. So you might see your RSS (memory) go up, but it won't OOM, since data are paged out automatically by the OS.
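If it helps to tell the two apart, here is a rough sketch (assuming `psutil` is installed; `/path/to/dataset` is a placeholder) that compares process RSS with what Arrow has actually allocated on the heap while iterating over a memory-mapped dataset:

```python
import psutil
import pyarrow as pa
from datasets import load_from_disk

ds = load_from_disk("/path/to/dataset")  # placeholder path to an on-disk dataset
process = psutil.Process()

for i, example in enumerate(ds):
    if i % 100_000 == 0:
        rss_gb = process.memory_info().rss / 1e9
        arrow_gb = pa.total_allocated_bytes() / 1e9
        # RSS counts memory-mapped Arrow pages; pa.total_allocated_bytes() only counts
        # Arrow's own heap allocations, so a large gap points at reclaimable page cache.
        print(f"{i}: RSS {rss_gb:.2f} GB, Arrow heap {arrow_gb:.2f} GB")
```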
Hi @lhoestq
It does go OOM, since most of the occupied memory is never freed.
E.g., in my case the compressed data size is about 400 GB across over 100 datasets, yet the occupied memory grows to over a TB for longer runs. I think it's most likely related to the dataloader buffer: every time it is refreshed, the memory occupied by the previous buffer is not freed.
MAX MEM: 908 Gbytes; AVG MEM: 843 Gbytes;