
Memory overhead in multiprocessing

Open jordane95 opened this issue 1 year ago • 8 comments

When using the fasttext filter, I find that the fasttext model is copied into each process, which introduces significant memory overhead. To my knowledge, however, the fasttext model is read-only and could be stored in a shared memory space accessible to all processes.

Can we optimize the current code to save memory? mp.Manager can create shared memory and avoid copying, but I find it quite hard to integrate into the current code, since the manager is initialized at the executor level and not passed to each pipeline step.

jordane95 avatar Apr 24 '24 08:04 jordane95
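One alternative to a manager, as a sketch rather than datatrove's actual mechanism: on platforms that support the "fork" start method (e.g. Linux), loading the model once in the parent before creating workers lets children read it through copy-on-write pages, so read-only access does not duplicate the memory. CPython reference counting can still dirty some pages, so the savings are partial. MODEL below is a hypothetical stand-in for the fasttext model.

```python
import multiprocessing as mp

# Stand-in for a large read-only model, loaded once in the parent process.
MODEL = {"weights": list(range(1_000))}

def score(i):
    # With the "fork" start method, workers inherit MODEL via
    # copy-on-write pages: reading it does not duplicate the memory.
    return MODEL["weights"][i]

def run():
    ctx = mp.get_context("fork")  # fork is Unix-only
    with ctx.Pool(2) as pool:
        return pool.map(score, [1, 2, 3])

if __name__ == "__main__":
    print(run())  # [1, 2, 3]
```

This only helps for objects fully materialized before the fork; anything loaded lazily inside a worker is still per-process.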

Indeed, there might be some complications. I would be curious, however, to know the performance (speed) implications of loading the model from shared memory. Have you tested this?

guipenedo avatar Apr 24 '24 14:04 guipenedo
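For measuring this, the stdlib multiprocessing.shared_memory module is one way to experiment: the payload is copied once into a named block, and other processes attach by name and read it in place. A minimal sketch, where the payload is a hypothetical stand-in for serialized model bytes:

```python
from multiprocessing import shared_memory

# Copy a payload once into a named shared-memory block; any other process
# can then attach by name and read it without duplicating the data.
payload = b"fasttext-model-bytes"  # stand-in for large model weights
owner = shared_memory.SharedMemory(create=True, size=len(payload))
owner.buf[:len(payload)] = payload

# A worker process would attach by name instead of receiving a copy:
reader = shared_memory.SharedMemory(name=owner.name)
seen = bytes(reader.buf[:len(payload)])  # read through shared memory

reader.close()
owner.close()
owner.unlink()  # free the block once all readers are done
```

The catch for fasttext specifically is that its Python API loads models from a file path, so feeding it raw shared bytes would need extra plumbing (e.g. a memfd or tmpfs path); that part is not shown here.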

I have a question regarding memory overhead. I created and ran an executor designed to count tokens on approximately 2 TB of text (JSONL), but it gets stuck every time I run it. According to the memory and CPU usage data, memory usage fills up the 256 GB I have available, and after getting stuck, CPU usage drops from 99% to 0%.

The problem is that there are no error messages in the log, making it impossible to resolve the issue. Does anyone have any suggestions on how to address this? I suspect this might be a memory overhead issue.

justHungryMan avatar May 24 '24 09:05 justHungryMan
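Since the run above hangs with no error in the log, one low-tech diagnostic is to log each worker's peak RSS between pipeline steps, to see which stage's memory grows before the stall. A sketch using only the stdlib (the resource module is Unix-only, and note the ru_maxrss units differ by platform):

```python
import resource
import sys

def peak_rss_mib():
    # Peak resident set size of this process so far.
    # ru_maxrss is reported in KiB on Linux but in bytes on macOS.
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    divisor = 1024 ** 2 if sys.platform == "darwin" else 1024
    return rss / divisor

# Call between pipeline steps (or periodically) and write to the log:
print(f"peak RSS so far: {peak_rss_mib():.1f} MiB")
```

Watching this value per step can at least narrow a silent OOM down to one stage (e.g. tokenization vs. extraction) before the kernel OOM killer intervenes without leaving a Python traceback.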

Hi, is your problem solved now? I encountered similar issues (unexpected OOMs resulting in failed jobs). I suspect the source of the unexpected OOMs may be a few documents with very long content.

SinclairCoder avatar Aug 26 '24 12:08 SinclairCoder

Same issue here. I applied a 20k-word limit per document beforehand, which solved most of it, but I still get a few OOMs (possibly also due to the size of specific input files?). It would be nice to circumvent the issue, as it fails silently…

Pclanglais avatar Aug 30 '24 19:08 Pclanglais

So I found a fix: lowering the tokenizer batch size. 1000 runs fine (or go even lower to keep all available CPUs busy on long texts).

Actually, in tokenizer.py, 1000 was meant to be the default value. The docstring says: batch_size (int): batch size for tokenization (default: 1000)

While the code has: batch_size: int = 10000,  # batch size for tokenization

Maybe bringing back 1000 would be safer?

Pclanglais avatar Aug 31 '24 00:08 Pclanglais
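For intuition on why lowering batch_size helps: with batched tokenization, peak memory scales roughly with batch_size times the average tokenized document size, so a handful of very long documents in a 10000-document batch can blow past available RAM, while a 1000-document batch bounds the damage. A generic sketch of the batching pattern (not datatrove's actual code):

```python
from itertools import islice

def batched(docs, batch_size):
    # Yield lists of at most batch_size documents, so only one batch's
    # worth of tokenized output needs to be resident at a time.
    it = iter(docs)
    while batch := list(islice(it, batch_size)):
        yield batch

# 25 documents with batch_size=10 -> batches of 10, 10, 5
sizes = [len(b) for b in batched(range(25), 10)]
```

Smaller batches trade a little throughput (more per-batch overhead) for a much tighter bound on peak memory, which is usually the right trade on heterogeneous web text.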

My OOM case happened in the text extractor. But I do not know how to fix it. Sad.

SinclairCoder avatar Aug 31 '24 11:08 SinclairCoder

Reducing workers or batch_size temporarily fixes the memory overflows, but the real issue is that the module cannot detect or report these problems itself. Enhancements are needed for stable, efficient operation.

justHungryMan avatar Aug 31 '24 11:08 justHungryMan

Got it. Typically, I start n tasks to process data (a pipeline might consist of the WARC reader, URL filter, text extractor, and writer). However, many tasks (sometimes half of them or more) fail due to OOM. I have to rerun the script to resume these tasks, which takes more time and more memory, and since I don't know the exact memory requirement, they could fail again. I'm struggling with it.

Any suggestions are welcome!

cc @guipenedo

SinclairCoder avatar Aug 31 '24 12:08 SinclairCoder