
[Bug]: Document parsing is complete, but the memory is not released.

Open danny-zhu opened this issue 1 year ago • 11 comments

Is there an existing issue for the same bug?

  • [X] I have checked the existing issues.

RAGFlow workspace code commit ID

0.14.1

RAGFlow image version

0.14.1

Other environment information

Ubuntu 24.04

Actual behavior

After uploading PDF documents for parsing and embedding, memory usage only increases and never decreases.

Expected behavior

No response

Steps to reproduce

Upload several large PDF documents.

Additional information

No response

danny-zhu avatar Dec 13 '24 11:12 danny-zhu

Do you have an estimate on how much (k/M)bytes it "leaks" per parsed document?
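One rough way to get such an estimate is to sample the process's resident set size before and after each parse. A minimal sketch, assuming psutil is installed and `parse_document` is just a hypothetical stand-in for whatever parsing/embedding call is being profiled:

```python
import os

import psutil  # assumption: psutil is available in the environment


def rss_mb() -> float:
    """Resident set size of the current process, in MB."""
    return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2


def measure(parse_document, paths):
    """Print how much RSS grows (and stays grown) per parsed document."""
    for path in paths:
        before = rss_mb()
        parse_document(path)  # hypothetical stand-in for the parsing/embedding call
        after = rss_mb()
        print(f"{path}: +{after - before:.1f} MB retained after parsing")
```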

Snify89 avatar Dec 13 '24 14:12 Snify89

What about using a SaaS embedding model or an Ollama/Xinference-served embedding model?
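With an out-of-process embedding server, the embedding memory lives in that server rather than in the RAGFlow task executor, which helps separate parsing growth from embedding growth. A minimal sketch of calling such a server directly, assuming a local Ollama on its default port with an embedding model already pulled (the model name here is only an example):

```python
import requests  # assumption: a local Ollama server is running on the default port

resp = requests.post(
    "http://localhost:11434/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": "sample chunk of parsed text"},
    timeout=60,
)
resp.raise_for_status()
embedding = resp.json()["embedding"]  # the vector is computed in the Ollama process, not in RAGFlow
print(len(embedding))
```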

KevinHuSh avatar Dec 16 '24 01:12 KevinHuSh

Do you have an estimate on how much (k/M)bytes it "leaks" per parsed document?

The PDF documents I uploaded are approximately 40-50 MB each. Embedding a single PDF document consumes around 40 GB of memory.


danny-zhu avatar Dec 16 '24 10:12 danny-zhu

What about using a SaaS embedding model or an Ollama/Xinference-served embedding model?

Memory is still not released.

danny-zhu avatar Dec 16 '24 10:12 danny-zhu

I noticed the same today. I set up RAGFlow on a cloud instance with 16 GB of RAM, and it was not enough to ingest ~15 PDFs of 10-20 pages each; RAM usage was already around 20 GB and was not released. This is still the case on master. When RAM is exhausted, for some reason the program does not crash; swap takes over and nothing responds anymore... This is a critical issue that will hinder production deployments of RAGFlow.

ODAncona avatar Dec 20 '24 21:12 ODAncona

There is a workaround that temporarily mitigates the memory overflow during document embedding, although the leak itself remains. Set the environment variable TRACE_MALLOC_DELTA to 1, or change the default directly in the code: TRACE_MALLOC_DELTA = int(os.environ.get('TRACE_MALLOC_DELTA', "1")). With this setting, memory no longer grows without bound while documents are being embedded, but the memory that has already been used is still not released after the embedding task completes.
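For reference, a variable like this typically toggles tracemalloc-based delta logging in the task executor. A minimal, standard-library-only sketch of that snapshot-diff pattern (not RAGFlow's actual code), useful for seeing which call sites keep accumulating allocations between documents:

```python
import tracemalloc

tracemalloc.start()
baseline = tracemalloc.take_snapshot()

# ... parse/embed one document here ...

current = tracemalloc.take_snapshot()
for stat in current.compare_to(baseline, "lineno")[:10]:
    # the largest positive deltas point at allocations retained since the baseline
    print(stat)
baseline = current  # roll the baseline forward so the next report shows only new growth
```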

danny-zhu avatar Dec 24 '24 07:12 danny-zhu

Any updates on this? Thanks! The issue persists even when using an external embedding model, so I guess it is within the OCR or other steps?

stan-wang-analycia avatar Jan 15 '25 02:01 stan-wang-analycia

I also encountered this problem. Parsing a 10-page PDF took up about 16 GB of memory, after which the system became unresponsive and the container kept restarting.

TraceIvan avatar Jan 23 '25 09:01 TraceIvan

Could you test this with and without an OCR text layer embedded in the file? Is it PDF only? Does it also occur when using deepdoc (standalone) only?
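A quick way to tell whether a given PDF already carries a text layer (so OCR would not be needed) is to check whether text can be extracted directly. A minimal sketch, assuming PyMuPDF is installed; this only classifies the input file, it does not exercise deepdoc itself:

```python
import fitz  # PyMuPDF; assumption: installed in the test environment


def has_text_layer(path: str, min_chars: int = 50) -> bool:
    """Return True if the PDF already contains extractable text on any page."""
    with fitz.open(path) as doc:
        return any(len(page.get_text().strip()) >= min_chars for page in doc)


print(has_text_layer("sample.pdf"))  # hypothetical example path
```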

Snify89 avatar Jan 23 '25 12:01 Snify89

Same here. The memory is not released after embedding completes, even when using embedding API services.

chminsc avatar Feb 14 '25 07:02 chminsc

Memory is still not released on version 0.15.0. Please help.

luongphambao avatar Feb 24 '25 04:02 luongphambao