[Bug]: PDF parsing leads to the task being killed

Open liugddx opened this issue 7 months ago • 8 comments

Self Checks

  • [x] I have searched for existing issues, including closed ones.
  • [x] I confirm that I am using English to submit this report (Language Policy).
  • [x] Non-English title submissions will be closed directly (Language Policy).
  • [x] Please do not modify this template :) and fill in all the required fields.

RAGFlow workspace code commit ID

RAGFlow image version

ragflow:v0.17.2-slim

Other environment information

64 GB RAM, 32 CPU cores
CentOS 7

Actual behavior

PDF parsing caused the task to be killed. The error message is:

./entrypoint.sh: line 135: 47 Killed LD_PRELOAD="$JEMALLOC_PATH" "$PY" rag/svr/task_executor.py "${host_id}_${consumer_id}"

MEM_LIMIT=32GB
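
A quick way to confirm that the kernel OOM killer terminated the process is to check the kernel log; a minimal sketch (the exact wording varies by kernel version):

    dmesg -T | grep -i -E "killed process|out of memory"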

Expected behavior

No response

Steps to reproduce

Concurrent PDF parsing.

Additional information

No response

liugddx avatar May 12 '25 02:05 liugddx

Please feel free to let me know if you have any questions

liugddx avatar May 12 '25 02:05 liugddx

The issue with PDF parsing in RAGFlow version ragflow:v0.17.2-slim causing the task to be killed is likely due to insufficient RAM. The error message related to entrypoint.sh and task_executor.py suggests that the parsing process is being terminated because it exceeds the available memory. To resolve this, you should increase your memory allocation by adjusting the MEM_LIMIT value in docker/.env and then restart your RAGFlow server for the changes to take effect [1].
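
For reference, a minimal sketch of that change, assuming the default docker/ layout; the 64GB value is illustrative, not a recommendation:

    # docker/.env
    MEM_LIMIT=64GB

    # Restart so the new limit takes effect
    docker compose -f docker/docker-compose.yml down
    docker compose -f docker/docker-compose.yml up -d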

dosubot[bot] avatar May 12 '25 02:05 dosubot[bot]

I guess it's a memory shortage. But it will restart itself and redo failed tasks.

KevinHuSh avatar May 12 '25 02:05 KevinHuSh

I guess it's a memory shortage. But it will restart itself and redo failed tasks.

How can I reduce concurrent parsing when parsing more than 10,000 documents in one batch? Set MAX_CONCURRENT_TASKS=1 and MAX_CONCURRENT_CHUNK_BUILDERS=1.
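
For reference, a sketch of where those would go, assuming the task executor reads them as environment variables from docker/.env (the variable names are taken from the comment above):

    # docker/.env
    MAX_CONCURRENT_TASKS=1
    MAX_CONCURRENT_CHUNK_BUILDERS=1

    # Recreate the containers so the executor picks the values up
    docker compose -f docker/docker-compose.yml up -d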

liugddx avatar May 12 '25 03:05 liugddx

I guess it's a memory shortage. But it will restart itself and redo failed tasks.

How can I reduce concurrent parsing when parsing more than 10,000 documents in one batch? Set MAX_CONCURRENT_TASKS=1 and MAX_CONCURRENT_CHUNK_BUILDERS=1.

Just tried, and it doesn't seem to work with these params. :(

S0b3Rr avatar May 14 '25 09:05 S0b3Rr

Hi all, after digging into the source of task_executor.py, I think there is an existing issue with the use of trio. I will soon finish my experiment and submit a fix PR.

S0b3Rr avatar May 17 '25 14:05 S0b3Rr

@KevinHuSh @liugddx Hi, I've created the PR (#7700), which should help fix this issue.

@liugddx

To prevent this issue from recurring, you should consider:

  • Setting vm.max_map_count >= 262144 on your server (see the sketch after this list), and
  • Making sure you have enough RAM on your server. Also remember that DeepDoc parsing consumes additional RAM.
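
A minimal sketch for the first item on a typical Linux host (standard sysctl usage, nothing RAGFlow-specific):

    # Apply immediately
    sudo sysctl -w vm.max_map_count=262144

    # Persist across reboots
    echo "vm.max_map_count=262144" | sudo tee -a /etc/sysctl.conf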

S0b3Rr avatar May 17 '25 15:05 S0b3Rr

Indeed, even without a fix, I can confirm that raising vm.max_map_count helps on v18. I couldn't verify on nightly because of a weird citations error in search/chat.

I don't understand how to set this parameter under WSL 2 on Windows, though; the README should mention it.
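
For what it's worth, a commonly cited way to do this under WSL 2 (an assumption based on general WSL 2 documentation, not something this repo documents) is a kernelCommandLine entry in %UserProfile%\.wslconfig on the Windows side, followed by wsl --shutdown:

    [wsl2]
    kernelCommandLine = sysctl.vm.max_map_count=262144

Alternatively, running sudo sysctl -w vm.max_map_count=262144 inside the WSL distro applies it until the next WSL restart.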

executed avatar May 20 '25 13:05 executed

I noticed that according to the docker-compose files, MEM_LIMIT only applies to the search engine container. Is that intended? Parsing processes also consume a lot of memory, but MEM_LIMIT doesn't seem to affect them.
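
If the intent were to cap the parsing side as well, a hypothetical sketch would be adding a memory limit to the RAGFlow service in docker/docker-compose.yml (the service name "ragflow" and the 32g value are assumptions, not taken from the repo):

    services:
      ragflow:
        mem_limit: 32g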

afedotov89 avatar Jun 24 '25 10:06 afedotov89