[Bug]: PDF parsing leads to the task being killed
Self Checks
- [x] I have searched for existing issues, including closed ones.
- [x] I confirm that I am using English to submit this report (Language Policy).
- [x] Non-English title submissions will be closed directly (Language Policy).
- [x] Please do not modify this template :) and fill in all the required fields.
RAGFlow workspace code commit ID
RAGFlow image version
ragflow:v0.17.2-slim
Other environment information
64GB 32C
CentOS 7
Actual behavior
The PDF parsing caused the task to be killed. The error message is:
./entrypoint.sh: line 135: 47 Killed LD_PRELOAD="$JEMALLOC_PATH" "$PY" rag/svr/task_executor.py "${host_id}_${consumer_id}"
MEM_LIMIT=32GB
Expected behavior
No response
Steps to reproduce
Concurrent PDF Parsing.
Additional information
No response
Please feel free to let me know if you have any questions.
The issue with PDF parsing in RAGFlow ragflow:v0.17.2-slim causing the task to be killed is most likely due to insufficient RAM. The error message referencing entrypoint.sh and task_executor.py suggests the parsing process is being terminated because it exceeds the available memory. To resolve this, increase the memory allocation by adjusting the MEM_LIMIT value in docker/.env and then restart your RAGFlow server for the change to take effect [1].
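A minimal sketch of that change, assuming the stock docker/.env and compose layout; the exact file path and the size format accepted by MEM_LIMIT may differ in your checkout, so treat the values below as placeholders:

```bash
# docker/.env -- raise the limit to something your host can actually spare
# (the "32g" suffix format is an assumption; check how MEM_LIMIT is written
# in your own .env, e.g. plain bytes vs. a suffixed size)
MEM_LIMIT=32g

# Restart the stack so the new limit is applied
docker compose -f docker/docker-compose.yml down
docker compose -f docker/docker-compose.yml up -d
```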
I guess it's about memory shortage. But it will restart itself and redo failed tasks.
How to reduce concurrent parsing when parsing more than 10,000 documents at the same time in one batch? Set MAX_CONCURRENT_TASKS=1 and MAX_CONCURRENT_CHUNK_BUILDERS=1.
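For reference, a minimal sketch of how one might set these, assuming the task executor picks them up from the container environment; whether docker/.env is forwarded to that container depends on your compose setup, so you may need to add them under the service's environment: section instead:

```bash
# docker/.env (assumed location) -- throttle the task executor
MAX_CONCURRENT_TASKS=1
MAX_CONCURRENT_CHUNK_BUILDERS=1

# Recreate the containers so the new environment is picked up
docker compose -f docker/docker-compose.yml up -d --force-recreate
```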
Just tried it, and it doesn't seem to work with these params. :(
Hi all, after digging into the source of task_executor.py, I think there is an existing issue with the use of trio. I will soon finish my experiment and submit a fix PR.
@KevinHuSh @liugddx Hi, I've created the PR (#7700), which should help fix this issue.
@liugddx
To prevent this issue from recurring, you should:
- Set vm.max_map_count >= 262144 on your server (see the sketch after this list), and
- Make sure you have enough RAM on your server. Also keep in mind that DeepDoc consumes additional RAM.
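For the first point, the standard way to check and raise the value on a Linux host:

```bash
# Check the current value
sysctl vm.max_map_count

# Raise it for the running kernel (takes effect immediately, lost on reboot)
sudo sysctl -w vm.max_map_count=262144

# Persist the setting across reboots
echo "vm.max_map_count = 262144" | sudo tee -a /etc/sysctl.conf
```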
Indeed, even without the fix I can confirm that raising vm.max_map_count helps on v18. I couldn't verify on nightly because of a weird citation error in search/chat.
I don't understand how to set this parameter under WSL 2 on Windows, though; the README should mention it.
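For what it's worth, one commonly suggested approach under WSL 2 (I haven't verified it here) is to pass the sysctl on the kernel command line via %UserProfile%\.wslconfig and restart WSL:

```bash
# %UserProfile%\.wslconfig on the Windows side (INI syntax):
#   [wsl2]
#   kernelCommandLineArgs = "sysctl.vm.max_map_count=262144"
#
# Then restart WSL from a Windows shell so the new kernel args take effect:
#   wsl --shutdown
#
# Or, temporarily, inside the WSL distro (does not survive a WSL restart):
sudo sysctl -w vm.max_map_count=262144
```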
I noticed that according to the docker-compose files, MEM_LIMIT only applies to the search engine container. Is that intended? Parsing processes also consume a lot of memory, but MEM_LIMIT doesn't seem to affect them.
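A quick way to check that observation against your own checkout (file paths are assumptions; adjust to wherever your compose files live):

```bash
# List every place MEM_LIMIT is referenced in the compose setup;
# if it only appears under the search-engine service, parsing is indeed not capped by it
grep -rn "MEM_LIMIT" docker/.env docker/docker-compose*.yml
```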