ragflow
[Question]: Why does the /api/v1/chats/{chat_id}/completions endpoint respond slowly to concurrent requests?
Describe your problem
- The response time is about 50 s when there is only one request.
- With 10 concurrent requests, the last response takes about 3 min 40 s (a minimal benchmark sketch follows this list).
- Is this caused by the ragflow service itself, or by the LLM handling concurrent requests poorly?
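For reference, this is roughly how the latencies were measured. It is only a sketch: the base URL, chat_id, and API key are placeholders, and I am assuming the endpoint accepts a JSON body with a `question` field and `stream: false`.

```python
import time
import requests
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://localhost:9380"   # placeholder ragflow address
CHAT_ID = "your_chat_id"             # placeholder
API_KEY = "your_api_key"             # placeholder
CONCURRENCY = 10

def ask(i: int) -> float:
    """Send one completion request and return its wall-clock latency in seconds."""
    start = time.time()
    resp = requests.post(
        f"{BASE_URL}/api/v1/chats/{CHAT_ID}/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        # Assumed request body; adjust to match the ragflow HTTP API you are on.
        json={"question": f"test question {i}", "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    return time.time() - start

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = list(pool.map(ask, range(CONCURRENCY)))

print("per-request latency (s):", [round(t, 1) for t in latencies])
print("slowest request (s):", round(max(latencies), 1))
```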
You can click the little lamp icon in the UI to check the time elapsed.
I checked; most of the time is spent generating answers, so it is definitely an LLM problem. I am running the deepseek-r1:70b model with Ollama on 8x RTX 4090 (24 GB) GPUs, and the utilization of each GPU stays below 20%. I looked at the Ollama community and saw people reporting similar problems, but there seems to be no good solution. Do you have any ideas on how to increase utilization across multiple GPUs? @KevinHuSh
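One way to rule ragflow out completely is to send the same concurrent load straight to Ollama's /api/generate endpoint and see whether the per-request latency degrades the same way. The sketch below assumes the default Ollama address and a non-streaming request; host, prompt, and timing details are placeholders. If the direct requests serialize too, the bottleneck is Ollama's request scheduling (as far as I understand, OLLAMA_NUM_PARALLEL on the Ollama server controls how many requests a loaded model serves at once) rather than anything in ragflow.

```python
import time
import requests
from concurrent.futures import ThreadPoolExecutor

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama address (assumed)
MODEL = "deepseek-r1:70b"
CONCURRENCY = 10

def generate(i: int) -> float:
    """Send one generation request directly to Ollama and return its latency in seconds."""
    start = time.time()
    resp = requests.post(
        OLLAMA_URL,
        json={"model": MODEL, "prompt": f"test prompt {i}", "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    return time.time() - start

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = list(pool.map(generate, range(CONCURRENCY)))

print("per-request latency (s):", [round(t, 1) for t in latencies])
```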
No clue yet.