ragflow
[Question]: Why does the /api/v1/chats/{chat_id}/completions endpoint respond slowly to concurrent requests?
Describe your problem
- The response time is about 50 s when there is only one request.
- With 10 concurrent requests, the last response takes about 3 min 40 s (a minimal benchmark sketch follows this list).
- Is this caused by the ragflow service itself, or by the LLM handling concurrent requests poorly?
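For reference, this is roughly how the latencies were measured. It is only a sketch: the base URL, chat_id, and API key are placeholders, and I am assuming the endpoint accepts a JSON body with a `question` field and `stream: false`.

```python
import time
import requests
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://localhost:9380"   # placeholder ragflow address
CHAT_ID = "your_chat_id"             # placeholder
API_KEY = "your_api_key"             # placeholder
CONCURRENCY = 10

def ask(i: int) -> float:
    """Send one completion request and return its wall-clock latency in seconds."""
    start = time.time()
    resp = requests.post(
        f"{BASE_URL}/api/v1/chats/{CHAT_ID}/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        # Assumed request body; adjust to match the ragflow HTTP API you are on.
        json={"question": f"test question {i}", "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    return time.time() - start

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = list(pool.map(ask, range(CONCURRENCY)))

print("per-request latency (s):", [round(t, 1) for t in latencies])
print("slowest request (s):", round(max(latencies), 1))
```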
You can click the little lamp icon in the UI to check the time elapsed.
I checked; most of the time is spent generating answers, so it is definitely an LLM problem. I am running the deepseek-r1:70b model with Ollama on 8x RTX 4090 (24 GB) GPUs, and the utilization of each GPU stays below 20%. I looked at the Ollama community and saw people reporting similar problems, but there seems to be no good solution. Do you have any ideas on how to increase utilization across multiple GPUs? @KevinHuSh
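One way to rule ragflow out completely is to send the same concurrent load straight to Ollama's /api/generate endpoint and see whether the per-request latency degrades the same way. The sketch below assumes the default Ollama address and a non-streaming request; host, prompt, and timing details are placeholders. If the direct requests serialize too, the bottleneck is Ollama's request scheduling (as far as I understand, OLLAMA_NUM_PARALLEL on the Ollama server controls how many requests a loaded model serves at once) rather than anything in ragflow.

```python
import time
import requests
from concurrent.futures import ThreadPoolExecutor

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama address (assumed)
MODEL = "deepseek-r1:70b"
CONCURRENCY = 10

def generate(i: int) -> float:
    """Send one generation request directly to Ollama and return its latency in seconds."""
    start = time.time()
    resp = requests.post(
        OLLAMA_URL,
        json={"model": MODEL, "prompt": f"test prompt {i}", "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    return time.time() - start

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = list(pool.map(generate, range(CONCURRENCY)))

print("per-request latency (s):", [round(t, 1) for t in latencies])
```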
No clue yet.