[Question]: How can I improve the concurrency performance of the ragflow_stream_output api?

Open Weishaoya opened this issue 1 year ago • 4 comments

Describe your problem

For the ragflow_stream_output API, when I set the number of concurrent requests to 1, 10, and 100, the first-token latency was 0.6719 s, 4.7593 s, and 41.9158 s, respectively. Because of the retrieval step, the concurrency performance of the ragflow_stream_output API is weak, which makes it unsuitable for large-scale applications. How can I improve the concurrency performance of the ragflow_stream_output API?
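For readers who want to reproduce this kind of measurement, here is a minimal sketch of timing first-token latency under concurrency. It is not the original poster's test script. `fake_stream` is a hypothetical stand-in for the real streaming request and would need to be replaced with an actual async HTTP call to the API.

```python
import asyncio
import time
from typing import AsyncGenerator, Callable, List

StreamFactory = Callable[[], AsyncGenerator[str, None]]


async def first_token_latency(open_stream: StreamFactory) -> float:
    """Return seconds from request start until the first streamed chunk arrives."""
    start = time.perf_counter()
    stream = open_stream()
    try:
        async for _chunk in stream:
            return time.perf_counter() - start
        raise RuntimeError("stream ended without producing any chunk")
    finally:
        # Close the generator explicitly since we stop after the first chunk.
        await stream.aclose()


async def measure(open_stream: StreamFactory, concurrency: int) -> List[float]:
    """Start `concurrency` streams at once and collect each first-token latency."""
    tasks = [first_token_latency(open_stream) for _ in range(concurrency)]
    return list(await asyncio.gather(*tasks))


async def fake_stream() -> AsyncGenerator[str, None]:
    """Hypothetical stand-in for the real streaming call (assumption, not RAGFlow code)."""
    await asyncio.sleep(0.05)  # simulated retrieval + model latency
    yield "first"
    yield "second"


if __name__ == "__main__":
    for n in (1, 10, 100):
        latencies = asyncio.run(measure(fake_stream, n))
        mean = sum(latencies) / len(latencies)
        print(f"concurrency={n} mean={mean:.4f}s max={max(latencies):.4f}s")
```

Reporting the maximum (or a high percentile) alongside the mean is useful here, because queueing behind a single server process tends to show up in the slowest requests first.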

Weishaoya avatar Nov 25 '24 16:11 Weishaoya

Change the run_simple function in api/ragflow_server.py.

KevinHuSh avatar Nov 26 '24 01:11 KevinHuSh

> Change the run_simple function in api/ragflow_server.py.

I replaced the run_simple function with Gunicorn. The concurrency performance of ragflow_stream_output improves when I set workers to 10, but the embedding model is then loaded 10 times on gpu-0. Can you suggest a better way to improve the concurrency performance? Thank you!
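One possible mitigation, sketched below as a Gunicorn config file, is to run fewer worker processes with several threads each, so that fewer copies of the model are loaded, and to pin each worker to a GPU before the app is imported. This is only a sketch under assumptions: it is not RAGFlow's shipped configuration, `NUM_GPUS` and `gpu_for_worker` are hypothetical names, the worker and thread counts are placeholders, and the GPU pinning only takes effect if CUDA has not been initialized before the fork (for example, when the app is not preloaded in the master process).

```python
# gunicorn.conf.py -- illustrative sketch, not RAGFlow's own configuration.
import os

# Hypothetical setting: number of GPUs on the host (assumption, adjust as needed).
NUM_GPUS = int(os.environ.get("NUM_GPUS", "1"))

# Fewer processes means fewer copies of the embedding model in GPU memory.
workers = 2
# The threaded worker lets one process serve several streaming responses at once.
worker_class = "gthread"
threads = 16


def gpu_for_worker(worker_age: int, num_gpus: int) -> str:
    """Map a worker's spawn counter to a GPU index, round-robin."""
    return str(worker_age % num_gpus)


def post_fork(server, worker):
    """Gunicorn server hook: runs in each worker process right after the fork."""
    # Restrict this worker to one GPU before any CUDA library is initialized.
    os.environ["CUDA_VISIBLE_DEVICES"] = gpu_for_worker(worker.age, NUM_GPUS)
```

With a single GPU the pinning changes nothing, and only the lower worker count reduces the number of model copies; whether threads give enough throughput depends on how much of each request is spent waiting on I/O.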

Weishaoya avatar Nov 26 '24 16:11 Weishaoya

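> Change the run_simple function in api/ragflow_server.py.
> I replaced the run_simple function with Gunicorn. The concurrency performance of ragflow_stream_output improves when I set workers to 10, but the embedding model is then loaded 10 times on gpu-0. Can you suggest a better way to improve the concurrency performance? Thank you!

How did you test it? Could you share the code? Thanks.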

Mr-greenplus avatar Mar 10 '25 01:03 Mr-greenplus

> Change the run_simple function in api/ragflow_server.py.
> I replaced the run_simple function with Gunicorn. The concurrency performance of ragflow_stream_output improves when I set workers to 10, but the embedding model is then loaded 10 times on gpu-0. Can you suggest a better way to improve the concurrency performance? Thank you!

We solved it by serving the embedding model from a separate service with vLLM. Changing the embedding model loading logic, so that the model is taken out of the app and run as a singleton, might also work, but apparently it takes more time to code XD
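To illustrate the separate-service approach, here is a minimal client sketch, assuming the embedding model is served behind an OpenAI-compatible `/v1/embeddings` endpoint (which vLLM's server can expose for embedding models). The URL and model name are placeholders, the function names are hypothetical, and this is not the commenter's actual code. Every web worker calls the one service, so the model is loaded on the GPU only once.

```python
import json
import urllib.request
from typing import List

# Placeholders (assumptions): point these at your own embedding service.
EMBEDDING_URL = "http://localhost:8000/v1/embeddings"
EMBEDDING_MODEL = "my-embedding-model"


def build_payload(texts: List[str], model: str = EMBEDDING_MODEL) -> bytes:
    """Encode an OpenAI-style embeddings request body."""
    return json.dumps({"model": model, "input": texts}).encode("utf-8")


def parse_embeddings(body: bytes) -> List[List[float]]:
    """Extract vectors from an OpenAI-style response, in input order."""
    data = json.loads(body)["data"]
    # Sort by "index" so vectors line up with the order of the input texts.
    return [item["embedding"] for item in sorted(data, key=lambda d: d["index"])]


def embed(texts: List[str], url: str = EMBEDDING_URL, timeout: float = 30.0) -> List[List[float]]:
    """Send texts to the shared embedding service and return their vectors."""
    request = urllib.request.Request(
        url,
        data=build_payload(texts),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_embeddings(response.read())
```

Keeping the payload building and response parsing separate from the HTTP call makes the client easy to test without a running service.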

wizounovziki avatar May 21 '25 07:05 wizounovziki