After multiple model workers start working concurrently for the first time, requests will only be received by one of the workers.

Open • PaulX1029 opened this issue 1 year ago • 3 comments

I am using one server.controller to manage 3 model_workers, each placed on its own GPU. I then opened 3 identical server.gradio_web_server instances and entered the same question in each. The first time, all 3 gradio_web_server instances stream output at the same time. After all of the outputs have finished, I send requests to the three gradio_web_server instances simultaneously a second time, and only one model_worker does any work: only one gradio_web_server shows streaming output, and GPU utilization shows only one GPU in use.

What causes this? Has anyone else run into the same problem?
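To narrow this down, it may help to check which worker the controller hands out when several requests arrive at once. The snippet below is a minimal sketch, not FastChat code. It assumes the controller exposes a POST `/get_worker_address` endpoint that takes `{"model": ...}` and returns `{"address": ...}`, and that the web server resolves workers the same way. The helper names `ask_controller` and `tally_workers` are hypothetical.

```python
import json
import urllib.request
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def ask_controller(controller_url, model):
    """Ask the controller which worker should serve `model`.

    Assumption: the controller answers POST /get_worker_address
    with a JSON body of the form {"address": "<worker url>"}.
    """
    req = urllib.request.Request(
        controller_url + "/get_worker_address",
        data=json.dumps({"model": model}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read())["address"]


def tally_workers(ask, n_requests=3):
    """Call `ask()` n_requests times concurrently and count the answers.

    `ask` is any zero-argument callable that returns a worker address,
    so the lookup can be swapped out or faked.
    """
    with ThreadPoolExecutor(max_workers=n_requests) as pool:
        addresses = list(pool.map(lambda _: ask(), range(n_requests)))
    return Counter(addresses)
```

For example, `tally_workers(lambda: ask_controller("http://localhost:21001", "my-model"))` could be run once before the first round of requests and again after it, where the URL and model name are placeholders for your own setup. If the second tally contains a single address, the controller is probably routing every request to the same worker. If it still contains three addresses, the cause is more likely somewhere other than the controller's routing.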

PaulX1029, Aug 19 '24 09:08