FastChat
After multiple model workers handle concurrent requests for the first time, later requests are received by only one of the workers.
I am using one server.controller to manage 3 model_worker processes, each placed on its own GPU. I then opened 3 identical server.gradio_web_server instances and entered the same question in each. The first time, all 3 gradio_web_server instances streamed output at the same time. After all of the outputs had finished, I sent requests to all three gradio_web_server instances simultaneously a second time, and only one model_worker did any work (that is, only one gradio_web_server showed streaming output). Checking GPU utilization confirmed that only one GPU was in use. What is the reason for this? Has anyone else run into the same problem?
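To narrow down whether the controller or the web servers are responsible, it may help to ask the controller directly which worker it would pick, several times in a row, and count the answers. The following is a minimal diagnostic sketch, not a fix. It assumes the controller exposes a POST /get_worker_address endpoint that takes a JSON body with a "model" field and returns a JSON object with an "address" field, and that it listens on http://localhost:21001. The model name "vicuna-7b" and the helper names fetch_worker_address and tally_dispatch are hypothetical and used only for illustration.

```python
import json
from collections import Counter
from urllib import request


def fetch_worker_address(controller_url, model_name):
    """Ask the controller which worker address it would dispatch to.

    Assumption: the controller accepts POST /get_worker_address with
    {"model": <name>} and replies with {"address": <worker url>}.
    """
    req = request.Request(
        controller_url + "/get_worker_address",
        data=json.dumps({"model": model_name}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read())["address"]


def tally_dispatch(fetch, n):
    """Call fetch() n times and count how often each address is returned."""
    return Counter(fetch() for _ in range(n))


if __name__ == "__main__":
    # Assumed values: adjust the URL and model name to match your setup.
    controller_url = "http://localhost:21001"
    model_name = "vicuna-7b"  # hypothetical model name
    counts = tally_dispatch(
        lambda: fetch_worker_address(controller_url, model_name), 30
    )
    for address, count in counts.items():
        print(address, count)
```

If one address receives nearly all of the picks while three workers are registered, the imbalance probably originates in the controller's dispatch decision. If the picks are spread evenly, the web servers or the workers themselves are more likely places to look.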