FastChat
Slower throughput with openai_server
Hi,
I deployed the model backend using a FastChat worker. When I tested throughput for the llama3-8b model directly against the worker, it reached >2500 tokens/second on an A100. However, when I served the same model through the OpenAI-compatible endpoint (`openai_api_server`), throughput dropped drastically to <500 tokens/second. I understand there would be some reduction, but not this much. Has anyone else faced the same problem? If so, how did you optimize it?
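For reference, the measurement through the OpenAI-compatible endpoint looked roughly like the sketch below. The model name, host, port, and single-request timing are placeholders for my setup, not an exact benchmark script; the launch commands in the comment are the standard FastChat ones.

```python
# Rough sketch of measuring generation throughput via the OpenAI-compatible
# endpoint. Assumes the server was started with the usual FastChat commands,
# e.g.:
#   python3 -m fastchat.serve.controller
#   python3 -m fastchat.serve.model_worker --model-path <llama3-8b checkpoint>
#   python3 -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 8000
import time

from openai import OpenAI

# FastChat's openai_api_server accepts any API key; "EMPTY" is conventional.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

prompt = "Summarize the history of the Roman Empire in detail."

start = time.time()
response = client.chat.completions.create(
    model="llama3-8b",  # placeholder: whatever name the worker registered
    messages=[{"role": "user", "content": prompt}],
    max_tokens=512,
)
elapsed = time.time() - start

# Use the server-reported completion token count to compute tokens/second.
completion_tokens = response.usage.completion_tokens
print(
    f"{completion_tokens} tokens in {elapsed:.2f}s "
    f"-> {completion_tokens / elapsed:.1f} tokens/s"
)
```

The >2500 tokens/second figure against the worker came from batched/concurrent requests, so a fair comparison through the endpoint also needs concurrent requests rather than a single sequential call like the one above.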