
Slower throughput with openai_server

Open · tacacs1101-debug opened this issue 1 year ago · 0 comments

Hi,

I deployed the model backend using a FastChat worker. When I benchmarked the llama3-8b model directly against the worker, it reached >2500 tokens/second on an A100. However, when I started serving the same model through openai_server as an OpenAI-compatible endpoint, throughput dropped drastically to <500 tokens/second. I understand there would be some reduction, but not this much. Has anyone else faced the same problem? If so, how did you optimize it?
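For reference, here is a minimal sketch of how such a throughput measurement against the OpenAI-compatible endpoint might look. The endpoint URL, model name, and prompt are assumptions (FastChat's openai_api_server defaults to port 8000); adjust them to match your deployment.

```python
import time
import requests

# Assumed endpoint of FastChat's openai_api_server; adjust host/port
# and model name to match your own deployment.
API_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "llama3-8b"  # hypothetical name as registered with the worker

payload = {
    "model": MODEL,
    "messages": [{"role": "user", "content": "Write a short story about a robot."}],
    "max_tokens": 512,
    "temperature": 0.0,
}

start = time.time()
resp = requests.post(API_URL, json=payload, timeout=300)
resp.raise_for_status()
elapsed = time.time() - start

# OpenAI-compatible responses report token usage, which lets us
# compute generation throughput for this single request.
completion_tokens = resp.json()["usage"]["completion_tokens"]
print(f"{completion_tokens} tokens in {elapsed:.2f}s "
      f"({completion_tokens / elapsed:.1f} tokens/second)")
```

Note that this sketch measures a single sequential request; a figure like >2500 tokens/second on an A100 likely reflects batched or concurrent generation, so a fair comparison should drive the API server with the same concurrency as the direct worker benchmark.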

tacacs1101-debug · Aug 06 '24 14:08