DeepSpeed-MII
0.6 req/s is kinda low, for real?
We have one A100 that can support 2 concurrent requests with a throughput of about 10 tokens/s, using just the KV-cache technique. Your configuration with 4×A100 achieving only 0.6 req/s under vLLM seems way too low. I find it hard to believe.
Can you give more details about the model's architecture and size, the way you benchmarked it, and the environment?
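For comparing numbers like these, it helps to agree on how req/s and tokens/s are measured. Below is a minimal sketch of one way to benchmark concurrent request throughput in Python; the `fake_generate` function is a stand-in for a real inference call (its latency and token count are made-up values, not measurements from any actual model or server):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def measure_throughput(generate, prompts, max_workers=2):
    """Send prompts concurrently; return (requests/s, tokens/s)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(generate, prompts))
    elapsed = time.perf_counter() - start
    total_tokens = sum(len(r) for r in results)
    return len(prompts) / elapsed, total_tokens / elapsed

# Dummy "model" standing in for a real inference endpoint
# (replace with an actual client call when benchmarking):
def fake_generate(prompt):
    time.sleep(0.01)   # simulated generation latency
    return [0] * 16    # pretend the model produced 16 tokens

req_s, tok_s = measure_throughput(fake_generate, ["hi"] * 8, max_workers=2)
print(f"{req_s:.1f} req/s, {tok_s:.1f} tokens/s")
```

Note that both metrics depend heavily on concurrency level, prompt length, and output length, so quoting them without those details makes cross-setup comparisons hard.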
@chuangzhidan please try with the latest main branch. I have made improvements that allow the RESTful API to match the performance of our Python API (see #328).