DeepSpeed-MII
0.6 req/s is kinda low, for real?
We have one A100 that can support 2 concurrent requests with a throughput of about 10 tokens/s, using just the KV-cache technique. Your configuration with 4×A100 achieving only 0.6 req/s under vLLM seems way too low. I find it hard to believe.
Can you give more details about the model's architecture and size, the way you benchmarked it, and the environment?
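For comparing numbers like these, it helps to agree on how req/s and tokens/s are measured. Below is a minimal sketch of one way to benchmark concurrent request throughput in Python; the `fake_generate` function is a stand-in for a real inference call (its latency and token count are made-up values, not measurements from any actual model or server):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def measure_throughput(generate, prompts, max_workers=2):
    """Send prompts concurrently; return (requests/s, tokens/s)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(generate, prompts))
    elapsed = time.perf_counter() - start
    total_tokens = sum(len(r) for r in results)
    return len(prompts) / elapsed, total_tokens / elapsed

# Dummy "model" standing in for a real inference endpoint
# (replace with an actual client call when benchmarking):
def fake_generate(prompt):
    time.sleep(0.01)   # simulated generation latency
    return [0] * 16    # pretend the model produced 16 tokens

req_s, tok_s = measure_throughput(fake_generate, ["hi"] * 8, max_workers=2)
print(f"{req_s:.1f} req/s, {tok_s:.1f} tokens/s")
```

Note that both metrics depend heavily on concurrency level, prompt length, and output length, so quoting them without those details makes cross-setup comparisons hard.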
@chuangzhidan please try with the latest main branch. I have made improvements that allow the RESTful API to match the performance of our Python API (see #328).