DeepSpeed-MII
Benchmark: Performance is lower than vLLM
Test environment: 1 x A100 80G | vllm==0.2.6+cu118 | deepspeed-mii==0.2.0 | Llama-2-7b-chat-hf
Script: https://github.com/microsoft/DeepSpeedExamples/tree/master/benchmarks/inference/mii
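For reference, this is roughly how the same model can be exercised through DeepSpeed-MII outside the benchmark harness; a minimal sketch, where the model path and `max_new_tokens` are illustrative rather than the benchmark's exact configuration:

```python
# Minimal DeepSpeed-MII (FastGen) serving sketch for the same model.
# The model path and generation length are illustrative assumptions.
import mii

client = mii.serve("meta-llama/Llama-2-7b-chat-hf")
response = client.generate(["DeepSpeed is"], max_new_tokens=256)
print(response)
client.terminate_server()
```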
Test Result:
Why is the performance lower than vLLM?
Hi @zhaotyer could you provide some additional information about how you collected these numbers? Are you running the benchmark in our DeepSpeedExamples repo?
If so, are you gathering these numbers directly from the resulting log files?
I just ran the Llama-2-7b model on 1xA6000 GPU with prompt size 256 and generation size 256 for 1, 2, 4, 8, 16, and 32 clients, and I'm seeing roughly equal performance for vLLM and FastGen (DeepSpeed-MII).
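If you want to approximate that client sweep without the full harness, here is a rough sketch. The deployment name, prompt construction, and per-thread clients are assumptions; the actual benchmark scripts in the repo use separate client processes and should be preferred for real measurements:

```python
# Rough sketch of a client-concurrency sweep against a persistent MII
# deployment started elsewhere with mii.serve("meta-llama/Llama-2-7b-chat-hf").
# Prompt length, token counts, and the threading model are assumptions.
import time
from concurrent.futures import ThreadPoolExecutor

import mii

PROMPT = "word " * 256  # crude stand-in for a ~256-token prompt

def one_request(_):
    client = mii.client("meta-llama/Llama-2-7b-chat-hf")
    start = time.time()
    client.generate([PROMPT], max_new_tokens=256)
    return time.time() - start

for n_clients in (1, 2, 4, 8, 16, 32):
    with ThreadPoolExecutor(max_workers=n_clients) as pool:
        latencies = list(pool.map(one_request, range(n_clients)))
    print(f"{n_clients} clients: mean latency {sum(latencies)/len(latencies):.2f}s")
```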
This is expected for the current release. FastGen provides better performance with longer prompts and shorter generation lengths. We go into greater detail on performance and benchmarks in the two FastGen release blogs here and here.
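As an illustration of that regime, a long-prompt / short-generation request looks like the sketch below; the prompt text and token counts are made up for the example:

```python
# Hedged example of the long-prompt / short-generation workload where
# FastGen's Dynamic SplitFuse scheduling is expected to help most.
# The prompt construction and lengths here are illustrative only.
import time

import mii

client = mii.serve("meta-llama/Llama-2-7b-chat-hf")

long_prompt = "DeepSpeed enables efficient inference. " * 200  # long context
start = time.time()
client.generate([long_prompt], max_new_tokens=32)  # short generation
print(f"end-to-end latency: {time.time() - start:.2f}s")

client.terminate_server()
```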