DeepSeek-V2
Reproduce inference benchmark mentioned in the paper
I have a few questions about the inference efficiency of DeepSeek-V2.

1. The paper says: "In order to efficiently deploy DeepSeek-V2 for service, we first convert its parameters into the precision of FP8."
Are storage and computation both performed in FP8? Does this harm the performance of the model? (A rough FP8 conversion sketch follows after these questions.)

2. The paper also states: "On a single node with 8 H800 GPUs, DeepSeek-V2 achieves a generation throughput exceeding 50K tokens per second, which is 5.76 times the maximum generation throughput of DeepSeek 67B. In addition, the prompt input throughput of DeepSeek-V2 exceeds 100K tokens per second."
Is this throughput achieved with test requests of 128K context length? Can we reproduce it using https://github.com/vllm-project/vllm/pull/4650?
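For illustration only, here is a minimal sketch of per-tensor FP8 (E4M3) weight quantization in PyTorch. This is not DeepSeek's deployment pipeline; the scaling scheme, granularity, and whether matmuls actually run in FP8 (rather than weights merely being stored in FP8 and upcast for compute) are assumptions.

```python
# Minimal sketch: per-tensor FP8 (E4M3) weight quantization in PyTorch.
# NOT DeepSeek-V2's actual inference code; it only illustrates the difference
# between storing weights in FP8 and computing in a higher precision.
import torch

FP8_MAX = 448.0  # largest finite value representable in float8_e4m3fn


def quantize_fp8(w: torch.Tensor):
    """Quantize a weight tensor to FP8 with a single per-tensor scale."""
    scale = w.abs().max().clamp(min=1e-12) / FP8_MAX
    w_fp8 = (w / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return w_fp8, scale  # w_fp8 uses 1 byte per element


def dequantize_fp8(w_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Upcast back to bf16 for computation (weight-only FP8 storage)."""
    return w_fp8.to(torch.bfloat16) * scale


if __name__ == "__main__":
    w = torch.randn(4096, 4096, dtype=torch.bfloat16)
    w_fp8, scale = quantize_fp8(w)
    w_rec = dequantize_fp8(w_fp8, scale)
    err = (w.float() - w_rec.float()).abs().mean().item()
    print(f"mean abs quantization error: {err:.6f}")
```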
Our open-source code (https://github.com/vllm-project/vllm/pull/4650) is not the inference code used in the API platform, so it cannot achieve the throughput speed mentioned in the paper. @zhouheyun
What is the average inference context length used to achieve the claimed throughput in the paper? @luofuli
32K context length @zhouheyun
vllm-project/vllm#4650
How many tokens/s can this open-source version achieve on 8×H800?
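As a rough way to measure this yourself, here is a minimal throughput-measurement sketch using the vLLM Python API on top of that PR branch. The model id, tensor-parallel size, request count, prompt, and output length below are assumptions, and the resulting tokens/s depends heavily on batch size and context length.

```python
# Minimal sketch: measuring generation throughput with the vLLM Python API.
# Assumes the vLLM build from the PR above is installed and 8 GPUs are visible.
# Model id, request count, and lengths are placeholders, not the settings
# behind the numbers reported in the paper.
import time

from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V2",  # assumed HF model id
    tensor_parallel_size=8,           # one node with 8 GPUs
    trust_remote_code=True,
)

# 256 identical short prompts; a real benchmark should vary prompt/context length.
prompts = ["Explain multi-head latent attention in one paragraph."] * 256
params = SamplingParams(temperature=0.0, max_tokens=256)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"generated {generated} tokens in {elapsed:.1f}s "
      f"({generated / elapsed:.0f} tokens/s)")
```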