
Reproduce inference benchmark mentioned in the paper

Open zhouheyun opened this issue 9 months ago • 4 comments

I have a few questions about the inference efficiency of DeepSeek-V2.

1. The paper states: "In order to efficiently deploy DeepSeek-V2 for service, we first convert its parameters into the precision of FP8."

   Are all storage and computation performed in FP8? Does this harm the performance of the model?

2. The paper states: "On a single node with 8 H800 GPUs, DeepSeek-V2 achieves a generation throughput exceeding 50K tokens per second, which is 5.76 times the maximum generation throughput of DeepSeek 67B. In addition, the prompt input throughput of DeepSeek-V2 exceeds 100K tokens per second."

   Is this throughput achieved with test requests of 128K context length? Can we reproduce it using https://github.com/vllm-project/vllm/pull/4650?

zhouheyun · May 11, 2024
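(For context on question 1: the paper does not describe the conversion procedure, but a minimal, weight-only FP8 cast in PyTorch might look like the sketch below. The helper name, the per-tensor scaling scheme, and the use of `torch.float8_e4m3fn` (available in recent PyTorch releases) are illustrative assumptions, not DeepSeek's deployment code, which would route the FP8 weights into fused FP8 GEMM kernels rather than keeping them as plain buffers.)

```python
import torch
import torch.nn as nn

def quantize_linear_weights_to_fp8(model: nn.Module) -> nn.Module:
    """Illustrative weight-only FP8 cast: store each Linear weight in
    float8_e4m3fn together with a per-tensor scale, so it can later be
    dequantized or fed to an FP8 matmul kernel at inference time."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    for module in model.modules():
        if isinstance(module, nn.Linear):
            w = module.weight.data
            # Per-tensor scale so the largest weight maps to the FP8 max value.
            scale = w.abs().max().clamp(min=1e-12) / fp8_max
            w_fp8 = (w / scale).to(torch.float8_e4m3fn)
            # Keep the FP8 payload and its scale as buffers for illustration;
            # a real deployment would replace the matmul itself.
            module.register_buffer("weight_fp8", w_fp8)
            module.register_buffer("weight_scale", scale)
    return model
```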

Our open-source code (https://github.com/vllm-project/vllm/pull/4650) is not the inference code used in the API platform, so it cannot reach the throughput reported in the paper. @zhouheyun

luofuli · May 14, 2024

> Our open-source code (vllm-project/vllm#4650) is not the inference code used in the API platform, so it cannot reach the throughput reported in the paper. @zhouheyun

What's the average inference context length needed to achieve the throughput claimed in the paper? @luofuli

zhouheyun · May 14, 2024

The average context length is 32K. @zhouheyun

luofuli · May 27, 2024

> vllm-project/vllm#4650

How many tokens/s can this open-source version achieve on 8×H800?

ArtificialZeng · Aug 15, 2024
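(For anyone who wants to measure this themselves, a rough timing loop over vLLM's offline `LLM` API is sketched below. The model id, batch size, and sequence lengths are placeholder assumptions; to approximate the paper's setting the prompts would need to average around 32K tokens, and numbers from this open-source path are not expected to match the API platform figures, per luofuli's comment above.)

```python
import time
from vllm import LLM, SamplingParams

# Assumed setup: 8-way tensor parallelism on a single node; adjust to your hardware.
llm = LLM(model="deepseek-ai/DeepSeek-V2", tensor_parallel_size=8,
          trust_remote_code=True)

# Synthetic batch; real benchmarking would use prompts matching the target context length.
prompts = ["Summarize the history of deep learning."] * 256
params = SamplingParams(max_tokens=256, temperature=0.0)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

# Count only generated (output) tokens to estimate generation throughput.
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"generation throughput: {generated / elapsed:.1f} tokens/s")
```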