DeepSeek-V2
How to deploy with vLLM?
Thank you for your interest in our work. We are aware of the challenges in implementing KV compression in the current open-source code and are actively working on it. The HuggingFace code is not as efficient as we would like, so we are developing a new open-source implementation based on vLLM for better performance. The vLLM code, including KV compression, will be released once it is ready.
Can it support quantized deployment, e.g. GPTQ or AWQ?
Hi, we have added vLLM support in this PR: https://github.com/vllm-project/vllm/pull/4650
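For anyone trying this out, here is a minimal offline-inference sketch with the vLLM Python API, assuming a build that contains that PR is installed; the tensor-parallel degree, max_model_len, and sampling settings below are assumptions to adjust for your own hardware:

from vllm import LLM, SamplingParams

# Assumed setup: 8-way tensor parallelism (e.g. 8x80G GPUs) and a reduced
# max_model_len to keep the KV cache within memory; tune both for your machine.
llm = LLM(
    model="deepseek-ai/DeepSeek-V2",
    tensor_parallel_size=8,
    max_model_len=8192,
    trust_remote_code=True,
    enforce_eager=True,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain KV-cache compression in one paragraph."], sampling_params)
for out in outputs:
    print(out.outputs[0].text)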
Thank you for your great work. According to your documentation, the actual deployment on an 8*H800 machine achieves an input throughput of more than 100,000 tokens/s and an output throughput of more than 50,000 tokens/s. Can we reach that level of performance with this vLLM support?
Hi, thank you for your great work! I would like to know how much VRAM is needed. I tried with 8*40G but failed with OOM.
8x80G; 8*40G only works for a 4-bit model.
got it, thank you~
4-bit model? We don't get it.
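For context, the 4-bit model here presumably means a weight-quantized checkpoint (e.g. GPTQ or AWQ, as asked above), whose weights take roughly a quarter of the memory of BF16 weights. A hypothetical sketch of loading such a checkpoint with vLLM, assuming a quantized DeepSeek-V2 checkpoint exists and is supported by your vLLM build (the repository name below is made up for illustration):

from vllm import LLM

# Hypothetical AWQ-quantized checkpoint; no official 4-bit DeepSeek-V2 release
# is implied here. Quantized weights are what make 8*40G potentially sufficient.
llm = LLM(
    model="someorg/DeepSeek-V2-AWQ",  # assumed name, not an official checkpoint
    quantization="awq",               # vLLM also accepts "gptq"
    tensor_parallel_size=8,
    trust_remote_code=True,
)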
I failed to use vLLM 0.4.2 for inference; it reported the following error:
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
  File "/data0/zhenglin/src/asr-anlp-autovision-model3/src/local_inference/deepseek_infer.py", line 8, in <module>
    llm = LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True, enforce_eager=True)
  File "/data0/zhenglin/.local/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 123, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/data0/zhenglin/.local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 272, in from_engine_args
    engine_config = engine_args.create_engine_config()
  File "/data0/zhenglin/.local/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 544, in create_engine_config
    speculative_config = SpeculativeConfig.maybe_create_spec_config(
TypeError: SpeculativeConfig.maybe_create_spec_config() missing 1 required positional argument: 'speculative_disable_by_batch_size'
Same problem
Solved by checking the engine/arg_utils.py file.
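One plausible reading of that TypeError is a mismatch between the installed vLLM files: arg_utils.py calls SpeculativeConfig.maybe_create_spec_config() without the speculative_disable_by_batch_size argument that the installed config code requires. A small diagnostic sketch, assuming the vLLM 0.4.x module layout, to check which files and signature are actually installed:

import inspect

import vllm
from vllm.config import SpeculativeConfig
from vllm.engine.arg_utils import EngineArgs

# Print the installed version, where arg_utils.py is loaded from, and the
# current signature of the method the traceback complains about.
print(vllm.__version__)
print(inspect.getfile(EngineArgs))
print(inspect.signature(SpeculativeConfig.maybe_create_spec_config))

If the signature requires speculative_disable_by_batch_size but the installed arg_utils.py does not pass it, reinstalling a single, consistent vLLM version should clear the error.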
@BasicCoder did you achieve such speed?