DeepSeek-V2
How to deploy with vLLM?
Thank you for your interest in our work. We are aware of the challenges in implementing KV compression in the current open-source code and are actively working on it. The HuggingFace code is not as efficient as we would like, so we are developing a new open-source implementation based on vLLM for better performance. The vLLM code, including KV compression, will be released once it is ready.
Can it support quantized deployment, e.g. GPTQ or AWQ?
Hi, we have added vLLM support in this PR: https://github.com/vllm-project/vllm/pull/4650
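For anyone trying this out, here is a minimal offline-inference sketch with the vLLM Python API, assuming a build that contains that PR is installed; the tensor-parallel degree, max_model_len, and sampling settings below are assumptions to adjust for your own hardware:

from vllm import LLM, SamplingParams

# Assumed setup: 8-way tensor parallelism (e.g. 8x80G GPUs) and a reduced
# max_model_len to keep the KV cache within memory; tune both for your machine.
llm = LLM(
    model="deepseek-ai/DeepSeek-V2",
    tensor_parallel_size=8,
    max_model_len=8192,
    trust_remote_code=True,
    enforce_eager=True,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain KV-cache compression in one paragraph."], sampling_params)
for out in outputs:
    print(out.outputs[0].text)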
Thank you for your great work. According to your documentation, the actual deployment on an 8*H800 machine achieves an input throughput of more than 100,000 tokens/s and an output throughput of more than 50,000 tokens/s. Can we reach that level of performance with this vLLM support?
Hi, thank you for your great work! I would like to know how much VRAM is needed. I tried with 8*40G but failed with OOM.
8x80G; 8*40G only works for a 4-bit model.
got it, thank you~
4-bit model? We don't get it.
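For context, the 4-bit model here presumably means a weight-quantized checkpoint (e.g. GPTQ or AWQ, as asked above), whose weights take roughly a quarter of the memory of BF16 weights. A hypothetical sketch of loading such a checkpoint with vLLM, assuming a quantized DeepSeek-V2 checkpoint exists and is supported by your vLLM build (the repository name below is made up for illustration):

from vllm import LLM

# Hypothetical AWQ-quantized checkpoint; no official 4-bit DeepSeek-V2 release
# is implied here. Quantized weights are what make 8*40G potentially sufficient.
llm = LLM(
    model="someorg/DeepSeek-V2-AWQ",  # assumed name, not an official checkpoint
    quantization="awq",               # vLLM also accepts "gptq"
    tensor_parallel_size=8,
    trust_remote_code=True,
)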
I failed to use vLLM 0.4.2 for inference; it reported the following error:
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
  File "/data0/zhenglin/src/asr-anlp-autovision-model3/src/local_inference/deepseek_infer.py", line 8, in <module>
    llm = LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True, enforce_eager=True)
  File "/data0/zhenglin/.local/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 123, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/data0/zhenglin/.local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 272, in from_engine_args
    engine_config = engine_args.create_engine_config()
  File "/data0/zhenglin/.local/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 544, in create_engine_config
    speculative_config = SpeculativeConfig.maybe_create_spec_config(
TypeError: SpeculativeConfig.maybe_create_spec_config() missing 1 required positional argument: 'speculative_disable_by_batch_size'
Same problem
Solved by checking the engine/arg_utils.py file.
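One plausible reading of that TypeError is a mismatch between the installed vLLM files: arg_utils.py calls SpeculativeConfig.maybe_create_spec_config() without the speculative_disable_by_batch_size argument that the installed config code requires. A small diagnostic sketch, assuming the vLLM 0.4.x module layout, to check which files and signature are actually installed:

import inspect

import vllm
from vllm.config import SpeculativeConfig
from vllm.engine.arg_utils import EngineArgs

# Print the installed version, where arg_utils.py is loaded from, and the
# current signature of the method the traceback complains about.
print(vllm.__version__)
print(inspect.getfile(EngineArgs))
print(inspect.signature(SpeculativeConfig.maybe_create_spec_config))

If the signature requires speculative_disable_by_batch_size but the installed arg_utils.py does not pass it, reinstalling a single, consistent vLLM version should clear the error.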
@BasicCoder did you achieve such speed?