[Feature]: AssertionError: Speculative decoding not yet supported for RayGPU backend.
🚀 The feature, motivation and pitch
Hi,
Do you have any workaround for the "Speculative decoding not yet supported for RayGPU backend." error, or an idea of when the RayGPU backend will support speculative decoding?
I run the vLLM server with the following command:
python3 -u -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
--model casperhansen/mixtral-instruct-awq \
--tensor-parallel-size 4 \
--enforce-eager \
--quantization awq \
--gpu-memory-utilization 0.96 \
--kv-cache-dtype fp8 \
--speculative-model mistralai/Mistral-7B-Instruct-v0.2 \
--num-speculative-tokens 3 \
--use-v2-block-manager \
--num-lookahead-slots 5
However, I got AssertionError: Speculative decoding not yet supported for RayGPU backend. (A possible single-GPU workaround is sketched below, after this issue body.)
Alternatives
No response
Additional context
No response
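For reference, a minimal sketch of one possible workaround at the time of this report (unverified here): the assertion comes from the Ray GPU backend, which vLLM selects when tensor_parallel_size is greater than 1, so speculative decoding can still be exercised with tensor_parallel_size=1 on the plain GPU executor, assuming the target model fits on a single GPU. The model names below are small placeholders, not the models from this report.

from vllm import LLM, SamplingParams

# Single-GPU engine: with tensor_parallel_size=1 the plain GPU executor is used,
# so the RayGPU assertion is never reached.
llm = LLM(
    model="facebook/opt-6.7b",              # placeholder target model
    speculative_model="facebook/opt-125m",  # placeholder draft model
    num_speculative_tokens=3,
    use_v2_block_manager=True,              # required for speculative decoding at the time
    enforce_eager=True,                     # spec decode did not support CUDA graphs then
    tensor_parallel_size=1,                 # single GPU -> no Ray GPU backend
)

outputs = llm.generate(
    ["The future of AI is"],
    SamplingParams(temperature=0.0, max_tokens=32),
)
print(outputs[0].outputs[0].text)

Note that this only sidesteps the Ray backend; it does not help when the target model itself needs tensor parallelism across multiple GPUs.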
I am having the same issue with the following command:
python -m vllm.entrypoints.openai.api_server --model /home/llama3_70B_awq --port 8000 --tensor-parallel-size 2 --gpu-memory-utilization 0.95 --kv-cache-dtype fp8 --max-num-seqs 32 --speculative-model /home/llama3_8B_gptq --num-speculative-tokens 3 --use-v2-block-manager
running into this as well
This issue should have been resolved by #4840.