
[Feature]: AssertionError: Speculative decoding not yet supported for RayGPU backend.

Open · cocoza4 opened this issue 10 months ago · 1 comment

🚀 The feature, motivation and pitch

Hi,

Do you have any workaround for the "Speculative decoding not yet supported for RayGPU backend." error, or any idea when the RayGPU backend will support speculative decoding?

I run the vLLM server with the following command:

python3 -u -m vllm.entrypoints.openai.api_server \
       --host 0.0.0.0 \
       --model casperhansen/mixtral-instruct-awq \
       --tensor-parallel-size 4 \
       --enforce-eager \
       --quantization awq \
       --gpu-memory-utilization 0.96 \
       --kv-cache-dtype fp8 \
       --speculative-model mistralai/Mistral-7B-Instruct-v0.2 \
       --num-speculative-tokens 3 \
       --use-v2-block-manager \
       --num-lookahead-slots 5

However, I got AssertionError: Speculative decoding not yet supported for RayGPU backend.
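
For reference, the error seems tied to multi-GPU serving: setting --tensor-parallel-size above 1 is what routes the engine through the Ray GPU executor. A possible stopgap, assuming a single GPU has enough memory for both the target and draft models (an untested sketch), is to drop that flag so the plain GPU executor is used instead:

python3 -u -m vllm.entrypoints.openai.api_server \
       --host 0.0.0.0 \
       --model casperhansen/mixtral-instruct-awq \
       --enforce-eager \
       --quantization awq \
       --gpu-memory-utilization 0.96 \
       --kv-cache-dtype fp8 \
       --speculative-model mistralai/Mistral-7B-Instruct-v0.2 \
       --num-speculative-tokens 3 \
       --use-v2-block-manager \
       --num-lookahead-slots 5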

Alternatives

No response

Additional context

No response

cocoza4 · Apr 25 '24

I am having the same issue

python -m vllm.entrypoints.openai.api_server \
       --model /home/llama3_70B_awq \
       --port 8000 \
       --tensor-parallel-size 2 \
       --gpu-memory-utilization 0.95 \
       --kv-cache-dtype fp8 \
       --max-num-seqs 32 \
       --speculative-model /home/llama3_8B_gptq \
       --num-speculative-tokens 3 \
       --use-v2-block-manager
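
Note that both failing commands set --tensor-parallel-size above 1, which is what selects the Ray GPU executor in vLLM releases from this period. Conceptually the engine setup hits a guard shaped like the sketch below; this is a simplified, hypothetical reconstruction, not the actual vLLM source, and the function name is illustrative:

from typing import Optional

def check_spec_decode_support(tensor_parallel_size: int,
                              speculative_model: Optional[str]) -> None:
    # Hypothetical simplification: tensor parallelism > 1 implies the
    # Ray-based executor, which rejects speculative decoding here.
    uses_ray_backend = tensor_parallel_size > 1
    if uses_ray_backend:
        assert speculative_model is None, (
            "Speculative decoding not yet supported for RayGPU backend.")

# Both commands in this thread would trip the guard:
check_spec_decode_support(tensor_parallel_size=2,
                          speculative_model="/home/llama3_8B_gptq")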

psych0v0yager · May 04 '24

Running into this as well

jamestwhedbee · May 07 '24

Running into this as well

bkchang · May 10 '24

Running into this as well

YuCheng-Qi · May 10 '24

Running into this as well

MRKINKI · May 20 '24

This issue should have been resolved by #4840
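
Once that change ships in a release, upgrading and re-running the original multi-GPU command should be enough. As a hedged example (I have not checked which release first includes the fix):

pip install --upgrade vllm
python3 -u -m vllm.entrypoints.openai.api_server \
       --host 0.0.0.0 \
       --model casperhansen/mixtral-instruct-awq \
       --tensor-parallel-size 4 \
       --quantization awq \
       --speculative-model mistralai/Mistral-7B-Instruct-v0.2 \
       --num-speculative-tokens 3 \
       --use-v2-block-manager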

bkchang · May 20 '24