[Usage]: Why is speculative decoding slower than normal decoding?
Your current environment
The startup commands are below: one launches a standard 7B model, the other launches the same model with n-gram speculative decoding. Speed tests show that the speculative setup decodes more slowly.
```bash
CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 9000 --model Qwen2-7B-Instruct -tp 1 --gpu_memory_utilization 0.9

CUDA_VISIBLE_DEVICES=3 python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 9002 --model Qwen2-7B-Instruct -tp 1 --speculative_model [ngram] --use-v2-block-manager --num_speculative_tokens 5 --ngram-prompt-lookup-max 4 --gpu_memory_utilization 0.9
```
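For reference, here is a minimal client sketch of how numbers like the ones below can be measured (an illustrative assumption, not the script actually used): it streams from the OpenAI-compatible `/v1/completions` endpoint of the two servers and approximates the token count by the number of streamed chunks. The prompt, `max_tokens`, and the `benchmark` helper are invented for illustration.

```python
# Hypothetical benchmark sketch (not from the original report): measures time-to-first-token
# and decode speed against the servers started above, assuming they expose the
# OpenAI-compatible streaming /v1/completions endpoint on ports 9000 and 9002.
import json
import time

import requests


def benchmark(base_url: str, model: str, prompt: str, max_tokens: int = 1000) -> None:
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,
        "stream": True,
    }
    start = time.time()
    first_token_time = None
    n_chunks = 0
    with requests.post(f"{base_url}/v1/completions", json=payload, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            # Server-sent events: each data line is prefixed with "data: ".
            if not line or not line.startswith(b"data: "):
                continue
            data = line[len(b"data: "):]
            if data == b"[DONE]":
                break
            chunk = json.loads(data)
            if chunk["choices"][0].get("text"):
                if first_token_time is None:
                    first_token_time = time.time() - start
                n_chunks += 1  # roughly one token per streamed chunk
    total = time.time() - start
    decode_time = total - (first_token_time or 0.0)
    print(f"first token: {first_token_time:.4f}s")
    print(f"decode time: {decode_time:.4f}s")
    print(f"output chunks: {n_chunks}")
    print(f"decode speed: {n_chunks / decode_time:.2f} token/s")


# Hypothetical prompt; the real test prompt was not included in the report.
benchmark("http://localhost:9000", "Qwen2-7B-Instruct", "Write a long story about a robot.")
benchmark("http://localhost:9002", "Qwen2-7B-Instruct", "Write a long story about a robot.")
```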
Results

| Configuration | First token (s) | Decode time (s) | Output tokens | Decode speed (token/s) |
| --- | --- | --- | --- | --- |
| 7B (no speculation) | 0.04074668884277344 | 14.328832149505615 | 1000 | 69.78935823702163 |
| 7B + n-gram speculation | 0.02350592613220215 | 15.324904918670654 | 947 | 61.794836902788866 |
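One way to reason about this result (a rough back-of-the-envelope model, not a confirmed diagnosis of vLLM's behavior): with `--num_speculative_tokens 5` the target model must score the proposed tokens at every step, so when the n-gram proposer's acceptance rate is low, the verification overhead outweighs the tokens it saves. The acceptance rate and per-step overhead below are assumed numbers, not measurements:

```python
# Rough throughput model for draft-and-verify speculative decoding
# (illustrative assumptions only, not vLLM internals).
# k        : proposed tokens per step
# alpha    : assumed per-token acceptance probability of the n-gram proposal
# overhead : assumed relative cost of one scoring step vs. one normal decode step
def relative_speed(k: int, alpha: float, overhead: float) -> float:
    # Expected tokens emitted per step: accepted prefix plus one bonus token,
    # i.e. 1 + alpha + alpha^2 + ... + alpha^k.
    expected_tokens = sum(alpha ** i for i in range(k + 1))
    return expected_tokens / overhead

# With k=5 and a low acceptance rate, speculation loses throughput:
print(relative_speed(k=5, alpha=0.1, overhead=1.6))  # ~0.69x of normal decode
# With a high acceptance rate it would win:
print(relative_speed(k=5, alpha=0.7, overhead=1.6))  # ~1.8x of normal decode
```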
Before submitting a new issue...
- [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.