[Usage]: Why is speculative decoding slower than normal decoding?
Your current environment
The startup commands are below: one launches a standard 7B model, the other launches the same model with n-gram speculative decoding. Speed tests show that the speculative setup decodes more slowly.
```bash
CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 9000 --model Qwen2-7B-Instruct -tp 1 --gpu_memory_utilization 0.9

CUDA_VISIBLE_DEVICES=3 python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 9002 --model Qwen2-7B-Instruct -tp 1 --speculative_model [ngram] --use-v2-block-manager --num_speculative_tokens 5 --ngram-prompt-lookup-max 4 --gpu_memory_utilization 0.9
```
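For reference, here is a minimal client sketch of how numbers like the ones below can be measured (an illustrative assumption, not the script actually used): it streams from the OpenAI-compatible `/v1/completions` endpoint of the two servers and approximates the token count by the number of streamed chunks. The prompt, `max_tokens`, and the `benchmark` helper are invented for illustration.

```python
# Hypothetical benchmark sketch (not from the original report): measures time-to-first-token
# and decode speed against the servers started above, assuming they expose the
# OpenAI-compatible streaming /v1/completions endpoint on ports 9000 and 9002.
import json
import time

import requests


def benchmark(base_url: str, model: str, prompt: str, max_tokens: int = 1000) -> None:
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,
        "stream": True,
    }
    start = time.time()
    first_token_time = None
    n_chunks = 0
    with requests.post(f"{base_url}/v1/completions", json=payload, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            # Server-sent events: each data line is prefixed with "data: ".
            if not line or not line.startswith(b"data: "):
                continue
            data = line[len(b"data: "):]
            if data == b"[DONE]":
                break
            chunk = json.loads(data)
            if chunk["choices"][0].get("text"):
                if first_token_time is None:
                    first_token_time = time.time() - start
                n_chunks += 1  # roughly one token per streamed chunk
    total = time.time() - start
    decode_time = total - (first_token_time or 0.0)
    print(f"first token: {first_token_time:.4f}s")
    print(f"decode time: {decode_time:.4f}s")
    print(f"output chunks: {n_chunks}")
    print(f"decode speed: {n_chunks / decode_time:.2f} token/s")


# Hypothetical prompt; the real test prompt was not included in the report.
benchmark("http://localhost:9000", "Qwen2-7B-Instruct", "Write a long story about a robot.")
benchmark("http://localhost:9002", "Qwen2-7B-Instruct", "Write a long story about a robot.")
```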
Results

| Configuration | First token (s) | Decode time (s) | Output tokens | Decode speed (token/s) |
| --- | --- | --- | --- | --- |
| 7B (no speculation) | 0.04074668884277344 | 14.328832149505615 | 1000 | 69.78935823702163 |
| 7B + n-gram speculation | 0.02350592613220215 | 15.324904918670654 | 947 | 61.794836902788866 |
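One way to reason about this result (a rough back-of-the-envelope model, not a confirmed diagnosis of vLLM's behavior): with `--num_speculative_tokens 5` the target model must score the proposed tokens at every step, so when the n-gram proposer's acceptance rate is low, the verification overhead outweighs the tokens it saves. The acceptance rate and per-step overhead below are assumed numbers, not measurements:

```python
# Rough throughput model for draft-and-verify speculative decoding
# (illustrative assumptions only, not vLLM internals).
# k        : proposed tokens per step
# alpha    : assumed per-token acceptance probability of the n-gram proposal
# overhead : assumed relative cost of one scoring step vs. one normal decode step
def relative_speed(k: int, alpha: float, overhead: float) -> float:
    # Expected tokens emitted per step: accepted prefix plus one bonus token,
    # i.e. 1 + alpha + alpha^2 + ... + alpha^k.
    expected_tokens = sum(alpha ** i for i in range(k + 1))
    return expected_tokens / overhead

# With k=5 and a low acceptance rate, speculation loses throughput:
print(relative_speed(k=5, alpha=0.1, overhead=1.6))  # ~0.69x of normal decode
# With a high acceptance rate it would win:
print(relative_speed(k=5, alpha=0.7, overhead=1.6))  # ~1.8x of normal decode
```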
Before submitting a new issue...
- [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.