[Bug]: vllm 0.6.3 generates incomplete/repeated answers for long (over 8k token) inputs
Your current environment
vllm 0.6.3
Model Input Dumps
The input is a long context with over 8k tokens.
🐛 Describe the bug
- vllm 0.6.2 does not have this bug.
- We are running vllm 0.6.3 with speculative decoding. When we feed a long context (over 8k tokens) into the model, the output is truncated and we get incomplete answers. The command we are using is
python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8083 --model /home/downloaded_model/Llama-3.2-3B-Instruct/ --speculative_model /home/downloaded_model/Llama-3.2-1B-Instruct/ --served-model-name LLM --tensor-parallel-size 8 --max-model-len 34336 --max-num-seqs 128 --enable-prefix-caching --disable-log-requests --use-v2-block-manager --seed 42 --num_speculative_tokens 5 --gpu_memory_utilization 0.95 --spec-decoding-acceptance-method typical_acceptance_sampler
- We then run vllm 0.6.3 without speculative decoding, but we still get incomplete or repeated answers. The command we use is
python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8083 --model /home/downloaded_model/Llama-3.2-3B-Instruct/ --served-model-name LLM --tensor-parallel-size 8 --max-model-len 34336 --max-num-seqs 128 --enable-prefix-caching --enable_chunked_prefill --disable-log-requests --seed 42 --gpu_memory_utilization 0.95
- How we call the vLLM model is shown below; a sample long-prompt invocation follows the function.
import openai

def call_vllm_api(message_log):
    # API_KEY and BASE_URL point at the vLLM OpenAI-compatible server started above.
    vllm_client = openai.OpenAI(api_key=API_KEY, base_url=BASE_URL)
    response = vllm_client.chat.completions.create(
        model="LLM",
        messages=message_log,
        max_tokens=4096,
        temperature=0.2,
        presence_penalty=0,
        frequency_penalty=0,
    )
    response_content = response.choices[0].message.content
    return response_content
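For reference, here is roughly how we build a request that triggers the problem (the filler text, the summarization question, and the placeholder API_KEY / BASE_URL values below are illustrative, not our actual data):

API_KEY = "EMPTY"  # the vLLM OpenAI-compatible server does not require a real key by default
BASE_URL = "http://localhost:8083/v1"  # matches --port 8083 in the launch command above

# Repeat filler text so the prompt comfortably exceeds 8k tokens.
long_context = "This is filler text used to pad the prompt. " * 2000

message_log = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": long_context + "\n\nSummarize the text above."},
]

print(call_vllm_api(message_log))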
As mentioned in issue #9417, it works with --enforce-eager
+1
I'm having the same issue. Running with --enforce-eager fixes this issue for now.
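For example, appending the flag to the non-speculative launch command from the report (all other arguments unchanged):

python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8083 --model /home/downloaded_model/Llama-3.2-3B-Instruct/ --served-model-name LLM --tensor-parallel-size 8 --max-model-len 34336 --max-num-seqs 128 --enable-prefix-caching --enable_chunked_prefill --disable-log-requests --seed 42 --gpu_memory_utilization 0.95 --enforce-eager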
Try --disable-frontend-multiprocessing, that works much better for me and will still give you cuda graph speed boosts.
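That is, drop --enforce-eager and instead add --disable-frontend-multiprocessing to the same launch command, for example (remaining flags from the original command unchanged):

python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8083 --model /home/downloaded_model/Llama-3.2-3B-Instruct/ --served-model-name LLM --tensor-parallel-size 8 --max-model-len 34336 --disable-frontend-multiprocessing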
Awesome! It works for me with 3.1 128k FP8. Thanks!
Closing as fixed by https://github.com/vllm-project/vllm/pull/9549