
[Bug]: vllm 0.6.3 generates incomplete/repeated answers for long (over 8k tokens) inputs

tf-ninja opened this issue

Your current environment

vllm 0.6.3

Model Input Dumps

The input is long context with over 8k tokens

🐛 Describe the bug

  1. vllm 0.6.2 does not have this bug.
  2. We are running vllm 0.6.3 with speculative decoding. When we pass a long context (over 8k tokens) to the model, the output is truncated and the answer is incomplete. The command we are using is:
python -m vllm.entrypoints.openai.api_server  --host 0.0.0.0  --port 8083  --model /home/downloaded_model/Llama-3.2-3B-Instruct/  --speculative_model /home/downloaded_model/Llama-3.2-1B-Instruct/  --served-model-name  LLM  --tensor-parallel-size 8  --max-model-len 34336  --max-num-seqs 128  --enable-prefix-caching --disable-log-requests --use-v2-block-manager --seed 42 --num_speculative_tokens 5  --gpu_memory_utilization 0.95  --spec-decoding-acceptance-method typical_acceptance_sampler
  3. We then run vllm 0.6.3 without speculative decoding, but we still get incomplete or repeated answers. The command we use is:
python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8083 --model /home/downloaded_model/Llama-3.2-3B-Instruct/ --served-model-name  LLM --tensor-parallel-size 8 --max-model-len 34336 --max-num-seqs 128 --enable-prefix-caching --enable_chunked_prefill --disable-log-requests --seed 42 --gpu_memory_utilization 0.95
  4. How we call the vLLM model is shown below:
import openai  # OpenAI-compatible Python client, pointed at the vLLM server

def call_vllm_api(message_log):
    # API_KEY and BASE_URL are placeholders for the key and endpoint of the server started above
    vllm_client = openai.OpenAI(api_key=API_KEY, base_url=BASE_URL)

    response = vllm_client.chat.completions.create(
        model="LLM",
        messages=message_log,
        max_tokens=4096,
        temperature=0.2,
        presence_penalty=0,
        frequency_penalty=0,
    )

    response_content = response.choices[0].message.content

    return response_content
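
For reference, a minimal sketch of how this helper might be driven against the server above is shown below; the API key value, base URL, and prompt contents are illustrative assumptions rather than details from the report.

# Hypothetical driver for call_vllm_api above; every value here is an assumption.
API_KEY = "EMPTY"                      # vLLM's OpenAI-compatible server accepts any key unless --api-key is set
BASE_URL = "http://localhost:8083/v1"  # matches --port 8083 in the launch commands above

long_document = "lorem ipsum " * 6000  # stand-in for a real context of over 8k tokens

message_log = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the following document:\n" + long_document},
]

print(call_vllm_api(message_log))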

tf-ninja avatar Oct 17 '24 05:10 tf-ninja

As mentioned in issue #9417, it works with --enforce-eager.

tf-ninja avatar Oct 18 '24 01:10 tf-ninja
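
For anyone trying this workaround, it amounts to appending --enforce-eager (which disables CUDA graph capture) to the launch command from the report; a sketch based on the non-speculative command above:

python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8083 \
    --model /home/downloaded_model/Llama-3.2-3B-Instruct/ --served-model-name LLM \
    --tensor-parallel-size 8 --max-model-len 34336 --max-num-seqs 128 \
    --enable-prefix-caching --enable_chunked_prefill --disable-log-requests \
    --seed 42 --gpu_memory_utilization 0.95 \
    --enforce-eager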

+1

yudian0504 avatar Oct 21 '24 08:10 yudian0504

> As mentioned in issue #9417, it works with --enforce-eager.

I'm having the same issue. Running with --enforce-eager fixes this issue for now.

Jason-CKY avatar Oct 21 '24 09:10 Jason-CKY

Try --disable-frontend-multiprocessing; that works much better for me and still gives you the CUDA graph speed boost.

bbss avatar Oct 24 '24 14:10 bbss
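
In other words, keep CUDA graphs enabled (no --enforce-eager) and add that flag to the server launch. A trimmed sketch based on the report's command (the remaining flags from the report can be kept as-is):

python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8083 \
    --model /home/downloaded_model/Llama-3.2-3B-Instruct/ --served-model-name LLM \
    --tensor-parallel-size 8 \
    --disable-frontend-multiprocessing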

> As mentioned in issue #9417, it works with --enforce-eager.
>
> I'm having the same issue. Running with --enforce-eager fixes this issue for now.

Awesome! It works for me with 3.1 128k FP8. Thanks!

TheAlexPG avatar Oct 28 '24 18:10 TheAlexPG

Closing as fixed by https://github.com/vllm-project/vllm/pull/9549

ywang96 avatar Nov 09 '24 00:11 ywang96