
API server aborts all requests for no reason

ann-lab52 opened this issue on Dec 28, 2023 · 2 comments

Bug Description

After running and testing vLLM successfully with NousResearch/Llama-2-7b-chat-hf and TheBloke/Llama-2-7b-Chat-AWQ, I switched the model to vilm/vinallama-2.7b-chat, a Llama-2-family model. This time the API server still starts successfully, but it aborts every received request and does not raise any error.

At first, I looked through issue #546 and decided to quantize the model (using AutoAWQ). However, the quantized model still has the same issue. It can't be the #633 or #273 case because the prompt is only 49 tokens long, and #677 doesn't apply either because it fails from the very first request.

I wonder whether the model being trained with the bfloat16 dtype is the cause of this issue, because vLLM still works perfectly with Llama-2-7b-Chat but not with vinallama-2.7b-chat.

Update: I tried the command from @viktor-ferenczi's recommended solution in #1206, but the issue still remains.
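
For context, the aborted requests are plain OpenAI-style chat completions. A minimal reproduction sketch is shown below (my own illustration, not the exact client code: it assumes the server is started with the command under Build script, so port 6060, and uses a placeholder prompt because the real one is censored in the log).

# Sketch: the kind of request that gets aborted. Port 6060, temperature 0.5
# and max_tokens 4045 are taken from the launch command and the server log;
# the messages are placeholders for the censored prompt.
import requests

response = requests.post(
    "http://localhost:6060/v1/chat/completions",
    json={
        "model": "vilm/vinallama-2.7b-chat",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello, can you introduce yourself?"},
        ],
        "temperature": 0.5,
        "max_tokens": 4045,
    },
    timeout=120,
)
print(response.status_code, response.json())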

Build script

python -m vllm.entrypoints.openai.api_server --model="vilm/vinallama-2.7b-chat" --port 6060 --dtype float16

Error / output

INFO 12-28 08:01:46 async_llm_engine.py:379] Received request cmpl-961f3a1448c94dd7be382448dc325d21: prompt: '[censored]', sampling params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.5, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], ignore_eos=False, max_tokens=4045, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt token ids: [1, 1, 518, 25580, 29962, 3532, 14816, 29903, 6778, 13, 39834, 18916, 32126, 32237, 32498, 32370, 29892, 32851, 32973, 32895, 18916, 32269, 32557, 32237, 29889, 35088, 32207, 33865, 33788, 34781, 32215, 32073, 32550, 32529, 29889, 13, 29966, 829, 14816, 29903, 6778, 13, 13, 37979, 32529, 32532, 33960, 518, 29914, 25580, 29962].
INFO 12-28 08:01:46 llm_engine.py:649] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 45.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%
INFO 12-28 08:01:51 llm_engine.py:649] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 81.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.4%, CPU KV cache usage: 0.0%
INFO 12-28 08:01:56 async_llm_engine.py:134] Aborted request cmpl-961f3a1448c94dd7be382448dc325d21.

Target platform

Ubuntu 20.04 - CUDA Version: 11.8 - GPU Tesla T4 - Python 3.8.17

Packages information:

  • protobuf==4.25.1
  • torch==2.1.1+cu118
  • vllm==0.2.4+cu118
  • autoawq==0.1.7

ann-lab52 · Dec 28 '23 09:12

For now, we have found a workaround: set the swap space directly to 0. This way, the CPU swap space is never used and no error is reported. However, the number of CPU blocks also becomes 0, which may slow things down a bit, but at least the server no longer hangs and dies.
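
For anyone who prefers the offline Python API, a rough equivalent of that workaround might look like the sketch below (my own illustration, not chi2liu's exact setup): the swap_space engine argument corresponds to the server's --swap-space flag, and setting it to 0 means no CPU KV-cache blocks are allocated.

# Sketch: the same workaround expressed through vLLM's offline Python API
# (roughly equivalent to passing --swap-space 0 to the API server).
from vllm import LLM, SamplingParams

llm = LLM(
    model="vilm/vinallama-2.7b-chat",
    dtype="float16",
    swap_space=0,  # 0 GiB of CPU swap space -> 0 CPU KV-cache blocks
)

outputs = llm.generate(
    ["Hello, can you introduce yourself?"],
    SamplingParams(temperature=0.5, max_tokens=128),
)
print(outputs[0].outputs[0].text)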

chi2liu · Jan 04 '24 07:01

For now, we have found a workaround: set the swap space directly to 0. This way, the CPU swap space is never used and no error is reported. However, the number of CPU blocks also becomes 0, which may slow things down a bit, but at least the server no longer hangs and dies.

@chi2liu Thank you for your reply. I did try setting the swap space to 0 as you suggested by adding --swap-space 0 to the command line, but the issue still remains. Can you describe your approach in more detail?

ann-lab52 · Jan 08 '24 04:01

It's been a while, and I finally found the cause of this issue. Some fine-tuned models don't include the chat_template parameter in tokenizer_config.json, which makes the model unable to generate a proper response to v1/chat request input, hence vLLM logging 0.0 tokens/s continuously. As a fix, please contact the model's author to add a chat_template to tokenizer_config.json (you can try one of the default chat templates recommended by Hugging Face). For easier debugging, I also recommend running a separate inference script instead of using the vLLM CLI directly; see the sketch below.
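
To make that check concrete, here is a small diagnostic sketch (my own illustration under the assumptions above; the fallback template is a generic ChatML-style example, not necessarily the template the model author should ship): it loads the tokenizer, reports whether chat_template is present, and formats a v1/chat-style message list with apply_chat_template.

# Sketch: verify whether the tokenizer ships a chat template and, if not,
# set one explicitly before formatting chat messages into a prompt.
# The fallback template is only an illustration (generic ChatML style).
from transformers import AutoTokenizer

MODEL = "vilm/vinallama-2.7b-chat"
tokenizer = AutoTokenizer.from_pretrained(MODEL)

if tokenizer.chat_template is None:
    print("No chat_template found in tokenizer_config.json")
    # Depending on the transformers version, apply_chat_template would
    # otherwise fall back to a default template with a warning, or raise.
    tokenizer.chat_template = (
        "{% for message in messages %}"
        "<|im_start|>{{ message['role'] }}\n{{ message['content'] }}<|im_end|>\n"
        "{% endfor %}"
        "{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
    )

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, can you introduce yourself?"},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)

Once the formatted prompt looks sane, it can be passed to a standalone vLLM script (like the one after chi2liu's comment above) instead of going through the API server, which makes it much easier to see where generation stalls.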

I will close this issue then. Please feel free to reopen it if you need to. Thanks.

ann-lab52 · Mar 04 '24 05:03