vLLM generates nothing (empty output)
When I run batch inference, the output from vLLM is sometimes empty, i.e. the prediction contains no tokens. Could we make it generate at least one token? A completely empty output also seems strange.
@FocusLiwen can you add some more detail, like how you are running inference, your sampling params, and what your request is?
Hi, I used tensor_parallel_size=2 with seed=0 and the following parameters: "max_tokens": 128, "temperature": 0, "top_p": 1.0, "top_k": -1. This is what I extract from the generation call:
"gold": {"text": "Sports", "supplements": {}}, "predictions": [{"text": "", "raw_text": "", "logprob": 0, "tokens": []}]}
In the Hugging Face generation API there is a parameter, min_gen_len, which can be set to 1 to avoid empty output. vLLM has no such parameter.
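For comparison, a sketch of how a minimum generation length is enforced with the Hugging Face transformers API; the parameter there is called min_new_tokens (or min_length, depending on the version), and the model name below is just a placeholder:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; the point is the minimum-length knob, not the model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Classify the topic of this article:", return_tensors="pt")
# Suppresses EOS until at least one new token is produced, so the completion
# can never be empty.
out = model.generate(**inputs, min_new_tokens=1, max_new_tokens=128)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:]))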
Does this happen if you increase the temperature to 1e-3 or 1e-2?
We ran into the same problem (no output). Being able to set a minimum length for the generated text would be very helpful.
The Mixtral AWQ vLLM example gives empty output (with temperature 0, 0.5, 1.0, or the default sampling parameters).
Same here, have you solved this?
This is the same for me as well. It reports 1024 completion_tokens, but the content is blank. The Dolphin version, TheBloke/dolphin-2.6-mixtral-8x7b-AWQ, seems to work.
Seeing the same thing. Could it be a problem with the model itself? I wonder if TheBloke's GPTQ version works.
It is likely your weights are corrupted
Same problem, has it been resolved?
Same issue here! Using vllm==0.3.0+cu118 with TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ, something is definitely wrong. It outputs an empty string despite computation and GPU usage, if anyone knows why.
Same problem. Some of my generation outputs are empty with Mixtral-8x7B-Instruct
Same here with Mistral; the output is empty.
I encountered the same issue (empty string output) with TheBloke's Mixtral AWQ, both with vLLM and with two loaders from Oobabooga's Web UI. However, ybelkada/Mixtral-8x7B-Instruct-v0.1-AWQ worked on both vLLM and the Web UI for me.
I'm still not 100% sure it's a faulty model, so I'd be happy if one of you could confirm (or deny) this with your setup.
In my case, with a Mistral Instruct model, formatting the input with the proper chat template and setting max_tokens in SamplingParams helps.
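A rough sketch of that setup; the model tag is a placeholder for whichever Mistral Instruct checkpoint is used, and the prompt is made up:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder tag

# Build the prompt with the model's own chat template instead of raw text.
tokenizer = AutoTokenizer.from_pretrained(model_id)
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Summarize the following text: ..."}],
    tokenize=False,
    add_generation_prompt=True,
)

llm = LLM(model=model_id)
params = SamplingParams(max_tokens=256, temperature=0.0)  # explicit max_tokens
print(llm.generate([prompt], params)[0].outputs[0].text)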
Is there any news on this? I also get empty strings for some reason when running in batches. With a batch size of one this never happens. Any update on this would be great.
Same issue!
I'm facing the same issue with Llama 3 8B on a 48 GB VRAM GPU while using the Outlines library to enforce JSON responses: the fields are empty, even though there is plenty of memory left and the model is fully loaded on the GPU.
Same problem: when a prompt is run for the first time it generates normally, but if the same prompt is run again it returns an empty response. Model: Phi-3 medium.
EDIT: I solved it by adding min_tokens to SamplingParams:
SamplingParams(temperature=0.5, min_tokens=1000)
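For anyone else hitting this, a minimal self-contained version of that workaround; min_tokens=1 is already enough to rule out an empty completion (the model tag and prompt are placeholders):

from vllm import LLM, SamplingParams

llm = LLM(model="microsoft/Phi-3-medium-4k-instruct")  # placeholder tag

# min_tokens keeps EOS/stop from ending the sequence before that many tokens
# have been produced, so the completion can no longer come back empty.
params = SamplingParams(temperature=0.5, min_tokens=1, max_tokens=512)

out = llm.generate(["Explain in one paragraph why a model might emit EOS immediately."], params)
print(out[0].outputs[0].text)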
I get a similar output. Tested on vLLM 0.4.2 and vLLM 0.4.3. If I use Mistral 7B Instruct v0.2, it generates blanks for one particular input. However, if I change the model to Mistral 7B Instruct v0.3 or use another model such as Llama 3, the problem does not appear for me.
Raw output using streaming. The same problem happens without streaming, except that I then get no output at all because it is busy generating the blanks.
data: "\\n\\n| "
data: "Met"
data: "ric "
data: " "
data: " "
data: " | Descript"
data: "ion "
data: " "
data: " "
data: " "
data: " "
data: " "
...
The logs from when the blanks are generated:
sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=['</s>'], stop_token_ids=[2], include_stop_str_in_output=False, ignore_eos=False, max_tokens=16384, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: None, lora_request: None.
INFO 06-07 04:31:54 metrics.py:334] Avg prompt throughput: 91.9 tokens/s, Avg generation throughput: 5.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 14.6%, CPU KV cache usage: 0.0%
INFO 06-07 04:31:59 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 37.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 15.8%, CPU KV cache usage: 0.0%
INFO 06-07 04:32:04 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 37.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 16.9%, CPU KV cache usage: 0.0%
INFO 06-07 04:32:09 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 36.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 18.0%, CPU KV cache usage: 0.0%
INFO 06-07 04:32:14 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 36.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 19.2%, CPU KV cache usage: 0.0%
INFO 06-07 04:32:20 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 36.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 20.3%, CPU KV cache usage: 0.0%
INFO 06-07 04:32:25 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 36.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 21.4%, CPU KV cache usage: 0.0%
INFO 06-07 04:32:30 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 36.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 22.5%, CPU KV cache usage: 0.0%
INFO 06-07 04:32:35 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 35.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 23.6%, CPU KV cache usage: 0.0%
INFO 06-07 04:32:40 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 36.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 24.8%, CPU KV cache usage: 0.0%
INFO 06-07 04:32:45 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 35.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 25.8%, CPU KV cache usage: 0.0%
INFO 06-07 04:32:50 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 35.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 26.9%, CPU KV cache usage: 0.0%
INFO 06-07 04:32:55 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 35.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 28.0%, CPU KV cache usage: 0.0%
INFO 06-07 04:33:00 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 35.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 29.1%, CPU KV cache usage: 0.0%
INFO 06-07 04:33:05 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 35.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 30.1%, CPU KV cache usage: 0.0%
INFO 06-07 04:33:10 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 35.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 31.2%, CPU KV cache usage: 0.0%
INFO 06-07 04:33:15 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 35.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 32.4%, CPU KV cache usage: 0.0%
INFO 06-07 04:33:20 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 35.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 33.5%, CPU KV cache usage: 0.0%
INFO 06-07 04:33:25 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 34.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 34.5%, CPU KV cache usage: 0.0%
INFO 06-07 04:33:30 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 34.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 35.5%, CPU KV cache usage: 0.0%
INFO 06-07 04:33:35 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 34.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 36.6%, CPU KV cache usage: 0.0%
INFO 06-07 04:33:40 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 34.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 37.7%, CPU KV cache usage: 0.0%
INFO 06-07 04:33:45 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 34.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 38.7%, CPU KV cache usage: 0.0%
encountered the same issue
For some reason, setting min_tokens=1 did not work for me, but min_tokens=2 worked.
same issue here.
Same issue here; changing min_tokens doesn't help in my case.
Experienced this today, running a somewhat exotic GPTQ quant: ChenMnZ/Mistral-Large-Instruct-2407-EfficientQAT-w2g64-GPTQ
Sample request
curl {{host}}/v1/chat/completions -H 'Content-Type: application/json' -H "Authorization: Bearer ---" -d '{
"model": "ChenMnZ/Mistral-Large-Instruct-2407-EfficientQAT-w2g64-GPTQ",
"messages": [
{
"role": "user",
"content": "Answer in one word. Where is Paris?"
}
],
"min_tokens": 2,
"max_tokens": 8
}'
Sample response
{
"id": "chat-9e94785f8fd94d1d93be478cb96ddaea",
"object": "chat.completion",
"created": 1722955194,
"model": "ChenMnZ/Mistral-Large-Instruct-2407-EfficientQAT-w2g64-GPTQ",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "length",
"stop_reason": null
}
],
"usage": {
"prompt_tokens": 12,
"total_tokens": 20,
"completion_tokens": 8
}
}
Engine args
--cpu-offload-gb 26 --max-model-len 1024 --enforce-eager
I've tried many other combinations; however, they didn't work either.