
Empty output generated from vLLM

Open FocusLiwen opened this issue 2 years ago • 34 comments

When I run batch inference, the output from vLLM is sometimes empty, meaning the prediction is an empty string. Could we make it generate at least one token? An entirely empty output is also strange.

FocusLiwen avatar Sep 26 '23 18:09 FocusLiwen

@FocusLiwen can you add some more detail, like how you are running inference, your sampling params, and what your request looks like?

AnupKumarJha avatar Sep 26 '23 19:09 AnupKumarJha

Hi, I used tensor_parallel_size=2 with seed=0 and the following parameters: "max_tokens": 128, "temperature": 0, "top_p": 1.0, "top_k": -1. This is the output I extract from the generation call:

"gold": {"text": "Sports", "supplements": {}}, "predictions": [{"text": "", "raw_text": "", "logprob": 0, "tokens": []}]}

FocusLiwen avatar Sep 26 '23 20:09 FocusLiwen
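
For context, a minimal sketch of the setup described above, written against the vLLM Python API; the model name and prompt are placeholders, everything else mirrors the parameters quoted in the comment:

from vllm import LLM, SamplingParams

# Placeholder model and prompt; tensor_parallel_size, seed and the sampling
# values are the ones reported above.
llm = LLM(model="<your-model>", tensor_parallel_size=2, seed=0)

params = SamplingParams(
    temperature=0,   # greedy decoding
    top_p=1.0,
    top_k=-1,
    max_tokens=128,
)

outputs = llm.generate(["Classify the topic of the following article: ..."], params)
print(repr(outputs[0].outputs[0].text))  # occasionally an empty string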

In the Hugging Face generation API, there is a parameter called min_gen_len, which can be set to 1 to avoid empty output. With vLLM, there is no such parameter.

FocusLiwen avatar Sep 26 '23 20:09 FocusLiwen

Does this happen if you increase the temperature to 1e-3 or 1e-2?

viktor-ferenczi avatar Sep 28 '23 09:09 viktor-ferenczi

> In the Hugging Face generation API, there is a parameter called min_gen_len, which can be set to 1 to avoid empty output. With vLLM, there is no such parameter.

We ran into the same problem (no output). Being able to set a minimum length for the generated text would be very helpful.

summer66 avatar Oct 04 '23 19:10 summer66
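
Note for later readers: recent vLLM releases expose a min_tokens field on SamplingParams, which is the closest equivalent to Hugging Face's min_new_tokens. A minimal sketch, assuming a vLLM version that already ships it and a placeholder model name:

from vllm import LLM, SamplingParams

llm = LLM(model="<your-model>")  # placeholder

params = SamplingParams(
    temperature=0,
    max_tokens=128,
    min_tokens=1,  # suppress EOS until at least one token has been generated
)

out = llm.generate(["<your prompt>"], params)
print(repr(out[0].outputs[0].text))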

The Mixtral AWQ vLLM example gives empty output (with temperature 0, 0.5, 1.0, or the default sampling parameters).

eaubin avatar Jan 02 '24 18:01 eaubin

> The Mixtral AWQ vLLM example gives empty output (with temperature 0, 0.5, 1.0, or the default sampling parameters).

Same here, have you solved this?

raihan0824 avatar Jan 06 '24 04:01 raihan0824

> The Mixtral AWQ vLLM example gives empty output (with temperature 0, 0.5, 1.0, or the default sampling parameters).

> Same here, have you solved this?

This is the same for me as well. It says 1024 completion_tokens, but the content is blank. The dolphin version, TheBloke/dolphin-2.6-mixtral-8x7b-AWQ, seems to work.

Andrew-MAQ avatar Jan 08 '24 22:01 Andrew-MAQ

> The Mixtral AWQ vLLM example gives empty output (with temperature 0, 0.5, 1.0, or the default sampling parameters).

> Same here, have you solved this?

> This is the same for me as well. It says 1024 completion_tokens, but the content is blank. The dolphin version, TheBloke/dolphin-2.6-mixtral-8x7b-AWQ, seems to work.

Seeing the same thing. Could it be a problem with the model itself? Wonder if TheBloke's GPTQ version works.

hnhlester avatar Jan 11 '24 22:01 hnhlester

It is likely your weights are corrupted

sfc-gh-hazhang avatar Jan 26 '24 00:01 sfc-gh-hazhang

> The Mixtral AWQ vLLM example gives empty output (with temperature 0, 0.5, 1.0, or the default sampling parameters).

Same problem, has it been resolved?

HIT-Owen avatar Feb 02 '24 03:02 HIT-Owen

Same issue here! Using vllm==0.3.0+cu118 with TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ, something is definitely wrong. It outputs an empty string despite the computation and GPU usage, if anyone knows why.

imadcap avatar Feb 06 '24 15:02 imadcap

Same problem. Some of my generation outputs are empty with Mixtral-8x7B-Instruct

arifcraft avatar Feb 16 '24 14:02 arifcraft

Same here with Mistral: the output is empty.

hahmad2008 avatar Mar 24 '24 08:03 hahmad2008

I encountered the same issue (empty string as the text output) with TheBloke's Mixtral AWQ, both with vLLM and with two loaders from Oobabooga's Web UI. However, ybelkada/Mixtral-8x7B-Instruct-v0.1-AWQ worked on both vLLM and the Web UI for me.

I'm still not 100% sure it's a faulty model, so I'd be happy if one of you could confirm (or deny) this with your setup.

Meersalzeis avatar Apr 01 '24 11:04 Meersalzeis

In my case, using a Mistral Instruct model, formatting the input with the proper chat template and setting max_tokens in SamplingParams helps.

cieske avatar Apr 03 '24 08:04 cieske
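
A sketch of that suggestion, assuming Mistral 7B Instruct and building the prompt with the tokenizer's chat template (the model name and message are illustrative, not from the original comment):

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # any Instruct model works here
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Format the input with the model's own chat template instead of raw text.
messages = [{"role": "user", "content": "Classify the topic of this article: ..."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

llm = LLM(model=model_id)
params = SamplingParams(temperature=0, max_tokens=128)  # max_tokens set explicitly
print(llm.generate([prompt], params)[0].outputs[0].text)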

Is there any news on this? I also get empty strings when running in batches; some prompts return an empty string for some reason. With a batch size of one this never happens. Any update on this would be great.

sAviOr287 avatar Apr 18 '24 18:04 sAviOr287
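
Until this is fixed, one possible workaround is to detect empty completions in the batch and re-run only those prompts; a sketch (the retry parameters are just an example, and min_tokens requires a recent vLLM version):

from vllm import LLM, SamplingParams

llm = LLM(model="<your-model>")  # placeholder
prompts = ["...", "...", "..."]  # your batch

params = SamplingParams(temperature=0, max_tokens=128)
outputs = llm.generate(prompts, params)
texts = [o.outputs[0].text for o in outputs]

# Re-run only the prompts that came back empty, e.g. with min_tokens
# and a slightly higher temperature.
empty_idx = [i for i, t in enumerate(texts) if not t.strip()]
if empty_idx:
    retry_params = SamplingParams(temperature=1e-2, max_tokens=128, min_tokens=1)
    retried = llm.generate([prompts[i] for i in empty_idx], retry_params)
    for i, o in zip(empty_idx, retried):
        texts[i] = o.outputs[0].text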

Same issue!

AmoghM avatar Apr 30 '24 18:04 AmoghM

I'm facing the same issue with Llama 3 8B on a 48 GB VRAM GPU while using the outlines library to enforce JSON responses: the fields are empty, even though there is plenty of memory left and the model is loaded entirely on the GPU.

hugocool avatar May 16 '24 14:05 hugocool

Same problem: when the prompt is run for the first time it generates normally; if the same prompt is run again, it returns an empty response. Model: Phi-3 medium.

EDIT: I solved it by adding min_tokens in SamplingParams:

SamplingParams(temperature=0.5, min_tokens=1000)

joaograndotto avatar May 28 '24 13:05 joaograndotto

I get a similar output. Tested on vLLM 0.4.2 and 0.4.3. If I use Mistral 7B Instruct v0.2, it generates blanks for one particular input. However, if I change the model to Mistral 7B Instruct v0.3 or use another model like Llama 3, the problem does not appear for me.

Raw output using streaming. The same problem happens without streaming, except that I get no output at all because it is busy generating the blanks.

data: "\\n\\n| "

data: "Met"

data: "ric             "

data: "  "

data: "  "

data: "  | Descript"

data: "ion             "

data: "                "

data: "                "

data: "                "

data: "                "

data: "                "
...

The logs from when the blanks are generated:

sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=['</s>'], stop_token_ids=[2], include_stop_str_in_output=False, ignore_eos=False, max_tokens=16384, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: None, lora_request: None.
INFO 06-07 04:31:54 metrics.py:334] Avg prompt throughput: 91.9 tokens/s, Avg generation throughput: 5.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 14.6%, CPU KV cache usage: 0.0%
INFO 06-07 04:31:59 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 37.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 15.8%, CPU KV cache usage: 0.0%
INFO 06-07 04:32:04 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 37.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 16.9%, CPU KV cache usage: 0.0%
INFO 06-07 04:32:09 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 36.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 18.0%, CPU KV cache usage: 0.0%
INFO 06-07 04:32:14 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 36.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 19.2%, CPU KV cache usage: 0.0%
INFO 06-07 04:32:20 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 36.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 20.3%, CPU KV cache usage: 0.0%
INFO 06-07 04:32:25 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 36.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 21.4%, CPU KV cache usage: 0.0%
INFO 06-07 04:32:30 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 36.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 22.5%, CPU KV cache usage: 0.0%
INFO 06-07 04:32:35 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 35.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 23.6%, CPU KV cache usage: 0.0%
INFO 06-07 04:32:40 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 36.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 24.8%, CPU KV cache usage: 0.0%
INFO 06-07 04:32:45 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 35.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 25.8%, CPU KV cache usage: 0.0%
INFO 06-07 04:32:50 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 35.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 26.9%, CPU KV cache usage: 0.0%
INFO 06-07 04:32:55 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 35.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 28.0%, CPU KV cache usage: 0.0%
INFO 06-07 04:33:00 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 35.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 29.1%, CPU KV cache usage: 0.0%
INFO 06-07 04:33:05 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 35.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 30.1%, CPU KV cache usage: 0.0%
INFO 06-07 04:33:10 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 35.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 31.2%, CPU KV cache usage: 0.0%
INFO 06-07 04:33:15 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 35.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 32.4%, CPU KV cache usage: 0.0%
INFO 06-07 04:33:20 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 35.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 33.5%, CPU KV cache usage: 0.0%
INFO 06-07 04:33:25 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 34.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 34.5%, CPU KV cache usage: 0.0%
INFO 06-07 04:33:30 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 34.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 35.5%, CPU KV cache usage: 0.0%
INFO 06-07 04:33:35 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 34.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 36.6%, CPU KV cache usage: 0.0%
INFO 06-07 04:33:40 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 34.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 37.7%, CPU KV cache usage: 0.0%
INFO 06-07 04:33:45 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 34.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 38.7%, CPU KV cache usage: 0.0%

dawu415 avatar Jun 07 '24 15:06 dawu415
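
Not a confirmed fix, but the log above shows repetition_penalty=1.0 and max_tokens=16384, so one thing worth trying is a mild repetition/frequency penalty and a tighter max_tokens so the run of blank tokens is discouraged and bounded; a sketch:

from vllm import SamplingParams

# Sketch only: penalties make the endless run of whitespace tokens less likely,
# and a smaller max_tokens limits the damage if it still happens.
params = SamplingParams(
    temperature=0,
    repetition_penalty=1.1,   # the log above used 1.0
    frequency_penalty=0.2,
    max_tokens=1024,          # instead of 16384
    stop=["</s>"],
)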

encountered the same issue

zichaow avatar Jun 28 '24 00:06 zichaow

For some reason, setting min_tokens=1 did not work for me, but min_tokens=2 worked.

fc2869 avatar Jul 01 '24 05:07 fc2869

same issue here.

ronchengang avatar Jul 13 '24 02:07 ronchengang

Same issue here; changing min_tokens doesn't help in my case.

ShengGuanWSU avatar Jul 25 '24 17:07 ShengGuanWSU

Experienced this today, running a somewhat exotic GPTQ quant: ChenMnZ/Mistral-Large-Instruct-2407-EfficientQAT-w2g64-GPTQ

Sample request

curl {{host}}/v1/chat/completions -H 'Content-Type: application/json' -H "Authorization: Bearer ---" -d '{
  "model": "ChenMnZ/Mistral-Large-Instruct-2407-EfficientQAT-w2g64-GPTQ",
  "messages": [
    {
      "role": "user",
      "content": "Answer in one word. Where is Paris?"
    }
  ],
  "min_tokens": 2,
  "max_tokens": 8
}'
Sample response

{
  "id": "chat-9e94785f8fd94d1d93be478cb96ddaea",
  "object": "chat.completion",
  "created": 1722955194,
  "model": "ChenMnZ/Mistral-Large-Instruct-2407-EfficientQAT-w2g64-GPTQ",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "total_tokens": 20,
    "completion_tokens": 8
  }
}

Engine args

--cpu-offload-gb 26 --max-model-len 1024 --enforce-eager

I've tried many other combinations; however, they didn't work either.

av avatar Aug 06 '24 14:08 av
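
For completeness, the same request through the OpenAI Python client; min_tokens is a vLLM-specific extension of the chat completions API, so it has to go through extra_body (host, key and model are the placeholders from the curl example above):

from openai import OpenAI

client = OpenAI(base_url="http://{{host}}/v1", api_key="---")  # placeholders

resp = client.chat.completions.create(
    model="ChenMnZ/Mistral-Large-Instruct-2407-EfficientQAT-w2g64-GPTQ",
    messages=[{"role": "user", "content": "Answer in one word. Where is Paris?"}],
    max_tokens=8,
    extra_body={"min_tokens": 2},  # vLLM-specific field, not part of the OpenAI spec
)
print(repr(resp.choices[0].message.content))  # still "" when the bug triggers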