
Empty output generated from vLLM

Open FocusLiwen opened this issue 2 years ago • 34 comments

When I run batch inference, the output from vLLM is sometimes empty, meaning the prediction is an empty string. Could we make it generate at least one token? An entirely empty output is also strange.

FocusLiwen avatar Sep 26 '23 18:09 FocusLiwen

@FocusLiwen can you add some more detail, like how you are running inference, your sampling params, and what your request looks like?

AnupKumarJha avatar Sep 26 '23 19:09 AnupKumarJha

Hi, I used tensor_parallel_size=2 with seed=0 and the following parameters: "max_tokens": 128, "temperature": 0, "top_p": 1.0, "top_k": -1. This is the output I extract from the generation call:

"gold": {"text": "Sports", "supplements": {}}, "predictions": [{"text": "", "raw_text": "", "logprob": 0, "tokens": []}]}

FocusLiwen avatar Sep 26 '23 20:09 FocusLiwen
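
For context, a minimal sketch of the setup described above, written against the vLLM Python API; the model name and prompt are placeholders, everything else mirrors the parameters quoted in the comment:

from vllm import LLM, SamplingParams

# Placeholder model and prompt; tensor_parallel_size, seed and the sampling
# values are the ones reported above.
llm = LLM(model="<your-model>", tensor_parallel_size=2, seed=0)

params = SamplingParams(
    temperature=0,   # greedy decoding
    top_p=1.0,
    top_k=-1,
    max_tokens=128,
)

outputs = llm.generate(["Classify the topic of the following article: ..."], params)
print(repr(outputs[0].outputs[0].text))  # occasionally an empty string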

In the Hugging Face generation API, there is a parameter called min_gen_len, which can be set to 1 to avoid empty output. With vLLM, there is no such parameter.

FocusLiwen avatar Sep 26 '23 20:09 FocusLiwen

Does this happen if you increase the temperature to 1e-3 or 1e-2?

viktor-ferenczi avatar Sep 28 '23 09:09 viktor-ferenczi

> In the Hugging Face generation API, there is a parameter called min_gen_len, which can be set to 1 to avoid empty output. With vLLM, there is no such parameter.

We ran into the same problem (no output). Being able to set a minimum length for the generated text would be very helpful.

summer66 avatar Oct 04 '23 19:10 summer66
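
Note for later readers: recent vLLM releases expose a min_tokens field on SamplingParams, which is the closest equivalent to Hugging Face's min_new_tokens. A minimal sketch, assuming a vLLM version that already ships it and a placeholder model name:

from vllm import LLM, SamplingParams

llm = LLM(model="<your-model>")  # placeholder

params = SamplingParams(
    temperature=0,
    max_tokens=128,
    min_tokens=1,  # suppress EOS until at least one token has been generated
)

out = llm.generate(["<your prompt>"], params)
print(repr(out[0].outputs[0].text))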

The Mixtral AWQ vLLM example gives empty output (with temperature 0, 0.5, 1.0, or the default sampling parameters).

eaubin avatar Jan 02 '24 18:01 eaubin

> The Mixtral AWQ vLLM example gives empty output (with temperature 0, 0.5, 1.0, or the default sampling parameters).

Same here, have you solved this?

raihan0824 avatar Jan 06 '24 04:01 raihan0824

> The Mixtral AWQ vLLM example gives empty output (with temperature 0, 0.5, 1.0, or the default sampling parameters).

> Same here, have you solved this?

This is the same for me as well. It says 1024 completion_tokens, but the content is blank. The dolphin version, TheBloke/dolphin-2.6-mixtral-8x7b-AWQ, seems to work.

Andrew-MAQ avatar Jan 08 '24 22:01 Andrew-MAQ

> The Mixtral AWQ vLLM example gives empty output (with temperature 0, 0.5, 1.0, or the default sampling parameters).

> Same here, have you solved this?

> This is the same for me as well. It says 1024 completion_tokens, but the content is blank. The dolphin version, TheBloke/dolphin-2.6-mixtral-8x7b-AWQ, seems to work.

Seeing the same thing. Could it be a problem with the model itself? Wonder if TheBloke's GPTQ version works.

hnhlester avatar Jan 11 '24 22:01 hnhlester

It is likely your weights are corrupted

sfc-gh-hazhang avatar Jan 26 '24 00:01 sfc-gh-hazhang

> The Mixtral AWQ vLLM example gives empty output (with temperature 0, 0.5, 1.0, or the default sampling parameters).

Same problem, has it been resolved?

HIT-Owen avatar Feb 02 '24 03:02 HIT-Owen

Same issue here! Using vllm==0.3.0+cu118 with TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ, something is definitely wrong. It outputs an empty string despite the computation and GPU usage, if anyone knows why.

imadcap avatar Feb 06 '24 15:02 imadcap

Same problem. Some of my generation outputs are empty with Mixtral-8x7B-Instruct

arifcraft avatar Feb 16 '24 14:02 arifcraft

Same here with Mistral: the output is empty.

hahmad2008 avatar Mar 24 '24 08:03 hahmad2008

I encountered the same issue (empty string as the text output) with TheBloke's Mixtral AWQ, both with vLLM and with two loaders from Oobabooga's Web UI. However, ybelkada/Mixtral-8x7B-Instruct-v0.1-AWQ worked on both vLLM and the Web UI for me.

I'm still not 100% sure it's a faulty model, so I'd be happy if one of you could confirm (or deny) this with your setup.

Meersalzeis avatar Apr 01 '24 11:04 Meersalzeis

In my case, using a Mistral Instruct model, formatting the input with the proper chat template and setting max_tokens in SamplingParams helps.

cieske avatar Apr 03 '24 08:04 cieske
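
A sketch of that suggestion, assuming Mistral 7B Instruct and building the prompt with the tokenizer's chat template (the model name and message are illustrative, not from the original comment):

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # any Instruct model works here
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Format the input with the model's own chat template instead of raw text.
messages = [{"role": "user", "content": "Classify the topic of this article: ..."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

llm = LLM(model=model_id)
params = SamplingParams(temperature=0, max_tokens=128)  # max_tokens set explicitly
print(llm.generate([prompt], params)[0].outputs[0].text)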

Is there any news on this? I also get empty strings when running in batches; some prompts return an empty string for some reason. With a batch size of one this never happens. Any update on this would be great.

sAviOr287 avatar Apr 18 '24 18:04 sAviOr287
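
Until this is fixed, one possible workaround is to detect empty completions in the batch and re-run only those prompts; a sketch (the retry parameters are just an example, and min_tokens requires a recent vLLM version):

from vllm import LLM, SamplingParams

llm = LLM(model="<your-model>")  # placeholder
prompts = ["...", "...", "..."]  # your batch

params = SamplingParams(temperature=0, max_tokens=128)
outputs = llm.generate(prompts, params)
texts = [o.outputs[0].text for o in outputs]

# Re-run only the prompts that came back empty, e.g. with min_tokens
# and a slightly higher temperature.
empty_idx = [i for i, t in enumerate(texts) if not t.strip()]
if empty_idx:
    retry_params = SamplingParams(temperature=1e-2, max_tokens=128, min_tokens=1)
    retried = llm.generate([prompts[i] for i in empty_idx], retry_params)
    for i, o in zip(empty_idx, retried):
        texts[i] = o.outputs[0].text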

Same issue!

AmoghM avatar Apr 30 '24 18:04 AmoghM

I'm facing the same issue with Llama 3 8B on a 48 GB VRAM GPU while using the outlines library to enforce JSON responses: the fields are empty, even though there is plenty of memory left and the model is loaded entirely on the GPU.

hugocool avatar May 16 '24 14:05 hugocool

Same problem: when the prompt is run for the first time it generates normally; if the same prompt is run again, it returns an empty response. Model: Phi-3 medium.

EDIT: I solved it by adding min_tokens in SamplingParams:

SamplingParams(temperature=0.5, min_tokens=1000)

joaograndotto avatar May 28 '24 13:05 joaograndotto

I get a similar output. Tested on vLLM 0.4.2 and 0.4.3. If I use Mistral 7B Instruct v0.2, it generates blanks for one particular input. However, if I change the model to Mistral 7B Instruct v0.3 or use another model like Llama 3, the problem does not appear for me.

Raw output using streaming. The same problem happens without streaming, except that I get no output at all because it is busy generating the blanks.

data: "\\n\\n| "

data: "Met"

data: "ric             "

data: "  "

data: "  "

data: "  | Descript"

data: "ion             "

data: "                "

data: "                "

data: "                "

data: "                "

data: "                "
...

The logs from when the blanks are generated:

sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=['</s>'], stop_token_ids=[2], include_stop_str_in_output=False, ignore_eos=False, max_tokens=16384, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: None, lora_request: None.
INFO 06-07 04:31:54 metrics.py:334] Avg prompt throughput: 91.9 tokens/s, Avg generation throughput: 5.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 14.6%, CPU KV cache usage: 0.0%
INFO 06-07 04:31:59 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 37.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 15.8%, CPU KV cache usage: 0.0%
INFO 06-07 04:32:04 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 37.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 16.9%, CPU KV cache usage: 0.0%
INFO 06-07 04:32:09 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 36.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 18.0%, CPU KV cache usage: 0.0%
INFO 06-07 04:32:14 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 36.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 19.2%, CPU KV cache usage: 0.0%
INFO 06-07 04:32:20 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 36.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 20.3%, CPU KV cache usage: 0.0%
INFO 06-07 04:32:25 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 36.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 21.4%, CPU KV cache usage: 0.0%
INFO 06-07 04:32:30 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 36.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 22.5%, CPU KV cache usage: 0.0%
INFO 06-07 04:32:35 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 35.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 23.6%, CPU KV cache usage: 0.0%
INFO 06-07 04:32:40 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 36.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 24.8%, CPU KV cache usage: 0.0%
INFO 06-07 04:32:45 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 35.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 25.8%, CPU KV cache usage: 0.0%
INFO 06-07 04:32:50 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 35.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 26.9%, CPU KV cache usage: 0.0%
INFO 06-07 04:32:55 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 35.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 28.0%, CPU KV cache usage: 0.0%
INFO 06-07 04:33:00 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 35.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 29.1%, CPU KV cache usage: 0.0%
INFO 06-07 04:33:05 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 35.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 30.1%, CPU KV cache usage: 0.0%
INFO 06-07 04:33:10 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 35.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 31.2%, CPU KV cache usage: 0.0%
INFO 06-07 04:33:15 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 35.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 32.4%, CPU KV cache usage: 0.0%
INFO 06-07 04:33:20 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 35.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 33.5%, CPU KV cache usage: 0.0%
INFO 06-07 04:33:25 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 34.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 34.5%, CPU KV cache usage: 0.0%
INFO 06-07 04:33:30 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 34.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 35.5%, CPU KV cache usage: 0.0%
INFO 06-07 04:33:35 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 34.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 36.6%, CPU KV cache usage: 0.0%
INFO 06-07 04:33:40 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 34.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 37.7%, CPU KV cache usage: 0.0%
INFO 06-07 04:33:45 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 34.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 38.7%, CPU KV cache usage: 0.0%

dawu415 avatar Jun 07 '24 15:06 dawu415
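
Not a confirmed fix, but the log above shows repetition_penalty=1.0 and max_tokens=16384, so one thing worth trying is a mild repetition/frequency penalty and a tighter max_tokens so the run of blank tokens is discouraged and bounded; a sketch:

from vllm import SamplingParams

# Sketch only: penalties make the endless run of whitespace tokens less likely,
# and a smaller max_tokens limits the damage if it still happens.
params = SamplingParams(
    temperature=0,
    repetition_penalty=1.1,   # the log above used 1.0
    frequency_penalty=0.2,
    max_tokens=1024,          # instead of 16384
    stop=["</s>"],
)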

encountered the same issue

zichaow avatar Jun 28 '24 00:06 zichaow

For some reason, setting min_tokens=1 did not work for me, but min_tokens=2 worked.

fc2869 avatar Jul 01 '24 05:07 fc2869

same issue here.

ronchengang avatar Jul 13 '24 02:07 ronchengang

Same issue here; changing min_tokens doesn't help in my case.

ShengGuanWSU avatar Jul 25 '24 17:07 ShengGuanWSU

Experienced this today, running a somewhat exotic GPTQ quant: ChenMnZ/Mistral-Large-Instruct-2407-EfficientQAT-w2g64-GPTQ

Sample request

curl {{host}}/v1/chat/completions -H 'Content-Type: application/json' -H "Authorization: Bearer ---" -d '{
  "model": "ChenMnZ/Mistral-Large-Instruct-2407-EfficientQAT-w2g64-GPTQ",
  "messages": [
    {
      "role": "user",
      "content": "Answer in one word. Where is Paris?"
    }
  ],
  "min_tokens": 2,
  "max_tokens": 8
}'
Sample response

{
  "id": "chat-9e94785f8fd94d1d93be478cb96ddaea",
  "object": "chat.completion",
  "created": 1722955194,
  "model": "ChenMnZ/Mistral-Large-Instruct-2407-EfficientQAT-w2g64-GPTQ",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "total_tokens": 20,
    "completion_tokens": 8
  }
}

Engine args

--cpu-offload-gb 26 --max-model-len 1024 --enforce-eager

I've tried many other combinations; however, they didn't work either.

av avatar Aug 06 '24 14:08 av
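
For completeness, the same request through the OpenAI Python client; min_tokens is a vLLM-specific extension of the chat completions API, so it has to go through extra_body (host, key and model are the placeholders from the curl example above):

from openai import OpenAI

client = OpenAI(base_url="http://{{host}}/v1", api_key="---")  # placeholders

resp = client.chat.completions.create(
    model="ChenMnZ/Mistral-Large-Instruct-2407-EfficientQAT-w2g64-GPTQ",
    messages=[{"role": "user", "content": "Answer in one word. Where is Paris?"}],
    max_tokens=8,
    extra_body={"min_tokens": 2},  # vLLM-specific field, not part of the OpenAI spec
)
print(repr(resp.choices[0].message.content))  # still "" when the bug triggers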