
vLLM for Qwen 2.5 72B with GPTQ 4-bit quantization produces all !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! outputs, regardless of prompt

Open manitadayon opened this issue 9 months ago • 30 comments

Your current environment

I performed GPTQ quantization on Qwen 2.5 72B Instruct using the AutoGPTQ package, with the following configuration: group_size = 32, desc_order = True. I then use the model in vLLM with the following configuration:

llm = LLM(model=model_path, max_model_len=20000)

messages = [
    {
        "role": "system",
        "content": system_message,
    },
    {
        "role": "user",
        "content": user_message,
    },
]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True)
output = llm.generate(...)

However, regardless of the prompt, the output is always !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

The same code works perfectly fine for Llama 3.3 70B and Llama 3.1 70B.

Is Qwen 2.5 72B not compatible with vLLM? I have the latest versions of vLLM and Transformers, installed with

!pip install --upgrade vllm
!pip install --upgrade transformers

Any help would be appreciated.

🐛 Describe the bug

The output is always !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! no matter the input, the prompt, or any other configuration.

Before submitting a new issue...

  • [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

manitadayon avatar Mar 03 '25 09:03 manitadayon

Is vLLM compatible with GPTQ 4-bit quantization of Qwen Instruct? Has anyone run this successfully? Do you know whether this is related to the 4-bit quantization, or is there some other problem I am not aware of?

manitadayon avatar Mar 03 '25 09:03 manitadayon

maybe fixed by https://github.com/vllm-project/vllm/pull/11493

noooop avatar Mar 03 '25 09:03 noooop

maybe fixed by https://github.com/vllm-project/vllm/pull/11493

Sorry, can you elaborate? I looked at the PR and I do not know what I should do to fix the problem. I am not passing the quantization parameter to LLM, so I believe I am using the GPTQ Marlin kernel, but I still get the error. Technically, based on that PR, I should not have this issue in the first place since it is already merged. If I am missing something, please elaborate.
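
For reference, the quantization method can also be forced explicitly when constructing the engine; a minimal sketch, reusing model_path from the snippet above:

from vllm import LLM

# Explicitly request the GPTQ Marlin kernel instead of relying on auto-detection.
llm = LLM(model=model_path, max_model_len=20000, quantization="gptq_marlin")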

manitadayon avatar Mar 03 '25 10:03 manitadayon

May I ask what version of vllm you have?

noooop avatar Mar 03 '25 10:03 noooop

I used !pip install --upgrade vllm; the version is 0.7.3.

manitadayon avatar Mar 03 '25 10:03 manitadayon

@jeejeelee

Please take a look and help

noooop avatar Mar 03 '25 10:03 noooop

@manitadayon

noooop avatar Mar 03 '25 10:03 noooop

@manitadayon

I will try the Qwen2 one and let you know. I do not think I have NaN in my parameters.

manitadayon avatar Mar 03 '25 10:03 manitadayon

Any GPTQ-Int4 model from the official HF repository is fine; it doesn't have to be Qwen2, it can be Qwen2.5.

noooop avatar Mar 03 '25 10:03 noooop

Could you please provide your running script?

jeejeelee avatar Mar 03 '25 14:03 jeejeelee

#13035 possibly related?

benlemasurier avatar Mar 03 '25 16:03 benlemasurier

Any GPTQ-Int4 model from the official HF repository is fine; it doesn't have to be Qwen2, it can be Qwen2.5.

I tried Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4 from Hugging Face, and it worked, although to be honest the performance was extremely weak. I see that its configuration was desc_order = False, group_size = 128, whereas mine is desc_order = True, group_size = 32.

manitadayon avatar Mar 03 '25 16:03 manitadayon

Could you please provide your running script?

@jeejeelee, sure, please see the following:

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quant_config = BaseQuantizeConfig(
    bits=4, group_size=32, desc_act=True,  # desc_act is the activation-order ("desc_order") option
    damp_percent=0.01)

quantize_model = AutoGPTQForCausalLM.from_pretrained(
    model, quantize_config=quant_config, max_memory=max_memory)
quantize_model.quantize(dataset, batch_size=1, use_triton=False)
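
The quantized weights are then saved to disk and that directory is what I pass to vLLM, roughly (save_dir stands in for my actual output path):

quantize_model.save_quantized(save_dir)  # save step, sketched here with a placeholder path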

manitadayon avatar Mar 03 '25 16:03 manitadayon

Any GPTQ-Int4 model from the official HF repository is fine; it doesn't have to be Qwen2, it can be Qwen2.5.

I tried Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4 from Hugging Face, and it worked, although to be honest the performance was extremely weak. I see that its configuration was desc_order = False, group_size = 128, whereas mine is desc_order = True, group_size = 32.

After more testing, I realize the bad performance is due to the long context length, which is strange since the same configuration works great with Llama. Regardless, I am still trying to understand why my GPTQ quantization did not work with vLLM.

@jeejeelee @noooop Do you think this is because of the AutoGPTQ package, or because of the desc_order and group_size configuration I described above? Either way it is very strange to me.

manitadayon avatar Mar 04 '25 02:03 manitadayon

@jeejeelee

According to https://github.com/vllm-project/vllm/pull/11493, NaN results in the Qwen model lead to the !!!!! output.

The GPU-to-CPU conversion actually happens in the sampler. Before that point, adding NaN detection would synchronize the CUDA stream, resulting in performance degradation.

There is no particularly good place to run a runtime NaN check on hidden_or_intermediate_states.

draft

noooop avatar Mar 04 '25 02:03 noooop

I see. Why would it produce a NaN result? I am just trying to understand what action I need to take to resolve this: should I requantize the model, or should I change the model parameters?

manitadayon avatar Mar 04 '25 03:03 manitadayon

I see. Why would it produce a NaN result? I am just trying to understand what action I need to take to resolve this: should I requantize the model, or should I change the model parameters?

Let's first locate and confirm the problem, then try to solve it.

noooop avatar Mar 04 '25 03:03 noooop

@manitadayon

Is there any way you can modify the vLLM code (in the installed Python site-packages) to add

print(torch.isnan(hidden_or_intermediate_states).any())

before

https://github.com/vllm-project/vllm/blob/bb5b640359cc6695cb7818a24680e226f72a4da7/vllm/worker/model_runner.py#L1788

It is a bit hacky, but reinstalling vLLM from source takes a long time.
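
Concretely, the temporary patch would be something like this (torch is already imported in model_runner.py; only these lines are added, purely for debugging):

# Debug check: flag NaNs in the final hidden states before sampling.
if torch.isnan(hidden_or_intermediate_states).any():
    print("NaN detected in hidden_or_intermediate_states")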

noooop avatar Mar 04 '25 03:03 noooop

Sure. Given my setup it is pretty difficult to modify the package and install it from source, but I will try to test this idea.

manitadayon avatar Mar 04 '25 03:03 manitadayon

@noooop Thank you for providing this very useful information. I will verify the NaN output ASAP

jeejeelee avatar Mar 05 '25 02:03 jeejeelee

@noooop @manitadayon I can reproduce this issue by using Qwen1.5-14B-Chat-GPTQ, and now I've implemented a temporary solution which could fix this issue locally, please see: https://github.com/jeejeelee/vllm/blob/qwen2-overflow-clamp/vllm/model_executor/models/qwen2.py#L237-L246. The code snippet used to reproduce is as follows:

import vllm
from vllm import SamplingParams


MODEL_PATH = "/model/Qwen1.5-14B-Chat-GPTQ"

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

template = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n{question}<|im_end|>\n<|im_start|>assistant\n"

prompts = [template.format(question=prompt) for prompt in prompts]
sampling_params = SamplingParams(temperature=0.0, top_p=0.95)
llm = vllm.LLM(
    MODEL_PATH,
    max_num_seqs=2,
    trust_remote_code=True,
    max_model_len=1024,
    tensor_parallel_size=2,
    enforce_eager=True, 
)

outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


jeejeelee avatar Mar 05 '25 05:03 jeejeelee

@jeejeelee

I didn't expect that.

I thought this code had been optimized over several years and was bug-free already.

I think all GPTQ models, and in fact all quantized models using fp16, will be affected.

This means we either need to add cast_overflow_tensors to all quantized layers, or we need to modify the CUDA kernel to solve this problem.

In fact, I don't know why quantized models use fp16 as the default dtype and convert a bf16 model to fp16.

AWQ Marlin and GPTQ Marlin both support bf16:

https://github.com/vllm-project/vllm/blob/32985bed7c88f654b11f919ead34d77e846c32e3/vllm/model_executor/layers/quantization/awq_marlin.py#L82

https://github.com/vllm-project/vllm/blob/32985bed7c88f654b11f919ead34d77e846c32e3/vllm/model_executor/layers/quantization/gptq_marlin.py#L108
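
If the fp16 default is indeed the culprit, one workaround worth testing (just a guess, not verified in this thread) is to request bf16 explicitly when loading the quantized model:

# Sketch: force bf16 instead of the fp16 default for the GPTQ checkpoint.
llm = LLM(model=model_path, dtype="bfloat16", max_model_len=20000)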

noooop avatar Mar 05 '25 06:03 noooop

@noooop I agree, I just wanted to give you feedback on the results of my testing.

jeejeelee avatar Mar 05 '25 07:03 jeejeelee

@mgoin Could you please take a look at this thread? Thanks.

jeejeelee avatar Mar 05 '25 08:03 jeejeelee

Is there a model uploaded to HF that I can reproduce with? I would assume this issue is specific to group_size=32, is this accurate? I would not be surprised if there are issues with this config since there aren't many group_size=32 models out there.

I found a Qwen GPTQ model with "bits": 4, "group_size": 32, "damp_percent": 0.01, "desc_act": true that I uploaded here (https://huggingface.co/mgoin/Qwen1.5-14B-Chat-GPTQ), and was able to get a reasonable GSM8k eval on it:

lm_eval --model vllm --model_args pretrained=Qwen1.5-14B-Chat-GPTQ,quantization=gptq_marlin --tasks gsm8k --num_fewshot 5 --batch_size auto
...
vllm (pretrained=Qwen1.5-14B-Chat-GPTQ,quantization=gptq_marlin), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.6861|±  |0.0128|
|     |       |strict-match    |     5|exact_match|↑  |0.5383|±  |0.0137|

So I'm going to say I haven't been able to reproduce this yet.

mgoin avatar Mar 05 '25 16:03 mgoin

Is there a model uploaded to HF that I can reproduce with? I would assume this issue is specific to group_size=32, is this accurate? I would not be surprised if there are issues with this config since there aren't many group_size=32 models out there.

I found a Qwen GPTQ model with "bits": 4, "group_size": 32, "damp_percent": 0.01, "desc_act": true that I uploaded here (https://huggingface.co/mgoin/Qwen1.5-14B-Chat-GPTQ), and was able to get a reasonable GSM8k eval on it:

lm_eval --model vllm --model_args pretrained=Qwen1.5-14B-Chat-GPTQ,quantization=gptq_marlin --tasks gsm8k --num_fewshot 5 --batch_size auto
...
vllm (pretrained=Qwen1.5-14B-Chat-GPTQ,quantization=gptq_marlin), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.6861|±  |0.0128|
|     |       |strict-match    |     5|exact_match|↑  |0.5383|±  |0.0137|

So I'm going to say I haven't been able to reproduce this yet.

I just reproduced the error again. Unfortunately, I cannot upload the model to HF. This time my configuration used HF (Transformers) GPTQ quantization instead of AutoGPTQ, with group_size = 32 and desc_order = False (I have tried True as well, with no luck). Another thing to check might be the calibration data: I used my own rather than WikiText or C4. I did this for Qwen 2.5 72B Instruct, so maybe there is something about this model. Two more possibilities to check: the calibration data and the model itself.
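
The Transformers-side quantization looks roughly like this (a sketch; model_id, tokenizer, and calibration_texts stand in for my actual setup):

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

tokenizer = AutoTokenizer.from_pretrained(model_id)
gptq_config = GPTQConfig(
    bits=4, group_size=32, desc_act=False,
    dataset=calibration_texts, tokenizer=tokenizer)
# Quantization runs during from_pretrained when a GPTQConfig is supplied.
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=gptq_config, device_map="auto")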

May I ask whether you pass any parameters to LLM besides model_id and max_model_len?

manitadayon avatar Mar 05 '25 16:03 manitadayon

@mgoin Thanks for your response, could you test with TP=2? I tested locally and TP=1 produced reasonable results. If I remember correctly, we downloaded the model from Qwen1.5-14B-Chat-GPTQ-Int4

jeejeelee avatar Mar 05 '25 16:03 jeejeelee

After more testing, I don't think NaN is causing this. See this:

https://huggingface.co/bartowski/Llama-3.1-Nemotron-70B-Instruct-HF-GGUF/discussions/5

It describes a similar problem, and the issue there was corrupted files; it is not even a Qwen model.

manitadayon avatar Mar 05 '25 20:03 manitadayon

@manitadayon Could #13750 be related?

rainkert avatar Mar 06 '25 12:03 rainkert

@rainkert no, that PR is strictly fixing a bug with gptq_marlin for MoE layers

mgoin avatar Mar 06 '25 14:03 mgoin