vLLM for Qwen 2.5 72B produces all !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! outputs regardless of prompt, given GPTQ 4-bit quantization
Your current environment
I performed GPTQ quantization on Qwen 2.5 72B Instruct using the AutoGPTQ package, with the following configuration: group_size = 32, desc_act = True. Then I load the model in vLLM with the following configuration:
from vllm import LLM
from transformers import AutoTokenizer

llm = LLM(model = model_path, max_model_len = 20000)
tokenizer = AutoTokenizer.from_pretrained(model_path)
messages = [
    {
        "role": "system",
        "content": system_message,
    },
    {
        "role": "user",
        "content": user_message,
    },
]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize = True, add_generation_prompt = True)
output = llm.generate(...)
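For completeness, the elided generate call can be made with the pre-tokenized prompt roughly as follows (a sketch; the SamplingParams values are illustrative, not the original settings):

from vllm import SamplingParams

# Illustrative values only; the original call's arguments are omitted above.
sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
output = llm.generate(
    {"prompt_token_ids": tokenized_chat},  # pass the pre-tokenized chat directly
    sampling_params,
)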
However, regardless of the prompt, the output is always !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
The same code works perfectly fine for Llama 3.3 and 3.1 70B.
Is Qwen 2.5 72B not compatible with vLLM? I have the latest versions of vLLM and Transformers, installed via
!pip install --upgrade vllm
!pip install --upgrade transformers
Any help would be appreciated.
🐛 Describe the bug
The output is always !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! no matter the input, the prompt, or other configuration.
Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
Is vLLM compatible with GPTQ 4-bit quantization of Qwen Instruct? Has anyone run this successfully? Do you know whether this is related to the 4-bit quantization, or whether there is some other problem I am not aware of?
maybe fixed by https://github.com/vllm-project/vllm/pull/11493
Sorry, can you elaborate? I looked at the PR and I do not know what I should do to fix the problem. I am not passing the quantization parameter to LLM, so I think I am using the GPTQ Marlin kernel, but I still have the error. Technically, based on the PR, I should not even have the issue in the first place since it is already merged. I may be missing something; if so, please elaborate.
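For what it's worth, one way to rule the kernel choice in or out (my own suggestion, not something the PR calls for) is to pin the quantization method explicitly instead of relying on auto-detection; the model path is a placeholder:

from vllm import LLM

model_path = "/path/to/qwen2.5-72b-instruct-gptq"  # placeholder for the local checkpoint

# Force the plain GPTQ kernel instead of letting vLLM upgrade to GPTQ Marlin,
# to check whether the Marlin path is what produces the "!!!" output.
llm = LLM(model=model_path, quantization="gptq", max_model_len=20000)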
May I ask what version of vllm you have?
I use !pip install --upgrade vllm
The version is 0.7.3
@jeejeelee
Please take a look and help
@manitadayon
- Can Qwen/Qwen2-72B-Instruct-GPTQ-Int4 run normally?
- Is there any nan in your parameters?
I will try the Qwen2 one and let you know. I do not think I have NaN in my parameters.
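For reference, a minimal way to scan the saved quantized checkpoint for NaNs might look like this (a sketch; the checkpoint path is a placeholder):

import glob
import torch
from safetensors.torch import load_file

# Placeholder path to the quantized model directory.
for shard in glob.glob("/path/to/quantized-qwen/*.safetensors"):
    state_dict = load_file(shard)
    for name, tensor in state_dict.items():
        # Only floating-point tensors can contain NaN; packed int weights are skipped.
        if tensor.is_floating_point() and torch.isnan(tensor).any():
            print(f"NaN found in {name} ({shard})")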
Any GPTQ-Int4 model from the official hf repository is fine, it doesn't have to be qwen2, it can be qwen2.5.
Could you please provide your running script?
#13035 possibly related?
I tried Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4 from Hugging Face, and it worked, although to be honest the performance was extremely weak. I see its configuration was desc_act = False, group_size = 128; my configuration is desc_act = True, group_size = 32.
@jeejeelee Sure, please see the following:
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quant_config = BaseQuantizeConfig(
    bits = 4, group_size = 32, desc_act = True, damp_percent = 0.01)
quantize_model = AutoGPTQForCausalLM.from_pretrained(
    model, quantize_config = quant_config, max_memory = max_memory)
quantize_model.quantize(dataset, batch_size = 1, use_triton = False)
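For context, AutoGPTQ's quantize() expects the calibration set as a list of tokenized examples; a minimal sketch of how `dataset` can be built (the texts are placeholders) is:

from transformers import AutoTokenizer

# Placeholder calibration texts; in practice these come from the actual calibration corpus.
calibration_texts = [
    "Example calibration document one.",
    "Example calibration document two.",
]
tokenizer = AutoTokenizer.from_pretrained(model)  # same base model path as above
# Each example is a dict with "input_ids" and "attention_mask" tensors.
dataset = [tokenizer(text, return_tensors="pt") for text in calibration_texts]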
After more testing, I realized the bad performance is due to the long context length, which is strange since the same configuration worked great with Llama. Regardless, I am still trying to understand why my GPTQ quantization did not work with vLLM.
@jeejeelee @noooop Do you think this is because of the AutoGPTQ package, or because of the desc_act and group_size configuration I described above? Either way it is very strange to me.
@jeejeelee
According to https://github.com/vllm-project/vllm/pull/11493, NaN results in the Qwen model will lead to the !!!!! output.
The GPU-to-CPU conversion actually happens in the sampler. Adding NaN detection before that point would synchronize the CUDA stream, resulting in performance degradation.
There is no particularly good place to add a runtime NaN check on hidden_or_intermediate_states.
I see. Why would it produce a NaN result? I am just trying to understand what action I need to take to resolve this: should I re-quantize the model, or should I change the model parameters?
We first need to locate and confirm the problem, then try to solve it.
@manitadayon
Is there any way you can modify the vllm code (in your Python site-packages) to output
print(torch.isnan(hidden_or_intermediate_states).any())
before
https://github.com/vllm-project/vllm/blob/bb5b640359cc6695cb7818a24680e226f72a4da7/vllm/worker/model_runner.py#L1788
It is a bit hacky, but reinstalling vllm from source takes a long time.
Sure. Given my setup it is pretty difficult to modify the package and reinstall it from source, but I will try to test this idea.
@noooop Thank you for providing this very useful information. I will verify the NaN output ASAP
@noooop @manitadayon I can reproduce this issue using Qwen1.5-14B-Chat-GPTQ, and I have now implemented a temporary solution that fixes it locally, please see: https://github.com/jeejeelee/vllm/blob/qwen2-overflow-clamp/vllm/model_executor/models/qwen2.py#L237-L246. The code snippet used to reproduce it is as follows:
import vllm
from vllm import SamplingParams
MODEL_PATH = "/model/Qwen1.5-14B-Chat-GPTQ"
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
template = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n{question}<|im_end|>\n<|im_start|>assistant\n"
prompts = [template.format(question=prompt) for prompt in prompts]
sampling_params = SamplingParams(temperature=0.0, top_p=0.95)
llm = vllm.LLM(
MODEL_PATH,
max_num_seqs=2,
trust_remote_code=True,
max_model_len=1024,
tensor_parallel_size=2,
enforce_eager=True,
)
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
@jeejeelee
Didn't expect that.
I thought this code had been optimized for several years and was bug-free already.
I think all GPTQ models, and indeed all quantized models using fp16, will be affected.
This means we either need to add cast_overflow_tensors to all quantized layers, or
we need to modify the CUDA kernel to solve this problem.
In fact, I don't know why quantized models use fp16 as the default dtype and convert bf16 models to fp16.
AWQ Marlin and GPTQ Marlin both support bf16:
https://github.com/vllm-project/vllm/blob/32985bed7c88f654b11f919ead34d77e846c32e3/vllm/model_executor/layers/quantization/awq_marlin.py#L82
https://github.com/vllm-project/vllm/blob/32985bed7c88f654b11f919ead34d77e846c32e3/vllm/model_executor/layers/quantization/gptq_marlin.py#L108
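If fp16 overflow is indeed the cause, a possible workaround (an assumption, not a confirmed fix) is to force bf16 when loading the GPTQ checkpoint, since the Marlin kernels linked above support it; the model path is a placeholder:

from vllm import LLM

model_path = "/path/to/qwen2.5-72b-instruct-gptq"  # placeholder

# Run the GPTQ checkpoint in bf16 instead of the fp16 default to avoid
# the narrow fp16 range; requires a GPU with bf16 support.
llm = LLM(model=model_path, dtype="bfloat16", max_model_len=20000)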
@noooop I agree, I just want to give you feedback on the results of my testing
@mgoin Could you please look at this thread? Thanks.
Is there a model uploaded to HF that I can reproduce with? I would assume this issue is specific to group_size=32, is this accurate? I would not be surprised if there are issues with this config since there aren't many group_size=32 models out there.
I found a Qwen GPTQ model with "bits": 4, "group_size": 32, "damp_percent": 0.01, "desc_act": true that I uploaded here (https://huggingface.co/mgoin/Qwen1.5-14B-Chat-GPTQ), and was able to get a reasonable GSM8k eval on it:
lm_eval --model vllm --model_args pretrained=Qwen1.5-14B-Chat-GPTQ,quantization=gptq_marlin --tasks gsm8k --num_fewshot 5 --batch_size auto
...
vllm (pretrained=Qwen1.5-14B-Chat-GPTQ,quantization=gptq_marlin), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.6861|± |0.0128|
| | |strict-match | 5|exact_match|↑ |0.5383|± |0.0137|
So I'm going to say I haven't been able to reproduce this yet.
I just reproduced the error again. Unfortunately I cannot upload the model to HF. This time my configuration used the HF GPTQ quantization (as opposed to AutoGPTQ) with group_size = 32, desc_act = False (I have tried True as well, with no luck). Another thing to check may be the calibration data: I used my own rather than WikiText or C4, and I quantized Qwen-Instruct-72B, so maybe there is something about this model. Two more possibilities to check, then: the calibration data and the model itself.
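For reference, the HF GPTQ path described above looks roughly like this (model name, calibration texts, and output directory are placeholders):

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "Qwen/Qwen2.5-72B-Instruct"   # placeholder
calibration_texts = ["..."]              # the custom calibration corpus

tokenizer = AutoTokenizer.from_pretrained(model_id)
gptq_config = GPTQConfig(
    bits=4,
    group_size=32,
    desc_act=False,
    dataset=calibration_texts,   # a list of strings is accepted as a custom dataset
    tokenizer=tokenizer,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto",
)
model.save_pretrained("qwen2.5-72b-instruct-gptq-int4-g32")
tokenizer.save_pretrained("qwen2.5-72b-instruct-gptq-int4-g32")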
May I know if you pass any external parameters to LLM besides model_id and max_model_len?
@mgoin Thanks for your response, could you test with TP=2? I tested locally and TP=1 produced reasonable results. If I remember correctly, we downloaded the model from Qwen1.5-14B-Chat-GPTQ-Int4
After more testing, I don't think NaN is causing this. See
https://huggingface.co/bartowski/Llama-3.1-Nemotron-70B-Instruct-HF-GGUF/discussions/5
for a similar problem, where the issue turned out to be corrupted files. That is not even a Qwen model.
@manitadayon Could #13750 be related?
@rainkert no, that PR is strictly fixing a bug with gptq_marlin for MoE layers