[BUG] Encounter NaN in Int8 Models
Describe the bug
After updating auto_gptq to the main branch, I encounter NaN when running the Qwen-7B-Chat-Int8 model. Int4 models work well.
Hardware details: NVIDIA A100
Software versions: cuda=11.8, auto_gptq=0.7.0.dev (main branch, installed from source), pytorch=2.1.1, transformers=4.32.0
To Reproduce
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

checkpoint_path = "/path/to/Qwen-7B-Chat-Int8"

print('Loading tokenizer ...')
tokenizer = AutoTokenizer.from_pretrained(checkpoint_path, trust_remote_code=True)

print('Loading model ...')
model = AutoModelForCausalLM.from_pretrained(checkpoint_path, device_map="auto", trust_remote_code=True).eval()
model.generation_config = GenerationConfig.from_pretrained(checkpoint_path, trust_remote_code=True)

response, _ = model.chat(tokenizer, "hello", history=None)
print(response)
```
See error:
```
Traceback (most recent call last):
  File "/home/data/QWen-HF/test.py", line 35, in <module>
    response, _ = model.chat(tokenizer, "hello", history=None)
  File "/root/huggingface_home/modules/transformers_modules/modeling_qwen.py", line 1144, in chat
    outputs = self.generate(
  File "/root/huggingface_home/modules/transformers_modules/modeling_qwen.py", line 1266, in generate
    return super().generate(
  File "/root/miniconda3/envs/qwen-gptq/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/qwen-gptq/lib/python3.9/site-packages/transformers/generation/utils.py", line 1642, in generate
    return self.sample(
  File "/root/miniconda3/envs/qwen-gptq/lib/python3.9/site-packages/transformers/generation/utils.py", line 2760, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
```
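For context (not from the original report): `torch.multinomial` raises this error whenever the probability tensor itself contains NaN, and a single NaN logit is enough to poison every probability, since softmax propagates it through the shared normalizer. A minimal pure-Python sketch of that propagation:

```python
import math

def softmax(logits):
    """Naive softmax over a list of floats.

    If any logit is NaN, exp(NaN) is NaN, the sum is NaN, and every
    resulting probability is NaN -- which is exactly what makes
    torch.multinomial reject the distribution.
    """
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, float("nan"), 0.5])
print(any(math.isnan(p) for p in probs))  # → True
```

So the crash in `sample()` is only the symptom; the NaNs originate earlier, in the forward pass of the quantized model.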
I found that:
- The Int4 model works well.
- After downgrading `auto_gptq` to tags/v0.6.0 OR deleting all compiled kernel files (such as `autogptq_cuda_256.cpython-39-x86_64-linux-gnu.so`), everything works well.
same problem
same problem!
I ran into this issue as well, in my case with a different 8-bit quant, TheBloke/Mistral-7B-Instruct-v0.2-GPTQ:gptq-8bit-32g-actorder_True. An additional clue, in case it's helpful: when the input contains fewer than 128 tokens, a forward pass returns all-NaN logits. But when the input has 128 or more tokens, the forward pass does not return NaNs.
Here is a Colab notebook to reproduce: https://colab.research.google.com/drive/1TWgUXtQ8MR_2v9B9eBFgXBGEr6r8cR_-?usp=sharing. It demonstrates both the NANs from the forward pass and the crash while generating.
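The sharp 128-token threshold suggests the kernel path used only for short sequences is at fault. A minimal stdlib sketch of the kind of length sweep used to find this (with a stand-in `fake_forward` that you would replace with a real tokenize-and-forward call on the quantized model):

```python
import math

def logits_contain_nan(logits):
    """Return True if any value in the 2-D logits (list of rows) is NaN."""
    return any(math.isnan(v) for row in logits for v in row)

def probe(forward, lengths):
    """Run `forward` (a stand-in for a model forward pass) at each input
    length and report which lengths produce NaN logits."""
    return {n: logits_contain_nan(forward(n)) for n in lengths}

# Stand-in forward pass mimicking the reported behavior:
# NaNs appear only when the input is shorter than 128 tokens.
def fake_forward(n):
    return [[float("nan")] if n < 128 else [0.0]]

print(probe(fake_forward, [64, 127, 128, 256]))
# → {64: True, 127: True, 128: False, 256: False}
```

On a real model the equivalent check would be `torch.isnan(model(**inputs).logits).any()` at each input length.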
Same problem, still not fixed.