[BUG] Encounter NaN in Int8 Models
Describe the bug
After updating auto_gptq to the main branch, I encounter NaN when running the Qwen-7B-Chat-Int8 model. Int4 models work well.
Hardware details: NVIDIA A100
Software versions: cuda=11.8, auto_gptq=0.7.0.dev (main branch, installed from source), pytorch=2.1.1, transformers=4.32.0
To Reproduce
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

checkpoint_path = "/path/to/Qwen-7B-Chat-Int8"

print('Loading tokenizer ...')
tokenizer = AutoTokenizer.from_pretrained(checkpoint_path, trust_remote_code=True)

print('Loading model ...')
model = AutoModelForCausalLM.from_pretrained(checkpoint_path, device_map="auto", trust_remote_code=True).eval()
model.generation_config = GenerationConfig.from_pretrained(checkpoint_path, trust_remote_code=True)

response, _ = model.chat(tokenizer, "hello", history=None)
print(response)
```
See error:
```
Traceback (most recent call last):
  File "/home/data/QWen-HF/test.py", line 35, in <module>
    response, _ = model.chat(tokenizer, "hello", history=None)
  File "/root/huggingface_home/modules/transformers_modules/modeling_qwen.py", line 1144, in chat
    outputs = self.generate(
  File "/root/huggingface_home/modules/transformers_modules/modeling_qwen.py", line 1266, in generate
    return super().generate(
  File "/root/miniconda3/envs/qwen-gptq/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/qwen-gptq/lib/python3.9/site-packages/transformers/generation/utils.py", line 1642, in generate
    return self.sample(
  File "/root/miniconda3/envs/qwen-gptq/lib/python3.9/site-packages/transformers/generation/utils.py", line 2760, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
```
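For context (not from the original report): `torch.multinomial` raises this error whenever the probability tensor itself contains NaN, and a single NaN logit is enough to poison every probability, since softmax propagates it through the shared normalizer. A minimal pure-Python sketch of that propagation:

```python
import math

def softmax(logits):
    """Naive softmax over a list of floats.

    If any logit is NaN, exp(NaN) is NaN, the sum is NaN, and every
    resulting probability is NaN -- which is exactly what makes
    torch.multinomial reject the distribution.
    """
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, float("nan"), 0.5])
print(any(math.isnan(p) for p in probs))  # → True
```

So the crash in `sample()` is only the symptom; the NaNs originate earlier, in the forward pass of the quantized model.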
I found that:
- The Int4 model works well.
- After downgrading `auto_gptq` to tags/v0.6.0 OR deleting all compiled kernel files (such as `autogptq_cuda_256.cpython-39-x86_64-linux-gnu.so`), everything works well.
same problem
same problem!
I ran into this issue as well, in my case with a different 8-bit quant, TheBloke/Mistral-7B-Instruct-v0.2-GPTQ:gptq-8bit-32g-actorder_True. An additional clue, in case it's helpful: when the input contains fewer than 128 tokens, a forward pass returns all-NaN logits. But when the input has 128 or more tokens, the forward pass does not return NaNs.
Here is a Colab notebook to reproduce: https://colab.research.google.com/drive/1TWgUXtQ8MR_2v9B9eBFgXBGEr6r8cR_-?usp=sharing. It demonstrates both the NANs from the forward pass and the crash while generating.
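The sharp 128-token threshold suggests the kernel path used only for short sequences is at fault. A minimal stdlib sketch of the kind of length sweep used to find this (with a stand-in `fake_forward` that you would replace with a real tokenize-and-forward call on the quantized model):

```python
import math

def logits_contain_nan(logits):
    """Return True if any value in the 2-D logits (list of rows) is NaN."""
    return any(math.isnan(v) for row in logits for v in row)

def probe(forward, lengths):
    """Run `forward` (a stand-in for a model forward pass) at each input
    length and report which lengths produce NaN logits."""
    return {n: logits_contain_nan(forward(n)) for n in lengths}

# Stand-in forward pass mimicking the reported behavior:
# NaNs appear only when the input is shorter than 128 tokens.
def fake_forward(n):
    return [[float("nan")] if n < 128 else [0.0]]

print(probe(fake_forward, [64, 127, 128, 256]))
# → {64: True, 127: True, 128: False, 256: False}
```

On a real model the equivalent check would be `torch.isnan(model(**inputs).logits).any()` at each input length.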
Same problem, still not fixed.