AutoAWQ
[BUG] Quantizing GPT NeoX raises an error
First of all, thank you for the great work.
System info
autoawq==0.1.8
Details
While trying to quantize a GPT NeoX model, I encountered the error below.
>>> from awq import AutoAWQForCausalLM
>>> from transformers import AutoTokenizer
>>> model_path = 'EleutherAI/gpt-neox-20b'
>>> quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
>>> model = AutoAWQForCausalLM.from_pretrained(model_path)
>>> tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
>>> model.quantize(tokenizer, quant_config=quant_config)
Generating validation split: 214670 examples [00:09, 22942.50 examples/s]
AWQ: 0%| | 0/44 [00:00<?, ?it/s]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/awq/models/base.py", line 93, in quantize
quantizer.quantize()
File "/usr/local/lib/python3.10/dist-packages/awq/quantize/quantizer.py", line 95, in quantize
input_feat = self._get_input_feat(self.modules[i], named_linears)
File "/usr/local/lib/python3.10/dist-packages/awq/quantize/quantizer.py", line 406, in _get_input_feat
self.inps = layer(self.inps, **module_kwargs)[0]
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 690, in forward
attention_layer_outputs = self.attention(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 213, in forward
attn_output, attn_weights = self._attn(query, key, value, attention_mask, head_mask)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 287, in _attn
attn_scores = attn_scores + attention_mask
RuntimeError: The size of tensor a (512) must match the size of tensor b (60) at non-singleton dimension 2
After digging into the code, it turned out that this is the line that breaks. The GPT NeoX model transforms an attention mask of shape (batch_size, seq_len) into (batch_size, 1, 1, seq_len) before it reaches GPTNeoXLayer. During quantization, the input passed to GPTNeoXModel does not include attention_mask, so this transformation never happens. The untransformed attention mask, whose shape is not compatible with GPTNeoXLayer, is then added in the line shown above in the traceback, and the error is raised.
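For illustration, here is a minimal sketch (not AutoAWQ or transformers code) of the mask expansion that GPTNeoXModel normally performs before its decoder layers run. During calibration the layer is called directly with the raw 2D mask, so this step is skipped and the broadcast in attn_scores + attention_mask fails.

import torch

# 2D padding mask as produced by the tokenizer: (batch_size, seq_len)
batch_size, seq_len = 1, 512
attention_mask = torch.ones(batch_size, seq_len, dtype=torch.long)

# Shape expected inside the attention module: an additive mask of shape
# (batch_size, 1, 1, seq_len) with 0.0 for kept positions and a large
# negative value for masked positions.
expanded = attention_mask[:, None, None, :].to(torch.float16)
expanded = (1.0 - expanded) * torch.finfo(torch.float16).min

print(attention_mask.shape)  # torch.Size([1, 512])
print(expanded.shape)        # torch.Size([1, 1, 1, 512])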
I think prepare_inputs_for_generation should be called before feeding inputs to modules[0], since that is what model.generate() does in transformers.
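As a rough sketch of that idea (using a plain transformers model here; this is not the actual AutoAWQ fix), prepare_inputs_for_generation could be used to build the keyword arguments the same way model.generate() would, so attention_mask arrives in a form the layers expect:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = 'EleutherAI/gpt-neox-20b'  # any GPT-NeoX checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16)

enc = tokenizer("sample calibration text", return_tensors="pt")

# generate() calls this internally; the exact keys returned depend on the
# model and the installed transformers version.
prepared = model.prepare_inputs_for_generation(
    enc["input_ids"], attention_mask=enc["attention_mask"]
)
print(prepared.keys())

The resulting dict could then be merged into the module kwargs before the first decoder block is run during calibration, mirroring what happens inside generate().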
Hi kevin, I encountered the same problem. Did you solve it?
This kind of transient issue has been popping up ever since transformers 4.36 was released. Unfortunately, the way transformers handles input arguments since 4.36 makes these issues hard to deal with in general, and for now I am not sure how to solve them generally. One workaround is to pip install transformers<4.36.0 and find an AutoAWQ version that works with it.
CC @younesbelkada
I had the same issue today; installing transformers 4.35.2 seems to have worked.