AutoAWQ
Batching with fuse_layers = True leads to different outputs
Text:
text = ['Short summary of self-attention:', 'Short summary of Ukraine war:', 'Short summary of relativity:']
Model:
model = AutoAWQForCausalLM.from_quantized(model_path, fuse_layers = True, safetensors = True)
Processing them sequentially (one at a time) leads to the outputs:
Self-attention is a mechanism that allows models to weigh the importance of different parts of input when making predictions. It does this by computing attention scores between each pair of inputs, and then using these scores to compute weighted averages of ...
The conflict in eastern Ukraine began in 2014, after the ousting of President Viktor Yanukovych. Pro-Russian separatists seized control of several cities and regions in the east, forming two self ...
Relativity is a theory in physics that explains the laws of space and time. It was developed by Albert Einstein, who proposed two theories: Special Relativity (1905) and General Relativity (1915).
But if you batch these together:
Self-attention is a mechanism that allows models to weigh the importance of different parts of input when making predictions. It does this by computing attention scores between each pair of inputs, and then using these scores to compute weighted averages of ...
The Ukrainian conflict began in 2014, when pro-Russian separatists seized control of government buildings in several cities in eastern Ukraine. This led to a military intervention by the Ukrainian army and ultimately a cease ...
- The laws of physics are the same for all observers in uniform motion relative to one another (principle of relativity).
- The speed of light is always constant, regardless of the motion of its source or observer
So essentially every prompt but the first in the batch produces a different output than expected. This problem only happens with fuse_layers = True.
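For reference, a minimal repro sketch of the two runs (the model path is a placeholder and the pad-token/padding-side setup is an assumption; the generation settings follow the snippet further down the thread):

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "path/to/awq-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer.pad_token = tokenizer.eos_token  # assumption: needed so batched inputs can be padded
tokenizer.padding_side = "left"            # assumption: matches the left-padded inputs shown later in the thread
model = AutoAWQForCausalLM.from_quantized(model_path, fuse_layers=True, safetensors=True)

text = ['Short summary of self-attention:', 'Short summary of Ukraine war:', 'Short summary of relativity:']

# Sequential: one prompt at a time, so no padding is involved
for t in text:
    inputs = tokenizer(t, return_tensors="pt").to('cuda')
    out = model.generate(**inputs, max_new_tokens=50, repetition_penalty=1.2)
    print(tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))

# Batched: all prompts at once, so the shorter prompts get padded
inputs = tokenizer(text, padding=True, return_tensors="pt").to('cuda')
out = model.generate(**inputs, max_new_tokens=50, repetition_penalty=1.2)
print(tokenizer.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True))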
#215 should resolve this. I need to test it more to make sure it’s correct.
Can you drop an example?
Yeah, I tried this and it didn't work, unfortunately.
Can you show me the code you used to see the difference?
So this is with https://github.com/casper-hansen/AutoAWQ/pull/215, i.e. the condition if num_new_tokens == 1:
text = ['Short summary of self-attention:', 'Short summary of Ukraine war:']
model_inputs = tokenizer(
text, padding=True, truncation=True, return_tensors="pt").to('cuda')
generate_kwargs = dict(
model_inputs,
max_new_tokens=50,
repetition_penalty=1.2,
)
generated_output = model.generate(**generate_kwargs)
# Strips input prompts from answers
answers = tokenizer.batch_decode(
generated_output[:, model_inputs["input_ids"].shape[1] :],
skip_special_tokens=True,
)
Without fusing:
Self-attention is a mechanism that allows models to weigh the importance of different parts of input when making predictions. It does this by computing attention scores between each pair of inputs, and then using these scores to compute weighted averages of
The conflict in eastern Ukraine began in 2014, after the ousting of President Viktor Yanukovych. Pro-Russian separatists seized control of several cities and regions in the east, forming two self
With fusing:
\begin{itemize} \item Self-attention is a mechanism that allows models to weigh the importance of different input elements in making predictions. \item It does this by computing a weighted sum of all inputs, where
The conflict in eastern Ukraine began in 2014, after the ousting of President Viktor Yanukovych. Pro-Russian separatists seized control of several cities and regions in the east of the country,
If we use the old condition: if num_new_tokens in [0,1]:
Self-attention is a mechanism that allows models to weigh the importance of different parts of input when making predictions. It does this by computing attention scores between each pair of inputs, and then using these scores to compute weighted averages of
The Ukrainian conflict began in 2014, when pro-Russian separatists seized control of government buildings in several cities in eastern Ukraine. This led to a military intervention by the Ukrainian army and ultimately a cease
So with the old condition, the first prompt in the batch is correct, but with the new condition the whole batch is incorrect.
Also, if you set num_new_tokens == 1: and only use a batch size of 1, the output is wrong (I'm comparing against the correct unfused output). But with num_new_tokens in [0, 1]: it is fine.
This leads me to think prepare_input_ids might not be the issue.
Yeah, I am not sure the new PR was a good fix. I have to assess this further before the next release. Have you been able to identify where the issue could be?
So the issue seems to be primarily with padded tokens/attention masks, which then leads to the incorrect outputs for batched inputs. Example:
text = ['Short summary of relativity:']
No padding (correct)
Relativity is a theory in physics that explains the laws of space and time. It was developed by Albert Einstein, who proposed two theories: Special Relativity (1905) and General Relativity (1915).
Padded to 20 tokens
The theory of general relativity is a theory in physics that describes the laws of gravity and their relation to other forces of nature. It was developed by Albert Einstein in 1915, and it has been extensively tested and confirmed
Padded to 30 tokens
The theory of general relativity is a theory in physics that describes the laws of physics in terms of space and time. It was developed by Albert Einstein, who published his theory in 1915. The theory explains how gravity works
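Presumably this was produced with something like the following, using the model and tokenizer set up as in the first snippet (the exact generation settings here are an assumption):

text = ['Short summary of relativity:']

for pad_len in (None, 20, 30):
    if pad_len is None:
        # No padding
        model_inputs = tokenizer(text, return_tensors="pt").to('cuda')
    else:
        # Left-pad the single prompt up to pad_len tokens
        model_inputs = tokenizer(
            text, padding='max_length', max_length=pad_len, return_tensors="pt").to('cuda')
    out = model.generate(**model_inputs, max_new_tokens=50, repetition_penalty=1.2)
    print(tokenizer.decode(
        out[0, model_inputs["input_ids"].shape[1]:], skip_special_tokens=True))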
I think the problem is that the forward method
https://github.com/casper-hansen/AutoAWQ/blob/9c3dfa078f82b164d886c1378784e8d1510a2a97/awq/modules/fused/model.py#L22
is not using the attention_mask from the tokenizer, but rather creating a default triangular one: https://github.com/casper-hansen/AutoAWQ/blob/9c3dfa078f82b164d886c1378784e8d1510a2a97/awq/utils/fused_utils.py#L40-L48
I'll try and look into it further
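To illustrate the difference with standalone tensors (this is just a sketch, not the actual fused_utils code):

import torch

seqlen = 10
neg = torch.finfo(torch.float16).min        # finite stand-in for -inf, as transformers uses
attention_mask = torch.tensor([0, 0, 0, 1, 1, 1, 1, 1, 1, 1])   # a left-padded prompt (0 = pad)

# What the fused prefill effectively builds: a plain triangular (causal) mask
causal = torch.triu(torch.full((seqlen, seqlen), neg), diagonal=1)

# What the tokenizer's attention_mask should additionally contribute:
# block every query from attending to the three padded key positions
padding = torch.zeros(seqlen)
padding[attention_mask == 0] = neg
combined = causal + padding[None, :]        # broadcast over the query dimension

With causal alone, every query position can still attend to the pad tokens, which would explain why the padded outputs above drift away from the unpadded one.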
It looks like the main problem here is that you are not passing in the input_ids, but rather the full output of the tokenizer. Please see the example below:
https://github.com/casper-hansen/AutoAWQ/blob/8110e028c7fe496287d9092d2255f3b7fa6bdd2d/examples/basic_generate.py#L23-L33
No, this is not the issue. I am passing both the input_ids and the attention_mask from the tokenizer to generate(). The reason your example above works is because:
- There is no batching
- As a corollary, there is no padding (this is where the issue arises)
To explain more clearly:
If I want to pad an input text (as is needed in batching):
text = ['Short summary of relativity:']
model_inputs = tokenizer(
text, padding='max_length', max_length = 10, return_tensors="pt").to('cuda')
print(model_inputs)
we get 3 padding tokens (ID 2) and 7 prompt tokens:
{'input_ids': tensor([[ 2, 2, 2, 1, 11530, 14060, 302, 1016, 28283, 28747]],
device='cuda:0'), 'attention_mask': tensor([[0, 0, 0, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}
The problem is that this attention_mask is not being used/transformed during the prefill https://github.com/casper-hansen/AutoAWQ/blob/8110e028c7fe496287d9092d2255f3b7fa6bdd2d/awq/utils/fused_utils.py#L18-L26
so model.generate(**model_inputs) produces the same output as model.generate(input_ids=model_inputs['input_ids']).
Essentially this is just being treated as a prompt of </s></s></s><s> Short summary of relativity:.
We need to implement something like this, which takes the attention mask from the tokenizer into account when creating the 4D mask:
https://github.com/huggingface/transformers/blob/main/src/transformers/models/mistral/modeling_mistral.py#L892-L898
The main difference I see is that they add another mask to the causal mask to take the padding tokens into account (this ensures the model isn't attending to them during the attention calculation).
@younesbelkada any idea if this is the correct path forward?
It would be nice to have an integration with from transformers.modeling_attn_mask_utils import _prepare_4d_causal_attention_mask. I'm just not sure what the steps are to make it compatible, as the shapes will probably be different.
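For reference, a minimal sketch of calling the transformers helper on a padded batch (the shapes and padding pattern here are made up; how to reshape its output for the fused CUDA kernels is exactly the open question):

import torch
from transformers.modeling_attn_mask_utils import _prepare_4d_causal_attention_mask

batch_size, seq_length, hidden_size = 2, 10, 4096
attention_mask = torch.ones(batch_size, seq_length)   # 2D mask straight from the tokenizer
attention_mask[1, :3] = 0                              # left padding on the second sample
inputs_embeds = torch.zeros(batch_size, seq_length, hidden_size, dtype=torch.float16)
past_key_values_length = 0                             # prefill

# Returns an additive (batch, 1, q_len, kv_len) mask that is causal *and*
# blocks attention to the padded positions
mask_4d = _prepare_4d_causal_attention_mask(
    attention_mask, (batch_size, seq_length), inputs_embeds, past_key_values_length
)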
I implemented something like _prepare_4d_causal_attention_mask to take into account the padding tokens. This partially fixed the issue. However, when comparing the actual vs expected tensors, it appears that two operations are causing a divergence (I'm not saying these are wrong btw, just that they don't seem to be working with the padded inputs):
- The RoPE calculation
- The single_query_attention calculation during decoding
I need to look into these a bit further to see if I can narrow it down. Out of curiosity, how do you debug this if you can't run it locally?
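For context, this is roughly how the unfused transformers path that I'm comparing against derives position_ids from the attention mask when there is left padding (a sketch of transformers' prepare_inputs_for_generation logic, not AutoAWQ code):

import torch

attention_mask = torch.tensor([[0, 0, 0, 1, 1, 1, 1, 1, 1, 1]])   # the padded example above

position_ids = attention_mask.long().cumsum(-1) - 1    # real tokens get positions 0, 1, 2, ...
position_ids.masked_fill_(attention_mask == 0, 1)      # padded slots get a dummy position
print(position_ids)
# tensor([[1, 1, 1, 0, 1, 2, 3, 4, 5, 6]])

If the fused path instead derives positions purely from start_pos (i.e. 0..seq_len-1 for every row), the RoPE angles for left-padded rows would already differ at prefill, which could explain part of the divergence.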
The RoPE calculation should not be affected, but the attention could be. If you want to make it compatible, you would have to reshape the attention mask so that it fits the correct dimensions for these CUDA extensions to work.
For debugging, I usually rent from RunPod and SSH through VS Code.
Cool, will try and look into this