AutoAWQ

Batching with fuse_layers = True leads to different outputs

cassianlewis opened this issue on Nov 27 '23 • 14 comments

Text: text = ['Short summary of self-attention:', 'Short summary of Ukraine war:', 'Short summary of relativity:']

Model: model = AutoAWQForCausalLM.from_quantized(model_path, fuse_layers = True, safetensors = True)

Processing them sequentially (one at a time) leads to the outputs:

Self-attention is a mechanism that allows models to weigh the importance of different parts of input when making predictions. It does this by computing attention scores between each pair of inputs, and then using these scores to compute weighted averages of ...

The conflict in eastern Ukraine began in 2014, after the ousting of President Viktor Yanukovych. Pro-Russian separatists seized control of several cities and regions in the east, forming two self ...

Relativity is a theory in physics that explains the laws of space and time. It was developed by Albert Einstein, who proposed two theories: Special Relativity (1905) and General Relativity (1915).

But if you batch these together:

Self-attention is a mechanism that allows models to weigh the importance of different parts of input when making predictions. It does this by computing attention scores between each pair of inputs, and then using these scores to compute weighted averages of ...

The Ukrainian conflict began in 2014, when pro-Russian separatists seized control of government buildings in several cities in eastern Ukraine. This led to a military intervention by the Ukrainian army and ultimately a cease ...

  1. The laws of physics are the same for all observers in uniform motion relative to one another (principle of relativity).
  2. The speed of light is always constant, regardless of the motion of its source or observer

So essentially all but the first item in the batch produce different outputs than expected. This problem only happens when fuse_layers = True.
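
For reference, the comparison looks roughly like this (a sketch; it assumes the tokenizer matching the model above, with a pad token set so batching works):

text = ['Short summary of self-attention:',
        'Short summary of Ukraine war:',
        'Short summary of relativity:']

# Sequential: one prompt at a time, no padding involved.
for prompt in text:
    inputs = tokenizer(prompt, return_tensors="pt").to('cuda')
    out = model.generate(**inputs, max_new_tokens=50)
    print(tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))

# Batched: all prompts at once, so the shorter prompts get padded.
batch = tokenizer(text, padding=True, return_tensors="pt").to('cuda')
out = model.generate(**batch, max_new_tokens=50)
print(tokenizer.batch_decode(out[:, batch["input_ids"].shape[1]:], skip_special_tokens=True))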

cassianlewis avatar Nov 27 '23 16:11 cassianlewis

#215 should resolve this. I need to test it more to make sure it’s correct.

Can you drop an example?

casper-hansen avatar Nov 27 '23 18:11 casper-hansen

Yeah, I tried this and it didn't work, unfortunately.

cassianlewis avatar Nov 28 '23 17:11 cassianlewis

Can you show me the code you used to see the difference?

casper-hansen avatar Dec 03 '23 13:12 casper-hansen

So this is with https://github.com/casper-hansen/AutoAWQ/pull/215, i.e. with the new condition if num_new_tokens == 1:

# model (fuse_layers=True) and tokenizer assumed loaded as earlier in the thread;
# the tokenizer needs a pad token set for padding=True to work.
text = ['Short summary of self-attention:', 'Short summary of Ukraine war:']
model_inputs = tokenizer(
    text, padding=True, truncation=True, return_tensors="pt").to('cuda')

generate_kwargs = dict(
    model_inputs,
    max_new_tokens=50,
    repetition_penalty=1.2,
)

generated_output = model.generate(**generate_kwargs)

# Strips input prompts from answers
answers = tokenizer.batch_decode(
    generated_output[:, model_inputs["input_ids"].shape[1] :],
    skip_special_tokens=True,
)

Without fusing:

Self-attention is a mechanism that allows models to weigh the importance of different parts of input when making predictions. It does this by computing attention scores between each pair of inputs, and then using these scores to compute weighted averages of

The conflict in eastern Ukraine began in 2014, after the ousting of President Viktor Yanukovych. Pro-Russian separatists seized control of several cities and regions in the east, forming two self

With fusing:

\begin{itemize} \item Self-attention is a mechanism that allows models to weigh the importance of different input elements in making predictions. \item It does this by computing a weighted sum of all inputs, where

The conflict in eastern Ukraine began in 2014, after the ousting of President Viktor Yanukovych. Pro-Russian separatists seized control of several cities and regions in the east of the country,

If we use the old condition: if num_new_tokens in [0,1]:

Self-attention is a mechanism that allows models to weigh the importance of different parts of input when making predictions. It does this by computing attention scores between each pair of inputs, and then using these scores to compute weighted averages of

The Ukrainian conflict began in 2014, when pro-Russian separatists seized control of government buildings in several cities in eastern Ukraine. This led to a military intervention by the Ukrainian army and ultimately a cease

So with the old condition, the first item in the batch is correct. But with the new condition, the whole batch is incorrect.

Also, if you use the new condition (num_new_tokens == 1) with a batch size of just 1, the output is wrong (comparing against the correct unfused output). But with num_new_tokens in [0, 1] it is fine.

This leads me to think prepare_input_ids might not be the issue.

cassianlewis avatar Dec 04 '23 10:12 cassianlewis

Yeah, I am not sure the new PR was a good fix. I have to assess this further before the next release. Have you been able to identify where the issue could be?

casper-hansen avatar Dec 08 '23 12:12 casper-hansen

So the issue seems to be primarily with padded tokens/attention masks, which then leads to the incorrect outputs for batched inputs. Example:

text = ['Short summary of relativity:']

No padding (correct)

Relativity is a theory in physics that explains the laws of space and time. It was developed by Albert Einstein, who proposed two theories: Special Relativity (1905) and General Relativity (1915).

Padded to 20 tokens

The theory of general relativity is a theory in physics that describes the laws of gravity and their relation to other forces of nature. It was developed by Albert Einstein in 1915, and it has been extensively tested and confirmed

Padded to 30 tokens

The theory of general relativity is a theory in physics that describes the laws of physics in terms of space and time. It was developed by Albert Einstein, who published his theory in 1915. The theory explains how gravity works
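
For reference, these three runs can be reproduced with something like the following sketch (same model and tokenizer as above, with a pad token set):

text = ['Short summary of relativity:']

for max_len in (None, 20, 30):   # no padding, padded to 20 tokens, padded to 30 tokens
    if max_len is None:
        inputs = tokenizer(text, return_tensors="pt").to('cuda')
    else:
        inputs = tokenizer(text, padding='max_length', max_length=max_len,
                           return_tensors="pt").to('cuda')
    out = model.generate(**inputs, max_new_tokens=50, repetition_penalty=1.2)
    print(tokenizer.batch_decode(out[:, inputs["input_ids"].shape[1]:],
                                 skip_special_tokens=True))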

I think the problem is that the forward method

https://github.com/casper-hansen/AutoAWQ/blob/9c3dfa078f82b164d886c1378784e8d1510a2a97/awq/modules/fused/model.py#L22

is not using the attention_mask from the tokenizer, but rather creating a default triangular one: https://github.com/casper-hansen/AutoAWQ/blob/9c3dfa078f82b164d886c1378784e8d1510a2a97/awq/utils/fused_utils.py#L40-L48

I'll try and look into it further

cassianlewis avatar Dec 12 '23 13:12 cassianlewis

It looks like the main problem here is that you are not passing in the input_ids but instead the full output of the tokenizer. Please see the example below:

https://github.com/casper-hansen/AutoAWQ/blob/8110e028c7fe496287d9092d2255f3b7fa6bdd2d/examples/basic_generate.py#L23-L33

casper-hansen avatar Dec 12 '23 18:12 casper-hansen

No, this is not the issue. I am passing both the input_ids and attention_mask from the tokenizer to generate(). The reason your example above works is that:

  1. There is no batching
  2. As a corollary, there is no padding (this is where the issue arises)

To explain more clearly:

If I want to pad an input text (as is needed in batching):

text = ['Short summary of relativity:']
model_inputs = tokenizer(
    text, padding='max_length', max_length = 10, return_tensors="pt").to('cuda')

print(model_inputs)

we get 3 padding tokens (token id 2) and 7 prompt tokens:

{'input_ids': tensor([[    2,     2,     2,     1, 11530, 14060,   302,  1016, 28283, 28747]],
       device='cuda:0'), 'attention_mask': tensor([[0, 0, 0, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}

The problem is that this attention_mask is not being used or transformed during the prefill (https://github.com/casper-hansen/AutoAWQ/blob/8110e028c7fe496287d9092d2255f3b7fa6bdd2d/awq/utils/fused_utils.py#L18-L26), so model.generate(**model_inputs) has the same output as model.generate(input_ids=model_inputs['input_ids']).

Essentially this is just being treated as a prompt of </s></s></s><s> Short summary of relativity:.
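
This can be double-checked by decoding the padded input_ids directly:

print(tokenizer.decode(model_inputs['input_ids'][0]))
# roughly: '</s></s></s><s> Short summary of relativity:'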

cassianlewis avatar Dec 13 '23 11:12 cassianlewis

We need to implement something like this which takes into account the attention mask from the tokenizer when creating the 4d mask:

https://github.com/huggingface/transformers/blob/main/src/transformers/models/mistral/modeling_mistral.py#L892-L898

The main difference I see is that they add another mask to the causal mask to take the padding tokens into account (this ensures the model isn't attending to them during the attention calculation).
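
A minimal sketch of what that could look like for the padded example above (a hypothetical helper, not the transformers or AutoAWQ implementation):

import torch

def build_4d_mask(attention_mask, dtype=torch.float16):
    # attention_mask: (batch, seqlen), 1 for real tokens, 0 for padding.
    # Returns an additive (batch, 1, seqlen, seqlen) mask: 0 where attention is
    # allowed, the dtype's minimum value where it is not.
    bsz, seqlen = attention_mask.shape
    min_val = torch.finfo(dtype).min

    # Causal part: query position i may only attend to key positions <= i.
    causal = torch.triu(torch.ones(seqlen, seqlen), diagonal=1).bool()   # (seqlen, seqlen)

    # Padding part: no query position may attend to a padded key column.
    pad = (attention_mask == 0)[:, None, None, :]                        # (bsz, 1, 1, seqlen)

    mask = torch.zeros(bsz, 1, seqlen, seqlen, dtype=dtype)
    return mask.masked_fill(causal | pad, min_val)

# Padded example from above: 3 pad tokens followed by 7 prompt tokens.
mask = build_4d_mask(torch.tensor([[0, 0, 0, 1, 1, 1, 1, 1, 1, 1]]))
print(mask.shape)  # torch.Size([1, 1, 10, 10])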

cassianlewis avatar Dec 13 '23 11:12 cassianlewis

@younesbelkada any idea if this is the correct path forward?

cassianlewis avatar Dec 19 '23 10:12 cassianlewis

It would be nice to have an integration with _prepare_4d_causal_attention_mask from transformers.modeling_attn_mask_utils. I'm just not sure what the steps are to make it compatible, as the shapes will probably be different.
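
For reference, a rough sketch of how that helper is called in transformers (following the Mistral code linked above; the dummy embeddings and hidden size are just placeholders so the helper can infer dtype and device):

import torch
from transformers.modeling_attn_mask_utils import _prepare_4d_causal_attention_mask

# 2D padding mask from the tokenizer: (batch, seq_len), 1 = real token, 0 = pad.
attention_mask = torch.tensor([[0, 0, 0, 1, 1, 1, 1, 1, 1, 1]])
batch_size, seq_len = attention_mask.shape

# Dummy embeddings, only used by the helper to pick up dtype/device.
inputs_embeds = torch.zeros(batch_size, seq_len, 4096, dtype=torch.float16)

mask_4d = _prepare_4d_causal_attention_mask(
    attention_mask,          # 2D mask from the tokenizer
    (batch_size, seq_len),   # input shape
    inputs_embeds,
    0,                       # past_key_values_length (0 during prefill)
)
print(mask_4d.shape)  # (batch, 1, seq_len, seq_len) additive mask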

casper-hansen avatar Dec 21 '23 22:12 casper-hansen

I implemented something like _prepare_4d_causal_attention_mask to take into account the padding tokens. This partially fixed the issue. However, when comparing the actual vs expected tensors, it appears that two operations are causing a divergence (I'm not saying these are wrong btw, just that they don't seem to be working with the padded inputs):

  1. The RoPE calculation
  2. The single_query_attention calculation during decoding

I need to look into these a bit further to see if I can narrow it down. Out of curiosity, how do you debug this if you can't run it locally?

cassianlewis avatar Dec 24 '23 15:12 cassianlewis

The RoPE calculation should not be affected, but the attention could be. If you want to make it compatible, you would have to reshape the attention mask so that it fits the correct dimensions for these CUDA extensions to work.

For debugging, I usually rent from RunPod and SSH through VS Code.

casper-hansen avatar Dec 27 '23 20:12 casper-hansen

Cool, will try and look into this

cassianlewis avatar Jan 02 '24 09:01 cassianlewis