Padding causes forward to produce different logits (Llama2-7b)
System Info
- `transformers` version: 4.36.2
- Platform: Linux-5.15.107+-x86_64-with-glibc2.31
- Python version: 3.10.13
- Huggingface_hub version: 0.20.1
- Safetensors version: 0.4.1
- Accelerate version: 0.22.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.1.2+cu118 (True)
Who can help?
@ArthurZucker @yun
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
I have noticed that running `forward` on a padded sequence and an unpadded sequence yields (slightly) different logits, even with an attention mask specified.
Here is what I ran:
from transformers import AutoTokenizer, AutoModelForCausalLM
from torch import tensor
import torch
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16)
model.to(2)
model.eval()
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
prompt_1 = "I am so so confused."
prompt_2 = "I have never been more lost in my life!"
combined_prompt = [prompt_1, prompt_2]
combined_inputs = tokenizer(combined_prompt, padding=True, return_tensors="pt").to(2) # batch size 2
combined_inputs
>>> {'input_ids': tensor([[    2,     2,     2,     2,     1,   306,   626,   577,   577,  9613, 29889],
                          [    1,   306,   505,  2360,  1063,   901,  5714,   297,   590,  2834, 29991]], device='cuda:2'),
     'attention_mask': tensor([[0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
                               [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:2')}
combined_inputs_1 = {"input_ids": combined_inputs["input_ids"][0].unsqueeze(0), "attention_mask": combined_inputs["attention_mask"][0].unsqueeze(0)} # extracting just the first item in the batch
combined_inputs_1
>>> {'input_ids': tensor([[    2,     2,     2,     2,     1,   306,   626,   577,   577,  9613, 29889]], device='cuda:2'),
     'attention_mask': tensor([[0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]], device='cuda:2')}
# running forward prop and then visualising the last 7 logits
with torch.no_grad():
    combined_outputs_1 = model(**combined_inputs_1)
combined_logits_1 = combined_outputs_1.logits[0, 4:, :]
combined_logits_1
>>> tensor([[-12.5000, -7.0625, -0.6406, ..., -6.6250, -7.9062, -7.2812],
[ -9.5000, -12.1875, -1.1172, ..., -5.0625, -8.9375, -3.6250],
[ -7.0312, -4.4688, 2.1875, ..., -1.8438, -5.6562, -1.8984],
...,
[ -6.9375, -7.4062, 4.3438, ..., -2.8594, -3.1875, -3.1875],
[ -2.4219, -2.0000, 11.0625, ..., -0.6914, -0.1133, -1.4141],
[-11.8750, -10.8750, 8.3750, ..., -4.8125, -4.3750, -3.6094]],
device='cuda:2')
inputs_1 = tokenizer(prompt_1, padding=True, return_tensors="pt").to(2) # batch size 1
inputs_1
>>> {'input_ids': tensor([[ 1, 306, 626, 577, 577, 9613, 29889]], device='cuda:2'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]], device='cuda:2')}
Notice that `inputs_1` is the same as `combined_inputs_1`, except with the left padding omitted and the attention mask altered to match.
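A quick way to double-check this (a sketch reusing the tensors defined above; the padding length of 4 is specific to this example):
# sanity check: the unpadded ids should equal the tail of the padded row,
# and the attention mask should be all ones over that tail
n_pad = 4  # number of left-padding tokens in this example
assert torch.equal(inputs_1["input_ids"][0], combined_inputs_1["input_ids"][0, n_pad:])
assert bool(combined_inputs_1["attention_mask"][0, n_pad:].all())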
# running forward prop again
with torch.no_grad():
    outputs_1 = model(**inputs_1)
logits_1 = outputs_1.logits[0, :, :]
logits_1
>>> tensor([[-12.5000, -7.0625, -0.6406, ..., -6.6250, -7.9062, -7.2812],
[ -9.5000, -12.1875, -1.1016, ..., -5.0312, -8.9375, -3.6250],
[ -7.0625, -4.4375, 2.2188, ..., -1.8750, -5.7188, -1.9219],
...,
[ -6.9062, -7.3125, 4.3438, ..., -2.8594, -3.1875, -3.1406],
[ -2.4219, -2.0000, 11.0625, ..., -0.6680, -0.1445, -1.4062],
[-11.8125, -10.8125, 8.3750, ..., -4.7812, -4.3125, -3.5938]],
device='cuda:2')
Upon close inspection, you'll see that this tensor is slightly different from `combined_logits_1`. We can show this more clearly:
torch.sum(torch.abs(logits_1 - combined_logits_1))
>>> tensor(3722.9448, device='cuda:2')
Is this meaningful? Well, if we look at the probabilities:
torch.max(torch.abs(torch.nn.Softmax()(logits_1) - torch.nn.Softmax()(combined_logits_1)))
>>> tensor(0.0053, device='cuda:2')
That's a pretty non-trivial difference in probability!
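Whether that matters downstream depends on the use case, but one quick check is whether the greedy (argmax) next-token predictions ever differ (a sketch using the logits above):
# count the positions (out of 7) where the greedy pick changes
print((logits_1.argmax(dim=-1) != combined_logits_1.argmax(dim=-1)).sum())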
Expected behavior
I would expect the attention mask to mask out the left padding, making `inputs_1` and `combined_inputs_1` effectively identical during the forward pass, which should in turn mean that the logits produced are equivalent.
I realise that there may be small errors arising from batched GPU computations, but this error doesn't seem very small...
Cc @younesbelkada too :)
I believe this comment is relevant to this issue: https://github.com/huggingface/transformers/issues/25420#issuecomment-1775317535
On point @amyeroberts. TL;DR: it's expected.
Thanks @amyeroberts @ArthurZucker
I did a few more experiments based on the issue linked by @amyeroberts
KV Cache
I turned the KV cache off explicitly:
with torch.no_grad():
    combined_outputs_1 = model(**combined_inputs_1, use_cache=False)
etc., but this did not lead to any change. My understanding is that KV caching is disabled by default for `forward`, so I'm not surprised.
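For completeness, here is a quick sketch (reusing the tensors above) to check that the flag does not change the logits of a single forward pass; with no past_key_values supplied it should only control whether a cache is returned:
with torch.no_grad():
    out_cache = model(**combined_inputs_1, use_cache=True)
    out_no_cache = model(**combined_inputs_1, use_cache=False)
# the flag controls whether past_key_values are returned; with no cache
# passed in, the logits themselves should be unaffected
print(torch.equal(out_cache.logits, out_no_cache.logits))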
FP32
I loaded the model weights in fp32. There is still a noticeable difference, but it is smaller than with bf16.
torch.sum(torch.abs(logits_1 - combined_logits_1))
>>> tensor(169.4670, device='cuda:2')
torch.max(torch.abs(torch.nn.Softmax()(logits_1) - torch.nn.Softmax()(combined_logits_1)))
>>> tensor(0.0002, device='cuda:2')
FP16
I loaded the model weights in fp16. There is still a noticeable difference; it is smaller than for bf16 but larger than for fp32.
torch.sum(torch.abs(logits_1 - combined_logits_1))
>>> tensor(510.8704, device='cuda:2')
torch.max(torch.abs(torch.nn.Softmax()(logits_1) - torch.nn.Softmax()(combined_logits_1)))
>>> tensor(0.0006, device='cuda:2')
CPU
I ran forward prop on CPU rather than GPU. The difference is now tiny.
torch.sum(torch.abs(logits_1 - combined_logits_1))
>>> tensor(0.3935)
torch.max(torch.abs(torch.nn.Softmax()(logits_1) - torch.nn.Softmax()(combined_logits_1)))
>>> tensor(6.5565e-07)
Right Padding
I changed padding to right padding on CPU. The error is now even smaller but still non-zero:
torch.sum(torch.abs(logits_1 - combined_logits_1))
>>> tensor(0.2899)
torch.max(torch.abs(torch.nn.Softmax()(logits_1) - torch.nn.Softmax()(combined_logits_1)))
>>> tensor(1.1325e-06)
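For reference, a rough sketch of how these variants can be swept in one place (the helper name and structure are illustrative, not the exact script used above):
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def compare_padded_vs_unpadded(dtype=torch.bfloat16, device="cpu", padding_side="left"):
    # illustrative helper: load the model with the given dtype/device/padding
    # side and report the two difference metrics used in the experiments above
    tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
    tok.pad_token = tok.eos_token
    tok.padding_side = padding_side
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf", torch_dtype=dtype
    ).to(device).eval()

    prompts = ["I am so so confused.", "I have never been more lost in my life!"]
    batch = tok(prompts, padding=True, return_tensors="pt").to(device)
    single = tok(prompts[0], return_tensors="pt").to(device)

    with torch.no_grad():
        batched_logits = model(**batch).logits[0]
        single_logits = model(**single).logits[0]

    # keep only the positions of the first sequence's real (non-pad) tokens
    keep = batch["attention_mask"][0].bool()
    batched_logits = batched_logits[keep]

    abs_sum = torch.sum(torch.abs(single_logits - batched_logits))
    prob_max = torch.max(torch.abs(
        torch.softmax(single_logits.float(), dim=-1)
        - torch.softmax(batched_logits.float(), dim=-1)
    ))
    return abs_sum, prob_max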
Thoughts?
I found this bug too. You can test whether enabling SDPA or Flash Attention helps. When SDPA is used, the result seems to be correct. I do not know why this bug happens.
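For reference, in recent transformers versions the attention backend can be selected explicitly when loading the model (a minimal sketch; "flash_attention_2" additionally requires the flash-attn package):
from transformers import AutoModelForCausalLM
import torch

# sketch: pick the attention backend explicitly ("eager", "sdpa" or "flash_attention_2")
model_sdpa = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
)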
This was already answered: basically, eager attention still attends to padding tokens (because the output of the softmax is never exactly zero), whereas with exact implementations / kernels you get 0 for the padding tokens instead of a very tiny number. See #27050
I have done some experiments. If I use eager attention with the SDPA attention mask (version==4.37.2), the results are correct. However, with the eager-mode attention mask, the results are wrong. This happens when using left padding for inference.
The generated 4D attention masks look like this. Eager mode:
[[[[ 0., -65504., -65504., ..., -65504., -65504., -65504.],
[ 0., 0., -65504., ..., -65504., -65504., -65504.],
[ 0., 0., 0., ..., -65504., -65504., -65504.],
...,
[ 0., 0., 0., ..., 0., -65504., -65504.],
[ 0., 0., 0., ..., 0., 0., -65504.],
[ 0., 0., 0., ..., 0., 0., 0.]]],
[[[-65504., -65504., -65504., ..., -65504., -65504., -65504.],
[-65504., 0., -65504., ..., -65504., -65504., -65504.],
[-65504., 0., 0., ..., -65504., -65504., -65504.],
...,
[-65504., 0., 0., ..., 0., -65504., -65504.],
[-65504., 0., 0., ..., 0., 0., -65504.],
[-65504., 0., 0., ..., 0., 0., 0.]]]]
SDPA mode:
tensor([[[[ 0., -65504., -65504., ..., -65504., -65504., -65504.],
[ 0., 0., -65504., ..., -65504., -65504., -65504.],
[ 0., 0., 0., ..., -65504., -65504., -65504.],
...,
[ 0., 0., 0., ..., 0., -65504., -65504.],
[ 0., 0., 0., ..., 0., 0., -65504.],
[ 0., 0., 0., ..., 0., 0., 0.]]],
[[[0, 0, 0, ..., 0, 0, 0],
[-65504., 0., -65504., ..., -65504., -65504., -65504.],
[-65504., 0., 0., ..., -65504., -65504., -65504.],
...,
[-65504., 0., 0., ..., 0., -65504., -65504.],
[-65504., 0., 0., ..., 0., 0., -65504.],
[-65504., 0., 0., ..., 0., 0., 0.]]]])
I don't understand. What is wrong here?
Please take a look at the discussion above. If left padding is used, the output of the model is wrong. I found that the attention mask can be generated in eager mode or in SDPA mode. The difference is that, if no element in a row is attended, the SDPA mode sets that row of the attention mask to zero. If the SDPA-mode attention mask generation is used, the output of the model is correct. I tested with both the eager attention module and the SDPA attention module. I am wondering why this happens.
@SY-Xuan let's try to be clear when we say:
- results are correct / wrong: what is wrong for you? You did not share a generation, nor did you provide a snippet. Should I assume you are talking about @c3ianwu's results? Do you have the same setup as he does?
- I used x and y: there are many different combinations (SDPA attention, eager attention, etc.). Providing a small snippet of what you tested will help us understand what you mean by "if no element is attended the sdpa mode will set attention mask to zero. If sdpa mode attention mask generation is used, the output of the model is correct".
- outputs are wrong: are you talking about logits? About generation? Are you doing greedy decoding? Sampling? etc.
The reason why `sdpa` uses `0` attention is that there is a bug with `sdpa`: it does not support fully un-attended rows. `0` in the causal mask means that the position will be attended.
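As a small illustration of why fully un-attended rows need special handling (a toy sketch, not the actual transformers masking code):
import torch

# a row fully masked with the dtype's min value still softmaxes to a uniform
# distribution (every entry equals the row max), while a row of -inf produces
# NaNs; this is roughly why such rows get special-cased in the mask generation
min_val = torch.finfo(torch.float16).min  # -65504.0
print(torch.softmax(torch.full((6,), min_val), dim=-1))        # uniform 1/6
print(torch.softmax(torch.full((6,), float("-inf")), dim=-1))  # all NaN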
Now if you have a snippet with a reproducer, that will help.
Thanks for your kind reply. I think I made a mistake by using different dtypes. I have fixed it now. Sorry for wasting your time.
No worries! 🤗 I'll close this as completed