Batch elements interfere with each other with int8
System Info
- transformers version: cf0af9a31beb84e8feec77af51f72d063ba905aa
- bitsandbytes version: 0.37.1
- Platform: Linux-5.4.0-139-generic-x86_64-with-glibc2.31
- Python version: 3.9.16
- Huggingface_hub version: 0.12.1
- PyTorch version (GPU?): 2.0.0+cu117 (True)
- Using GPU in script?: yes: A100 in MIG mode
- Using distributed or parallel set-up in script?: no
Who can help?
@sgugger @muell
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
The outputs of a model for a given batch element depend on the other elements in the batch when using int8 inference. See the minimal example below. I'm not sure whether this is expected behavior.
import transformers
# Load BLOOM-560m with 8-bit weights via bitsandbytes
model = transformers.AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m", load_in_8bit=True, device_map="auto")
tokenizer = transformers.AutoTokenizer.from_pretrained("bigscience/bloom-560m")
# Run the prompt "A" alone, then batched together with "B"
out1 = model(**tokenizer(["A"], return_tensors="pt").to("cuda"))
out2 = model(**tokenizer(["A", "B"], return_tensors="pt").to("cuda"))
# The logits for the first token of "A" differ depending on the other batch element
print(out1['logits'][0][0])
print(out2['logits'][0][0])
print(out1['logits'][0][0] == out2['logits'][0][0])
> tensor([345.0000, 348.2500, 354.2500, ..., 206.2500, 206.2500, 206.2500],
device='cuda:0', dtype=torch.float16, grad_fn=<SelectBackward0>)
> tensor([344.7500, 347.7500, 353.7500, ..., 206.0000, 206.0000, 206.0000],
device='cuda:0', dtype=torch.float16, grad_fn=<SelectBackward0>)
> tensor([False, False, False, ..., False, False, False], device='cuda:0')
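For reference, a minimal sketch to quantify the discrepancy, assuming out1 and out2 from the snippet above are still in scope (the torch import is the only addition):

import torch
# Compare the logits of "A" at position 0 from the single-element and the two-element batch
a_alone = out1['logits'][0][0].float()
a_batched = out2['logits'][0][0].float()
print(torch.max(torch.abs(a_alone - a_batched)))       # largest element-wise deviation
print(torch.allclose(a_alone, a_batched, atol=1e-3))   # False, given the ~0.25+ differences printed above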
Expected behavior
The computation should be independent of the other batch elements, as it is for fp32 (see below):
import transformers
# Same model loaded in fp32 for comparison
model = transformers.AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m", load_in_8bit=False, device_map="auto").to("cuda")
tokenizer = transformers.AutoTokenizer.from_pretrained("bigscience/bloom-560m")
out1 = model(**tokenizer(["A"], return_tensors="pt").to("cuda"))
out2 = model(**tokenizer(["A", "B"], return_tensors="pt").to("cuda"))
# In fp32 the logits for "A" match (up to tiny numerical noise) regardless of the batch contents
print(out1['logits'][0][0])
print(out2['logits'][0][0])
print(out1['logits'][0][0] == out2['logits'][0][0])
> tensor([343.6242, 346.4580, 352.7924, ..., 205.3806, 205.3800, 205.3746],
grad_fn=<SelectBackward0>)
> tensor([343.6242, 346.4580, 352.7924, ..., 205.3806, 205.3800, 205.3746],
grad_fn=<SelectBackward0>)
> tensor([ True, True, True, ..., True, True, False])
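Note the single False in the fp32 comparison: exact equality is a strict test, and tiny numerical differences can remain even in fp32 (possibly from different matmul kernel paths for different batch shapes; that is an assumption, not verified). A tolerance-based check, again assuming out1 and out2 from the fp32 snippet above, makes the contrast with int8 clearer:

import torch
# With a small tolerance the fp32 logits for "A" should agree regardless of batch contents,
# whereas the int8 logits above differ by a visibly larger margin.
print(torch.max(torch.abs(out1['logits'][0][0] - out2['logits'][0][0])))
print(torch.allclose(out1['logits'][0][0], out2['logits'][0][0], rtol=1e-5, atol=1e-5))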
Edit 2023/03/22 Corrected the code for FP32.
cc @younesbelkada
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.