accelerate bloom-7b inference - RuntimeError: expected scalar type Half but found Float


System Info

Copy-and-paste the text below in your GitHub issue

- `Accelerate` version: 0.12.0
- Platform: Linux-5.15.0-1020-aws-x86_64-with-glibc2.31
- Python version: 3.9.4
- Numpy version: 1.23.3
- PyTorch version (GPU?): 1.11.0+cu102 (True)
- `Accelerate` default config:
        Not found

Information

[ ] The official example scripts
[X] My own modified scripts

Tasks

[ ] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
[x] My own task or dataset (give details below)

Reproduction

Follow up of https://github.com/huggingface/accelerate/issues/736

Tests were run on an AWS g4dn.xlarge machine (single GPU).

Using the following code snippet, I am able to load the BLOOM 7B model (bigscience/bloom-7b1).

import torch
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, AutoModel
from transformers import BloomTokenizerFast, BloomForSequenceClassification

hg_checkpoint = "bigscience/bloom-7b1"

print("Initializing tokenizer")
tokenizer = BloomTokenizerFast.from_pretrained(hg_checkpoint)

hg_model = BloomForSequenceClassification.from_pretrained(
    hg_checkpoint, device_map="auto", offload_folder="offload", offload_state_dict = True, torch_dtype=torch.float16
    )

print(hg_model)
print("Model loaded successfully")

pytorch_total_params = sum(p.numel() for p in hg_model.parameters())

print("Total number of parameters: ", pytorch_total_params)

Trying to run inference with the sequence classification model (reference: https://huggingface.co/docs/transformers/main/en/model_doc/bloom#transformers.BloomForSequenceClassification.forward.example-3)

inputs = tokenizer("Bloomberg has decided to publish a new report on the global economy", return_tensors="pt")
with torch.no_grad():
    logits = hg_model(**inputs).logits
    print(logits)

This throws the following error:

Model loaded successfully
Total number of parameters:  7069024256
{'input_ids': tensor([[206449,    333,  13897,   1809,  36424,    427,  69319,    267,   2084,
           6210,    664,    368,   9325,  42544]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
Traceback (most recent call last):
  File "/home/ubuntu/test1.py", line 36, in <module>
    logits = hg_model(**inputs).logits
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/accelerate/hooks.py", line 148, in new_forward
    output = old_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 1012, in forward
    logits = self.score(hidden_states)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/accelerate/hooks.py", line 148, in new_forward
    output = old_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 103, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: expected scalar type Half but found Float

Even if I explicitly convert the input_ids and attention_mask to torch.float16 as below:

inputs = tokenizer("Bloomberg has decided to publish a new report on the global economy", return_tensors="pt")
inputs["input_ids"] = inputs.input_ids.to(torch.float16)
inputs["attention_mask"] = inputs.attention_mask.to(torch.float16)
hg_model(**inputs)

the following error is produced:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/accelerate/hooks.py", line 148, in new_forward
    output = old_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 999, in forward
    transformer_outputs = self.transformer(
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/accelerate/hooks.py", line 148, in new_forward
    output = old_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 689, in forward
    inputs_embeds = self.word_embeddings(input_ids)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/accelerate/hooks.py", line 148, in new_forward
    output = old_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/sparse.py", line 158, in forward
    return F.embedding(
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/functional.py", line 2183, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.HalfTensor instead (while checking arguments for embedding)

Expected behavior

Inference should work without any errors.

Oct 14 '22 05:10 shrinath-suresh

Since you didn't put your inputs on the GPU, the generation part after the model runs is done on the CPU (Accelerate makes the model return outputs on the same device as the inputs). On CPU, most operations are not supported in float16, which is why you have this error.

You should just put your inputs on the GPU to solve the problem :-)

Oct 14 '22 14:10 sgugger

@sgugger Thanks for the quick reply. I did try that too:

inputs = tokenizer("Bloomberg has decided to publish a new report on the global economy", return_tensors="pt")
inputs = inputs.to("cuda:0")

inputs["input_ids"] = inputs.input_ids.to(torch.float16)
inputs["attention_mask"] = inputs.attention_mask.to(torch.float16)
print(inputs)
hg_model(**inputs)

{'input_ids': tensor([[   inf,   333., 13896.,  1809., 36416.,   427.,    inf,   267.,  2084.,
          6208.,   664.,   368.,  9328., 42560.]], device='cuda:0',
       dtype=torch.float16), 'attention_mask': tensor([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]],
       device='cuda:0', dtype=torch.float16)}

Same error

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/accelerate/hooks.py", line 148, in new_forward
    output = old_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 999, in forward
    transformer_outputs = self.transformer(
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/accelerate/hooks.py", line 148, in new_forward
    output = old_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 689, in forward
    inputs_embeds = self.word_embeddings(input_ids)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/accelerate/hooks.py", line 148, in new_forward
    output = old_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/sparse.py", line 158, in forward
    return F.embedding(
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/functional.py", line 2183, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.HalfTensor instead (while checking arguments for embedding)

Am I missing something here?

Full repo script - bloom7b.zip

Oct 14 '22 17:10 shrinath-suresh

As the error tells you, the inputs of the model must be integers. Why are you converting them to float16?
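To illustrate the point about integer inputs, here is a minimal sketch that runs without the BLOOM checkpoint. The token IDs are copied from the tokenizer output earlier in this thread; the embedding table and its sizes are hypothetical stand-ins, not the model's real layer. It shows that casting IDs to float16 both corrupts them (values above the float16 maximum of 65504 become inf, as in the printed tensor above) and makes them unusable as embedding indices, while moving them to another device leaves the dtype untouched.

```python
import torch
from torch import nn

# Token IDs copied from the tokenizer output shown earlier in this thread.
input_ids = torch.tensor([[206449, 333, 13897]])
print(input_ids.dtype)  # tokenizers return int64 ("Long") tensors

# Casting to float16 corrupts large IDs: anything above 65504 overflows to inf.
as_half = input_ids.to(torch.float16)
print(as_half)

# Hypothetical toy embedding table, just large enough to cover the IDs above.
embedding = nn.Embedding(num_embeddings=210_000, embedding_dim=4)

# Integer indices work.
vectors = embedding(input_ids)

# Floating-point indices are rejected by the embedding lookup.
try:
    embedding(as_half)
    float_lookup_failed = False
except RuntimeError as err:
    float_lookup_failed = True
    print("embedding rejected float16 indices:", err)

# Moving the inputs to the GPU (when one is available) does not change the dtype.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
on_device = input_ids.to(device)
print(on_device.dtype, on_device.device)
```

In other words, only the device of the inputs should change; the dtype should stay as the tokenizer produced it.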

Oct 14 '22 18:10 sgugger

@sgugger I converted the input to float16 because the model throws the following error when the input is int or long (the stack trace is in the first block of the issue description):

RuntimeError: expected scalar type Half but found Float

Oct 15 '22 17:10 shrinath-suresh

In fact, it's not accepting any of the data types:

Long

inputs = tokenizer("Bloomberg has decided to publish a new report on the global economy", return_tensors="pt")
inputs = inputs.to("cuda:0")

inputs["input_ids"] = inputs["input_ids"].to(torch.long)
inputs["attention_mask"] = inputs["attention_mask"].to(torch.long)

throws - RuntimeError: expected scalar type Half but found Float

Float

inputs = tokenizer("Bloomberg has decided to publish a new report on the global economy", return_tensors="pt")
inputs = inputs.to("cuda:0")

inputs["input_ids"] = inputs["input_ids"].to(torch.float)
inputs["attention_mask"] = inputs["attention_mask"].to(torch.float)

throws - RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.FloatTensor instead (while checking arguments for embedding)

Int

inputs = tokenizer("Bloomberg has decided to publish a new report on the global economy", return_tensors="pt")
inputs = inputs.to("cuda:0")

inputs["input_ids"] = inputs["input_ids"].to(torch.int)
inputs["attention_mask"] = inputs["attention_mask"].to(torch.int)

throws - RuntimeError: expected scalar type Half but found Float

Oct 15 '22 17:10 shrinath-suresh

@sgugger To get more clarity, I tested the model on a p3.8xlarge. It has 4 GPUs and more RAM, so the BLOOM 7B model can be loaded without offloading.

bloom7b_p3_8xlarge.zip

Without offloading, inference works as expected.

I only see the datatype error when the model is offloaded:

hg_model = BloomForSequenceClassification.from_pretrained(
    hg_checkpoint, device_map="auto", offload_folder="offload", offload_state_dict = True, torch_dtype=torch.float16
    )

Could this be a bug, or do we need to perform any intermediate steps before inference if the model is offloaded?

Oct 16 '22 05:10 shrinath-suresh

I dug into the issue a bit, as I didn't understand why you had the bug even with the right input dtype. It turns out it's a bug in Transformers: the randomly initialized head for classification is not in float16 as requested.

Note that the model predictions will be crap since the head is randomly initialized. Nevertheless, I'll fix the bug in the coming days :-)
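The kind of mismatch described here can be checked without loading the real checkpoint. The sketch below uses a hypothetical toy module (`ToyClassifier`, with a `transformer` body and a `score` head named after the layers in the traceback) and a hypothetical helper `params_not_in_dtype`; neither is part of Transformers or Accelerate. It only simulates the situation where the body is in float16 and the head is left in float32, and it is not the actual fix that went into Transformers.

```python
import torch
from torch import nn


class ToyClassifier(nn.Module):
    """Hypothetical stand-in: a 'pretrained' body plus a freshly initialized head."""

    def __init__(self, hidden_size: int = 8, num_labels: int = 2):
        super().__init__()
        self.transformer = nn.Linear(hidden_size, hidden_size)
        self.score = nn.Linear(hidden_size, num_labels, bias=False)


def params_not_in_dtype(model: nn.Module, dtype: torch.dtype) -> list:
    """Return the names of all parameters whose dtype differs from `dtype`."""
    return [name for name, p in model.named_parameters() if p.dtype != dtype]


model = ToyClassifier()

# Simulate the bug: only the body ends up in float16, the head stays in float32.
model.transformer.half()
mismatched = params_not_in_dtype(model, torch.float16)
print(mismatched)  # ['score.weight']

# Casting the head to the requested dtype removes the mismatch in this toy setup.
model.score.half()
print(params_not_in_dtype(model, torch.float16))  # []
```

Running the same helper on a real loaded model would list any parameters that were not cast to the requested `torch_dtype`. Whether manually casting the head is a safe workaround for an offloaded model is an assumption that has not been verified here.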

Oct 17 '22 12:10 sgugger

Thank you very much for looking into it @sgugger

Oct 17 '22 16:10 shrinath-suresh

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Nov 13 '22 15:11 github-actions[bot]

The above PR has been merged, so this should be solved :-)

Nov 14 '22 06:11 sgugger
