bloom-7b inference - RuntimeError: expected scalar type Half but found Float
System Info
Copy-and-paste the text below in your GitHub issue
- `Accelerate` version: 0.12.0
- Platform: Linux-5.15.0-1020-aws-x86_64-with-glibc2.31
- Python version: 3.9.4
- Numpy version: 1.23.3
- PyTorch version (GPU?): 1.11.0+cu102 (True)
- `Accelerate` default config:
Not found
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
- [x] My own task or dataset (give details below)
Reproduction
Follow up of https://github.com/huggingface/accelerate/issues/736
Tests are run on an AWS g4dn.xlarge machine (single GPU).
Using the following code snippet, I am able to load the BLOOM 7B model.
import torch
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, AutoModel
from transformers import BloomTokenizerFast, BloomForSequenceClassification
hg_checkpoint = "bigscience/bloom-7b1"
print("Initializing tokenizer")
tokenizer = BloomTokenizerFast.from_pretrained(hg_checkpoint)
hg_model = BloomForSequenceClassification.from_pretrained(
    hg_checkpoint, device_map="auto", offload_folder="offload", offload_state_dict=True, torch_dtype=torch.float16
)
print(hg_model)
print("Model loaded successfully")
pytorch_total_params = sum(p.numel() for p in hg_model.parameters())
print("Total number of parameters: ", pytorch_total_params)
Trying to run inference with the sequence classification model. Reference: https://huggingface.co/docs/transformers/main/en/model_doc/bloom#transformers.BloomForSequenceClassification.forward.example-3
inputs = tokenizer("Bloomberg has decided to publish a new report on the global economy", return_tensors="pt")
with torch.no_grad():
logits = hg_model(**inputs).logits
print(logits)
This produces the following output and error:
Model loaded successfully
Total number of parameters: 7069024256
{'input_ids': tensor([[206449, 333, 13897, 1809, 36424, 427, 69319, 267, 2084,
6210, 664, 368, 9325, 42544]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
Traceback (most recent call last):
File "/home/ubuntu/test1.py", line 36, in <module>
logits = hg_model(**inputs).logits
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/accelerate/hooks.py", line 148, in new_forward
output = old_forward(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 1012, in forward
logits = self.score(hidden_states)
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/accelerate/hooks.py", line 148, in new_forward
output = old_forward(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 103, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: expected scalar type Half but found Float
Even if I explicitly convert the input_ids and attention_mask to torch.float16 as below,
inputs = tokenizer("Bloomberg has decided to publish a new report on the global economy", return_tensors="pt")
inputs["input_ids"] = inputs.input_ids.to(torch.float16)
inputs["attention_mask"] = inputs.attention_mask.to(torch.float16)
hg_model(**inputs)
the following error is produced:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/accelerate/hooks.py", line 148, in new_forward
output = old_forward(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 999, in forward
transformer_outputs = self.transformer(
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/accelerate/hooks.py", line 148, in new_forward
output = old_forward(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 689, in forward
inputs_embeds = self.word_embeddings(input_ids)
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/accelerate/hooks.py", line 148, in new_forward
output = old_forward(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/sparse.py", line 158, in forward
return F.embedding(
File "/opt/conda/lib/python3.9/site-packages/torch/nn/functional.py", line 2183, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.HalfTensor instead (while checking arguments for embedding)
Expected behavior
Inference should work without any errors.
Since you didn't put your inputs on the GPU, the generation part after the model runs is done on the CPU (Accelerate makes the model return outputs on the same device as the inputs). On CPU, most operations are not supported in float16, which is why you have this error.
You should just put your inputs on the GPU to solve the problem :-)
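A minimal sketch of that suggestion (reusing tokenizer and hg_model from the first snippet): keep the integer token ids and only move the tensors to the GPU.
inputs = tokenizer("Bloomberg has decided to publish a new report on the global economy", return_tensors="pt")
# Keep the integer dtypes; only move the tensors to the first GPU.
inputs = inputs.to("cuda:0")
with torch.no_grad():
    logits = hg_model(**inputs).logits
print(logits)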
@sgugger Thanks for the quick reply. I did try that too:
inputs = tokenizer("Bloomberg has decided to publish a new report on the global economy", return_tensors="pt")
inputs = inputs.to("cuda:0")
inputs["input_ids"] = inputs.input_ids.to(torch.float16)
inputs["attention_mask"] = inputs.attention_mask.to(torch.float16)
print(inputs)
hg_model(**inputs)
{'input_ids': tensor([[ inf, 333., 13896., 1809., 36416., 427., inf, 267., 2084.,
6208., 664., 368., 9328., 42560.]], device='cuda:0',
dtype=torch.float16), 'attention_mask': tensor([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]],
device='cuda:0', dtype=torch.float16)}
Same error
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/accelerate/hooks.py", line 148, in new_forward
output = old_forward(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 999, in forward
transformer_outputs = self.transformer(
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/accelerate/hooks.py", line 148, in new_forward
output = old_forward(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 689, in forward
inputs_embeds = self.word_embeddings(input_ids)
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/accelerate/hooks.py", line 148, in new_forward
output = old_forward(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/sparse.py", line 158, in forward
return F.embedding(
File "/opt/conda/lib/python3.9/site-packages/torch/nn/functional.py", line 2183, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.HalfTensor instead (while checking arguments for embedding)
Am I missing something here?
Full repo script - bloom7b.zip
As the error tells you, the inputs of the model are integers. Why are you converting them to float16?
@sgugger I converted the inputs to float16 because the model throws the following error when the input is int or long (the stack trace can be found in the issue description, first block):
RuntimeError: expected scalar type Half but found Float
In fact, it's not accepting any of the data types:
Long
inputs = tokenizer("Bloomberg has decided to publish a new report on the global economy", return_tensors="pt")
inputs = inputs.to("cuda:0")
inputs["input_ids"] = inputs["input_ids"].to(torch.long)
inputs["attention_mask"] = inputs["attention_mask"].to(torch.long)
throws RuntimeError: expected scalar type Half but found Float
Float
inputs = tokenizer("Bloomberg has decided to publish a new report on the global economy", return_tensors="pt")
inputs = inputs.to("cuda:0")
inputs["input_ids"] = inputs["input_ids"].to(torch.float)
inputs["attention_mask"] = inputs["attention_mask"].to(torch.float)
throws - RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.FloatTensor instead (while checking arguments for embedding)
Int
inputs = tokenizer("Bloomberg has decided to publish a new report on the global economy", return_tensors="pt")
inputs = inputs.to("cuda:0")
inputs["input_ids"] = inputs["input_ids"].to(torch.int)
inputs["attention_mask"] = inputs["attention_mask"].to(torch.int)
throws - RuntimeError: expected scalar type Half but found Float
@sgugger To get more clarity, I tested the model on a p3.8xlarge instance. It has 4 GPUs and more RAM, so the BLOOM 7B model can be loaded without offloading.
Without offloading, inference works as expected.
Only when the model is offloaded do I see the dtype error:
hg_model = BloomForSequenceClassification.from_pretrained(
    hg_checkpoint, device_map="auto", offload_folder="offload", offload_state_dict=True, torch_dtype=torch.float16
)
Could this be a bug, or do we need to perform any intermediate steps before inference when the model is offloaded?
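For comparison, a minimal sketch of the no-offload load used on the p3.8xlarge, assuming the fp16 weights fit across the available GPUs:
# Same call without offload_folder/offload_state_dict: weights are split across the GPUs only.
hg_model = BloomForSequenceClassification.from_pretrained(
    hg_checkpoint, device_map="auto", torch_dtype=torch.float16
)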
I dived a bit into the issue as I didn't understand why you had the bug even with the right input dtype. It turns out it's a bug in Transformers: the randomly initialized classification head is not in float16 as requested.
Note that the model predictions will be crap since there is a randomly initialized head; nevertheless, I'll fix the bug in the coming days :-)
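A quick way to see the mismatch described above, as a hedged sketch (it assumes the queried modules are materialized on a device rather than offloaded to disk):
# The transformer weights honour torch_dtype=torch.float16 as requested...
print(hg_model.transformer.word_embeddings.weight.dtype)  # torch.float16
# ...but the randomly initialized classification head stays in float32,
# which is what triggers "expected scalar type Half but found Float".
print(hg_model.score.weight.dtype)  # torch.float32
# A possible interim workaround (untested here, and only if the head is not offloaded to disk):
# hg_model.score.to(torch.float16)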
Thank you very much for looking into it @sgugger
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
The above PR has been merged, so this should be solved :-)