RuntimeError: expected scalar type Half but found Float
I ran into the following error when trying to train bloom-7b1-mt with PEFT LoRA in 8-bit + fp16 (torch AMP) mode:
Traceback (most recent call last):
File "finetune.py", line 141, in <module>
train(args)
File "finetune.py", line 133, in train
trainer.train()
File "/home/yyq/.conda/envs/transformers/lib/python3.7/site-packages/transformers/trainer.py", line 1638, in train
ignore_keys_for_eval=ignore_keys_for_eval,
File "/home/yyq/.conda/envs/transformers/lib/python3.7/site-packages/transformers/trainer.py", line 1903, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/yyq/.conda/envs/transformers/lib/python3.7/site-packages/transformers/trainer.py", line 2660, in training_step
tmp.backward()
File "/home/yyq/.conda/envs/transformers/lib/python3.7/site-packages/torch/_tensor.py", line 489, in backward
self, gradient, retain_graph, create_graph, inputs=inputs
File "/home/yyq/.conda/envs/transformers/lib/python3.7/site-packages/torch/autograd/__init__.py", line 199, in backward
allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
File "/home/yyq/.conda/envs/transformers/lib/python3.7/site-packages/torch/autograd/function.py", line 267, in apply
return user_fn(self, *args)
File "/home/yyq/.conda/envs/transformers/lib/python3.7/site-packages/torch/utils/checkpoint.py", line 157, in backward
torch.autograd.backward(outputs_with_grad, args_with_grad)
File "/home/yyq/.conda/envs/transformers/lib/python3.7/site-packages/torch/autograd/__init__.py", line 199, in backward
allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
File "/home/yyq/.conda/envs/transformers/lib/python3.7/site-packages/torch/autograd/function.py", line 267, in apply
return user_fn(self, *args)
File "/home/yyq/.conda/envs/transformers/lib/python3.7/site-packages/bitsandbytes/autograd/_functions.py", line 456, in backward
grad_A = torch.matmul(grad_output, CB).view(ctx.grad_shape).to(ctx.dtype_A)
RuntimeError: expected scalar type Half but found Float
This error does not appear when training llama-7b with exactly the same settings, and it also disappears if I set fp16=False.
model = AutoModel.from_pretrained(
    BLOOM_MODEL_PATH,
    trust_remote_code=True,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map='auto',
)
model = prepare_model_for_int8_training(model)
model = get_peft_model(model, lora_config)
...
trainer = transformers.Trainer(
    args=transformers.TrainingArguments(
        ...
        fp16=False,
        ...
    ),
    ...
)
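For reference, the elided lora_config would typically be a standard PEFT LoraConfig along these lines (the values here are illustrative, not necessarily the exact ones used above):

from peft import LoraConfig

# Illustrative LoRA settings (hypothetical values, shown only to make the setup concrete)
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["query_key_value"],  # BLOOM's fused attention projection
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)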
I met exactly the same problem, did you fix it? @yuyq96
Not yet, this also happened when I tried to use GPT-J. A temporary solution is to set fp16=False to use int8+fp32 training. @zhouyu5
Thanks, but after setting fp16=False the loss becomes 0.0, which is strange. Did you run into this situation? @yuyq96
Yes, I met this problem. However, it happens when I use mixed Chinese-English data, but not when I use plain English data. I'm not sure what is causing this.
Thanks for the input. I still got loss=0 even if I use pure English data. Confusing...
Same problem here, on a V100 32GB.
I have the same error on a V100 32GB. I have tried other GPUs and they seem to work. It would be great if it could work on the V100.
Same error on a V100 16GB when using PEFT, but it works on an RTX 3060 12GB.
I have the same issue in this Colab, training OPT-6.7b with PEFT:
https://colab.research.google.com/drive/1jCkpikz0J2o20FBQmYmAGdiKmJGOMo-o?usp=sharing
I am using a Tesla P40.
I wrapped the training loop in torch.autocast("cuda").
Likewise, I get a lot of 0 training losses, with some accurate ones interspersed:
Step | Training Loss |
---|---|
1 | 0.000000 |
2 | 0.000000 |
3 | 0.000000 |
4 | 2.445800 |
5 | 1.594100 |
6 | 0.000000 |
7 | 0.000000 |
8 | 0.000000 |
Same with a V100.
I am also facing the same issue with a V100.
load_in_8bit=False worked for me
This error occurs when the two matrices you are multiplying do not have the same dtype. Half means dtype=torch.float16, while Float means dtype=torch.float32.
To resolve the error, simply cast your model weights to float16:
for param in model.parameters():
    # Check if the parameter dtype is Float (float32)
    if param.dtype == torch.float32:
        param.data = param.data.to(torch.float16)
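As a quick illustration of the underlying cause, a plain matmul between mismatched dtypes reproduces the same kind of error outside of bitsandbytes (the exact wording may vary by PyTorch version):

import torch

a = torch.randn(4, 4, dtype=torch.float16, device="cuda")  # Half
b = torch.randn(4, 4, dtype=torch.float32, device="cuda")  # Float

try:
    torch.matmul(a, b)  # mixed-dtype matmul is not allowed
except RuntimeError as e:
    print(e)  # e.g. "expected scalar type Half but found Float"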
That works because with load_in_8bit=False you are not using bitsandbytes at all, haha.
Have you found any solution to this if I stick to using the 8-bit model? Thanks!
Hi!
I've solved it by adding:
with torch.autocast("cuda"):
    trainer.train()
This solves my problem, thanks!
I found that wrapping trainer.train() in torch.autocast("cuda") may cause convergence failure when training an 8-bit model, especially when used together with prepare_model_for_kbit_training(model). Just a reminder in case someone else encounters the same issue.
I did:

# model = prepare_model_for_int8_training(model)  # left commented out
for param in model.parameters():
    # Check if the parameter dtype is Float (float32)
    if param.dtype == torch.float32:
        param.data = param.data.to(torch.float16)

# Verifying the datatypes.
dtypes = {}
for _, p in model.named_parameters():
    dtype = p.dtype
    if dtype not in dtypes:
        dtypes[dtype] = 0
    dtypes[dtype] += p.numel()
total = 0
for k, v in dtypes.items():
    total += v
for k, v in dtypes.items():
    print(k, v, v / total)

...

with torch.autocast("cuda"):
    trainer.train()

Running the training now, will keep everyone posted.
Regarding the zero loss: check the logits and loss, they may have become nan. I had a similar issue when fine-tuning baichuan7B and found it was caused by overflow in fp16. GPUs with the Ampere architecture or newer support bf16, which is less prone to overflow, so A100/RTX 30xx cards don't have the same issue.
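If your GPU does support bf16, one possible workaround (a sketch, untested here; "output" is just a placeholder output directory) is to switch the mixed-precision mode:

import torch
import transformers

# V100 (Volta) returns False here; A100 / RTX 30xx (Ampere and newer) return True.
if torch.cuda.is_bf16_supported():
    args = transformers.TrainingArguments(output_dir="output", bf16=True)
else:
    args = transformers.TrainingArguments(output_dir="output", fp16=False)  # fall back to int8+fp32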
Changing the parameters' dtype manually (casting the float32 weights to float16 as suggested above) helped me. Training has started :)
Thanks! But isn't the whole point of prepare_model_for_int8_training to cast the non-int8 parameters to fp32? (See its definition: https://github.com/huggingface/peft/blob/main/src/peft/utils/other.py#L79) I wonder if these parameters should be kept in fp32 for correct training. Did you notice any difference training completely in fp16?
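A quick way to see what is actually left in fp32 after preparation (a rough check, assuming model was loaded with load_in_8bit=True and prepared as above):

from collections import Counter
import torch

# Count parameters per dtype: the quantized linear weights show up as int8,
# while the upcast layers (and any LoRA adapters) are float32.
print(Counter(p.dtype for p in model.parameters()))

# List the float32 parameters by name to see exactly what was upcast.
for name, p in model.named_parameters():
    if p.dtype == torch.float32:
        print(name)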
Casting the float32 weights to float16 as suggested above solved it for me.
I was using PEFT with the Whisper model and got this issue.
with torch.autocast("cuda"):
    trainer.train()
solved it.
I guess AMP uses lower precision (float16, also known as half precision) for operations that are less sensitive to precision, while keeping others (like the loss calculation) in higher precision (float32). The torch.autocast context manager enables AMP for the operations inside its block.
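A tiny demo of that behavior (assumes a CUDA device is available): the inputs stay float32, but ops that autocast to half precision, like matmul, produce float16 outputs inside the block.

import torch

a = torch.randn(8, 8, device="cuda")  # float32 inputs
b = torch.randn(8, 8, device="cuda")

with torch.autocast("cuda"):
    c = a @ b  # matmul is autocast to half precision

print(c.dtype)  # torch.float16
print(a.dtype)  # torch.float32 -- the inputs themselves are unchanged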
I am using TRL's SFTTrainer to train OPT (loaded in int8 + LoRA) and I am seeing the same issue. While
with torch.autocast("cuda"):
    trainer.train()
solves the precision mismatch, it causes the loss to become 0.
I have an OPT model quantized to 8-bit and trained with IA3 in 16-bit, and I hit this issue at inference time.
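In case it helps, the same autocast wrapper may work around the mismatch at inference time as well (untested; a sketch where model, tokenizer, and the prompt are placeholders for your own 8-bit + IA3 setup):

import torch

inputs = tokenizer("Hello, world", return_tensors="pt").to(model.device)

with torch.autocast("cuda"):
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=20)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))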