
RuntimeError: expected scalar type Half but found Float

Open yuyq96 opened this issue 1 year ago • 26 comments

I encountered the following error when trying to train bloom-7b1-mt with peft LoRA in 8-bit + fp16 (torch AMP) mode:

Traceback (most recent call last):
  File "finetune.py", line 141, in <module>
    train(args)
  File "finetune.py", line 133, in train
    trainer.train()
  File "/home/yyq/.conda/envs/transformers/lib/python3.7/site-packages/transformers/trainer.py", line 1638, in train
    ignore_keys_for_eval=ignore_keys_for_eval,
  File "/home/yyq/.conda/envs/transformers/lib/python3.7/site-packages/transformers/trainer.py", line 1903, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/yyq/.conda/envs/transformers/lib/python3.7/site-packages/transformers/trainer.py", line 2660, in training_step
    tmp.backward()
  File "/home/yyq/.conda/envs/transformers/lib/python3.7/site-packages/torch/_tensor.py", line 489, in backward
    self, gradient, retain_graph, create_graph, inputs=inputs
  File "/home/yyq/.conda/envs/transformers/lib/python3.7/site-packages/torch/autograd/__init__.py", line 199, in backward
    allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
  File "/home/yyq/.conda/envs/transformers/lib/python3.7/site-packages/torch/autograd/function.py", line 267, in apply
    return user_fn(self, *args)
  File "/home/yyq/.conda/envs/transformers/lib/python3.7/site-packages/torch/utils/checkpoint.py", line 157, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/home/yyq/.conda/envs/transformers/lib/python3.7/site-packages/torch/autograd/__init__.py", line 199, in backward
    allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
  File "/home/yyq/.conda/envs/transformers/lib/python3.7/site-packages/torch/autograd/function.py", line 267, in apply
    return user_fn(self, *args)
  File "/home/yyq/.conda/envs/transformers/lib/python3.7/site-packages/bitsandbytes/autograd/_functions.py", line 456, in backward
    grad_A = torch.matmul(grad_output, CB).view(ctx.grad_shape).to(ctx.dtype_A)
RuntimeError: expected scalar type Half but found Float

The error does not appear when training llama-7b with exactly the same settings.

Also, it does not appear if I set fp16=False.

model = AutoModel.from_pretrained(
    BLOOM_MODEL_PATH,
    trust_remote_code=True,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map='auto',
)
model = prepare_model_for_int8_training(model)
model = get_peft_model(model, lora_config)
...
trainer = transformers.Trainer(
    args=transformers.TrainingArguments(
        ...
        fp16=False,
        ...
    ),
    ...
)

yuyq96 avatar Mar 30 '23 11:03 yuyq96

I met exactly the same problem, did you fix it? @yuyq96

zhouyu5 avatar Apr 07 '23 06:04 zhouyu5

Not yet, this also happened when I tried to use GPT-J. A temporary solution is to set fp16=False to use int8+fp32 training. @zhouyu5

yuyq96 avatar Apr 07 '23 07:04 yuyq96

> Not yet, this also happened when I tried to use GPT-J. A temporary solution is to set fp16=False to use int8+fp32 training. @zhouyu5

Thanks, but after setting fp16=False the loss becomes 0.0, which is strange. Did you run into this as well? @yuyq96

zhouyu5 avatar Apr 07 '23 07:04 zhouyu5

> Not yet, this also happened when I tried to use GPT-J. A temporary solution is to set fp16=False to use int8+fp32 training. @zhouyu5
>
> Thanks, but after setting fp16=False the loss becomes 0.0, which is strange. Did you run into this as well? @yuyq96

Yes, I met this problem. However, it happens when I use mixed Chinese-English data, but not when I use plain English data. I'm not sure what is causing this.

yuyq96 avatar Apr 07 '23 07:04 yuyq96

> Not yet, this also happened when I tried to use GPT-J. A temporary solution is to set fp16=False to use int8+fp32 training. @zhouyu5
>
> Thanks, but after setting fp16=False the loss becomes 0.0, which is strange. Did you run into this as well? @yuyq96
>
> Yes, I met this problem. However, it happens when I use mixed Chinese-English data, but not when I use plain English data. I'm not sure what is causing this.

Thanks for the input. I still get loss=0 even when I use pure English data. Confusing...

zhouyu5 avatar Apr 07 '23 08:04 zhouyu5

I am experiencing the same issue on my V100 32GB.

chenjohnai avatar Apr 08 '23 14:04 chenjohnai

I have the same error on a V100 32GB. I have tried other GPUs and they seem to work. It would be great if it could work on the V100.

adibMosharrof avatar Apr 25 '23 08:04 adibMosharrof

Same error on v100 16GB when using peft, but it works on RTX3060 12GB.

My-captain avatar May 11 '23 11:05 My-captain

I have the same issue on this Colab training an OPT-6.7b using peft

https://colab.research.google.com/drive/1jCkpikz0J2o20FBQmYmAGdiKmJGOMo-o?usp=sharing

I am using a Tesla P40

I adjusted the training loop with torch.autocast("cuda")

Likewise, I get a lot of 0 training losses, with some reasonable values interspersed:

Step Training Loss
1 0.000000
2 0.000000
3 0.000000
4 2.445800
5 1.594100
6 0.000000
7 0.000000
8 0.000000

psych0v0yager avatar May 20 '23 06:05 psych0v0yager

Same with a V100.

Qualia-Li avatar May 30 '23 17:05 Qualia-Li

I am also facing the same issue with v100

itsaadish avatar Jun 14 '23 17:06 itsaadish

load_in_8bit=False worked for me

Qualia-Li avatar Jun 14 '23 19:06 Qualia-Li

This error occurs when the two matrices you are multiplying are not of the same dtype.

Half means dtype = torch.float16, while Float means dtype = torch.float32.

To resolve the error, simply cast your model weights to float16:

for param in model.parameters():
    # Check if parameter dtype is  Float (float32)
    if param.dtype == torch.float32:
        param.data = param.data.to(torch.float16)
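
For reference, the mismatch can be reproduced outside of bitsandbytes. A minimal sketch (the tensor names mirror the traceback above, but the shapes and which operand ends up in which dtype are only illustrative):

import torch

grad_output = torch.randn(4, 8, device="cuda", dtype=torch.float32)  # e.g. gradients arriving in fp32
CB = torch.randn(8, 8, device="cuda", dtype=torch.float16)           # e.g. a dequantized weight in fp16

# Raises a RuntimeError such as "expected scalar type Half but found Float"
# (the exact wording depends on the PyTorch version), because matmul requires
# both operands to share a dtype.
grad_A = torch.matmul(grad_output, CB)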

RishitToteja avatar Jun 19 '23 06:06 RishitToteja

> load_in_8bit=False worked for me

That's because then you are not using bitsandbytes at all, haha.

zhanshijinwat avatar Jun 27 '23 08:06 zhanshijinwat

Have you found any solution to this if I stick to using the 8-bit model? Thanks!

andotalao24 avatar Jul 04 '23 17:07 andotalao24

Hi!

I've solved it by adding:

with torch.autocast("cuda"): 
    trainer.train()

andrespimartin avatar Jul 05 '23 12:07 andrespimartin

> Hi!
>
> I've solved it by adding:
>
> with torch.autocast("cuda"):
>     trainer.train()

This solves my problem, thanks!

YangRui2015 avatar Jul 08 '23 05:07 YangRui2015

> Hi! I've solved it by adding:
>
> with torch.autocast("cuda"):
>     trainer.train()
>
> This solves my problem, thanks!

I found that adding this may cause a failure to converge when training an 8-bit model, especially when used together with prepare_model_for_kbit_training(model). Just a reminder in case someone else encounters the same issue.

andotalao24 avatar Jul 08 '23 05:07 andotalao24

> with torch.autocast("cuda"):
>     trainer.train()

I did:

# model = prepare_model_for_int8_training(model)  # skipped in favor of the manual cast below
for param in model.parameters():
    # Check if parameter dtype is  Float (float32)
    if param.dtype == torch.float32:
        param.data = param.data.to(torch.float16)

# Verifying the datatypes.
dtypes = {}
for _, p in model.named_parameters():
    dtype = p.dtype
    if dtype not in dtypes:
        dtypes[dtype] = 0
    dtypes[dtype] += p.numel()
total = 0
for k, v in dtypes.items():
    total += v
for k, v in dtypes.items():
    print(k, v, v / total)

....
with torch.autocast("cuda"): 
    trainer.train()

Running the training now, will keep everyone posted.

andysingal avatar Jul 22 '23 09:07 andysingal

> Hi! I've solved it by adding:
>
> with torch.autocast("cuda"):
>     trainer.train()
>
> This solves my problem, thanks!
>
> I found that adding this may cause a failure to converge when training an 8-bit model, especially when used together with prepare_model_for_kbit_training(model). Just a reminder in case someone else encounters the same issue.

You can check the logits and loss; they may have become NaN. I had a similar issue when fine-tuning Baichuan-7B and found it was caused by overflow in fp16. GPUs with the Ampere architecture support bf16, which is far less prone to overflow, so the A100/RTX 30xx don't have the same issue.
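
If your card does support bf16, a minimal sketch of switching the Trainer to it (the TrainingArguments values besides the precision flags are placeholders):

import torch
import transformers

# bf16 requires an Ampere-or-newer GPU (A100, RTX 30xx, ...); V100/P40 report False here.
use_bf16 = torch.cuda.is_bf16_supported()

training_args = transformers.TrainingArguments(
    output_dir="outputs",           # placeholder
    per_device_train_batch_size=4,  # placeholder
    bf16=use_bf16,                  # bf16 has fp32's dynamic range, so it overflows far less often
    fp16=not use_bf16,              # otherwise fall back to fp16 mixed precision
)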

ParaNoth avatar Jul 26 '23 03:07 ParaNoth

> for param in model.parameters():
>     # Check if parameter dtype is Float (float32)
>     if param.dtype == torch.float32:
>         param.data = param.data.to(torch.float16)

Changing the param's datatype manually helped me. Training has started :)

lathashree01 avatar Jul 30 '23 15:07 lathashree01

> This error occurs when the two matrices you are multiplying are not of the same dtype.
>
> Half means dtype = torch.float16, while Float means dtype = torch.float32.
>
> To resolve the error, simply cast your model weights to float16:
>
> for param in model.parameters():
>     # Check if parameter dtype is Float (float32)
>     if param.dtype == torch.float32:
>         param.data = param.data.to(torch.float16)

Thanks! But isn't the whole point of prepare_model_for_int8_training to cast the non-int8 parameters to fp32? (See its definition: https://github.com/huggingface/peft/blob/main/src/peft/utils/other.py#L79) I wonder whether these parameters should be kept in fp32 for correct training. Did you notice any difference when training entirely in fp16?

don-tpanic avatar Aug 09 '23 18:08 don-tpanic

> This error occurs when the two matrices you are multiplying are not of the same dtype.
>
> Half means dtype = torch.float16, while Float means dtype = torch.float32.
>
> To resolve the error, simply cast your model weights to float16:
>
> for param in model.parameters():
>     # Check if parameter dtype is Float (float32)
>     if param.dtype == torch.float32:
>         param.data = param.data.to(torch.float16)

Solved.

Yun-Peng-Wang avatar Oct 13 '23 15:10 Yun-Peng-Wang

I was using peft with the Whisper model and got this issue.

with torch.autocast("cuda"): 
    trainer.train()

solved the issue.

I guess AMP uses lower precision (like float16, also known as half precision) for certain operations that are less sensitive to precision, while keeping others (like the loss calculation) in higher precision (float32). The torch.autocast context manager enables AMP for the operations inside its block.
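
A minimal sketch of that behaviour, and of why the wrapper fixes the mismatch in this thread (tensor names and shapes are illustrative):

import torch

w = torch.randn(8, 8, device="cuda", dtype=torch.float32)  # e.g. a weight kept at full precision
x = torch.randn(4, 8, device="cuda", dtype=torch.float16)  # fp16 activations

# Outside autocast this mixed-dtype matmul raises the RuntimeError from this issue.
with torch.autocast("cuda"):
    y = x @ w.t()     # autocast casts both operands to float16 before the matmul
    print(y.dtype)    # torch.float16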

monk1337 avatar Dec 17 '23 04:12 monk1337

I am using TRL's SFTTrainer to train OPT (loaded in int8 + LoRA), and seeing the same issue. While

with torch.autocast("cuda"): 
    trainer.train()

solves the precision-mismatch error, it causes the loss to become 0.

chenmoneygithub avatar Dec 24 '23 05:12 chenmoneygithub

I have an OPT model quantized to 8-bit and trained with IA3 in 16-bit, and I hit this issue during inference.

G-M-twostay avatar Mar 06 '24 01:03 G-M-twostay