modules_to_save: "ValueError: Attempting to unscale FP16 gradients"

jonathanasdf opened this issue Apr 20 '23 · 8 comments

I'm trying to finetune llama with some expanded tokens using resize_token_embeddings() and passing modules_to_save=['embed_tokens', 'lm_head'], but it seems there is some misconfiguration

Traceback (most recent call last):
  File "/home/jonathanasdf/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1662, in train
    return inner_training_loop(
  File "/home/jonathanasdf/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1962, in _inner_training_loop
    self.scaler.unscale_(self.optimizer)
  File "/home/jonathanasdf/.local/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 284, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(optimizer, inv_scale, found_inf, False)
  File "/home/jonathanasdf/.local/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 212, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
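For reference, a minimal sketch of roughly this kind of setup (the model name, added tokens, and LoRA hyperparameters below are illustrative placeholders, not taken from the original report):

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "llama-base-checkpoint"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.add_tokens(["<extra_0>", "<extra_1>"])  # the expanded tokens

model = AutoModelForCausalLM.from_pretrained(base, load_in_8bit=True)
model.resize_token_embeddings(len(tokenizer))

config = LoraConfig(
    task_type="CAUSAL_LM",
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    modules_to_save=["embed_tokens", "lm_head"],
)
model = get_peft_model(model, config)
# Training this with transformers.Trainer and TrainingArguments(fp16=True)
# then raises "ValueError: Attempting to unscale FP16 gradients."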

jonathanasdf avatar Apr 20 '23 00:04 jonathanasdf

So I think the problem above is some incompatibility between int8 and modules_to_save. Using float16 instead of int8 is fine.

But actually, it seems that after https://github.com/huggingface/peft/commit/c21afbe868734c0af8bd4577c4c7acdf366b96d1 setting modules_to_save for LoraConfig CAUSAL_LM doesn't do anything anymore. I had to revert to before that commit for the setting to make the modules trainable and saved in the checkpoint.

jonathanasdf avatar Apr 20 '23 09:04 jonathanasdf

Yes, with an older version of peft I hit the same error as you. But if I upgrade peft to the latest version (0.3.0.dev0 right now), I can't save my LoRA weights: the saved adapter unexpectedly shrinks to 443 bytes. https://github.com/oobabooga/text-generation-webui/issues/1270#issue-1669857486

qiguanqiang avatar Apr 23 '23 09:04 qiguanqiang

I am having the same problem here, and it is still not solved even though I installed peft from the up-to-date repo. Just like the author of this issue, I'm trying to finetune llama with some expanded tokens using resize_token_embeddings() and passing modules_to_save=['embed_tokens', 'lm_head']. Here is what I am seeing:

  1. The problem goes away when I delete 'modules_to_save' from LoraConfig.
  2. The problem goes away if I cast the model to torch.float32 and set fp16=False in the trainer arguments.
  3. Switching load_in_8bit between False and True, or low_cpu_mem_usage between True and False, when loading the base llama model does not help.
  4. When I delete 'modules_to_save' from LoraConfig, the PeftModel object has both torch.float32 and torch.float16 parameters, but this weird fact does not lead to an error. However, if I set modules_to_save=['embed_tokens', 'lm_head'], then even after using .half() to cast all parameters of the PeftModel to torch.float16, with fp16=True in the trainer, the "ValueError: Attempting to unscale FP16 gradients" still happens. @pacman100 Please review this issue, thank you~

My environment: torch==1.13.1, transformers==4.29.0.dev0, peft==0.3.0.dev0, CUDA 11.7, GPU: Tesla A100 80G.
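For reference, a quick way to check which parameters are trainable and what dtype they have (a generic sketch, not specific to this exact setup):

for name, param in model.named_parameters():
    if param.requires_grad:
        # Any fp16 entry printed here is what the GradScaler later refuses to unscale.
        print(name, param.dtype)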

DavidVillaHMD avatar Apr 23 '23 19:04 DavidVillaHMD

New idea: the training finally works now. Setting fp16=False makes training super slow and not memory-friendly.

To avoid "ValueError: Attempting to unscale FP16 gradients", just make sure every trainable parameter is of type torch.float32. In my case:

model.base_model.model.model.embed_tokens.weight.data = model.base_model.model.model.embed_tokens.weight.data.float()
model.base_model.model.lm_head.weight.data = model.base_model.model.lm_head.weight.data.float()

It seems like a bug on the PyTorch side.

DavidVillaHMD avatar Apr 24 '23 06:04 DavidVillaHMD

So clever, dude. Thanks for your idea!

qiguanqiang avatar Apr 24 '23 06:04 qiguanqiang

I have the same problem.

iMountTai avatar Apr 24 '23 08:04 iMountTai

When I try this fix, I get the following error:

model.base_model.model.model.embed_tokens.weight.data.float()
  File "/project/msi290_uksr/generative_tod/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1614, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'GPTJForCausalLM' object has no attribute 'model'

Could you show an example of where you actually wrote those two lines?

adibMosharrof avatar Apr 25 '23 09:04 adibMosharrof

Right after I called PeftModel.from_pretrained, which combines the LoRA weights with the original model's parameters.
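For instance, a hedged sketch of that ordering (the attribute paths are for LLaMA-style models; GPT-J keeps its embedding under .transformer.wte rather than .model.embed_tokens, which is why the lookup above raised AttributeError; base_model_name and adapter_path are placeholders):

import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(base_model_name, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, adapter_path)

# Cast the trainable embedding/lm_head copies to fp32 right after loading, before training starts.
model.base_model.model.model.embed_tokens.weight.data = model.base_model.model.model.embed_tokens.weight.data.float()
model.base_model.model.lm_head.weight.data = model.base_model.model.lm_head.weight.data.float()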

DavidVillaHMD avatar May 06 '23 07:05 DavidVillaHMD

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

github-actions[bot] avatar May 31 '23 15:05 github-actions[bot]

I didn't use llama, but I got the same error when fine-tuning BLIP with LoRA. I checked the dtype of the parameters in all the layers and they were all float16. My solution was to change the bias option in LoraConfig from "lora_only" to "none".

config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",  # <--- replaces "lora_only"
    target_modules=["query", "key", "value", "qkv", "text_decoder.cls.predictions.decoder"],
)

Hangsiin avatar Sep 23 '23 12:09 Hangsiin

Any updates on this issue? Still seeing this bug

hengjiUSTC avatar Jan 07 '24 13:01 hengjiUSTC

Which one exactly do you mean? Note that for the case of loading the model in float16, you have to follow the advice given above.

A snippet that should work a little bit more generally:

for param in model.parameters():
    if param.requires_grad:
        param.data = param.data.to(torch.float32)

BenjaminBossan avatar Jan 09 '24 13:01 BenjaminBossan

Here is my test case: training breaks with raise ValueError("Attempting to unscale FP16 gradients.") under the config below.

model = AutoModelForCausalLM.from_pretrained(
    ...
    torch_dtype=torch.float16,
)
training_args = TrainingArguments(
    fp16=True,
    ...
)
peft_config = LoraConfig(
    ...
    modules_to_save=["embed_tokens", "lm_head"],
)

No error for the cases below:

model = AutoModelForCausalLM.from_pretrained(
    ...
    torch_dtype=torch.float16,
)
training_args = TrainingArguments(
    fp16=True,
    ...
)
peft_config = LoraConfig(
    ...
    modules_to_save=None,
)

model = AutoModelForCausalLM.from_pretrained(
    ...
    torch_dtype=torch.float32,
)
training_args = TrainingArguments(
    fp16=True,
    ...
)
peft_config = LoraConfig(
    ...
    modules_to_save=["embed_tokens", "lm_head"],
)

model = AutoModelForCausalLM.from_pretrained(
    ...
    torch_dtype=torch.float16,
)
training_args = TrainingArguments(
    fp16=False,
    ...
)
peft_config = LoraConfig(
    ...
    modules_to_save=["embed_tokens", "lm_head"],
)

I am confused about how to understand the relation between torch_dtype, fp16, and modules_to_save.

hengjiUSTC avatar Jan 09 '24 13:01 hengjiUSTC

In general, when you want to use mixed precision (i.e. fp16=True), the weights to be trained should not be loaded as float16. This is always true and has nothing to do with PEFT. When you add modules_to_save with PEFT, it means that new layers are added to the model that will be trained. These layers are copies of the layers they are supposed to replace, so they use fp16 too. Therefore, you get the error. Does this clarify why you see these results?

I will add an entry to our docs via PR #1336 that should hopefully make this issue easier to debug in the future.
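To make this concrete, a minimal sketch of the failing combination together with the fix of casting only the trainable parameters up to fp32 (the model name and LoRA settings are illustrative placeholders, not the reporter's exact config):

import torch
from transformers import AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("some-causal-lm", torch_dtype=torch.float16)
config = LoraConfig(
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],
    modules_to_save=["embed_tokens", "lm_head"],
)
model = get_peft_model(model, config)

# The embed_tokens/lm_head copies created by modules_to_save are still fp16 here.
# Cast every trainable parameter to fp32 so AMP (fp16=True) can unscale its gradients:
for param in model.parameters():
    if param.requires_grad:
        param.data = param.data.to(torch.float32)

training_args = TrainingArguments(output_dir="out", fp16=True)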

BenjaminBossan avatar Jan 09 '24 14:01 BenjaminBossan

Thanks so much for clarifying! I now understand that if the model is loaded in float16, the copied layers will cause an error during training.

Just one naive follow-up: when modules_to_save is None and the original model is loaded in float16, what is the default weight dtype for the LoRA layers? (fp16=True should mean a computation dtype of float16, if I understand correctly.)

hengjiUSTC avatar Jan 09 '24 14:01 hengjiUSTC

In general, the LoRA layers use the same dtype for their parameters as the original layers, but there can be exceptions (e.g. when using prepare_model_for_kbit_training).
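One way to check this on a concrete model (a generic sketch; lora_A is PEFT's standard name for the LoRA down-projection parameters):

for name, module in model.named_modules():
    if hasattr(module, "lora_A"):
        # Dtype of the LoRA A weights for this layer; it usually matches the wrapped base layer.
        print(name, next(module.lora_A.parameters()).dtype)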

BenjaminBossan avatar Jan 09 '24 15:01 BenjaminBossan

Previously I tested loading the model in float16 with fp16 enabled and modules_to_save=None. If the LoRA layers inherit float16 from the model dtype, shouldn't I expect to see a similar error message? But training actually finished successfully.

hengjiUSTC avatar Jan 09 '24 16:01 hengjiUSTC

Interesting. Not sure what exactly happened there, but here is a test that shows that trying to load a model in fp16 and training it with AMP results in the error:

https://github.com/huggingface/peft/pull/1336/files#diff-037b6550b3d063119b60801a751c3c5c97eab8e912ae8370c2dd5569d38305adR1226

BenjaminBossan avatar Jan 09 '24 16:01 BenjaminBossan

I'll put together a demo tomorrow.

hengjiUSTC avatar Jan 09 '24 17:01 hengjiUSTC

I have a very simple script, https://github.com/hengjiUSTC/learn-llm/blob/demo_crash/trl_finetune.py, run with python3 trl_finetune.py --config configs/demo-crash.yml. In the configuration:

  1. fp16 is true: https://github.com/hengjiUSTC/learn-llm/blob/demo_crash/configs/demo_crash.yml#L29
  2. the model is loaded in float16: https://github.com/hengjiUSTC/learn-llm/blob/demo_crash/configs/demo_crash.yml#L6C18-L6C18

Run 1: when line https://github.com/hengjiUSTC/learn-llm/blob/demo_crash/trl_finetune.py#L388 is commented out (modules_to_save is None), the training process runs correctly (see screenshots).

Run 2: when that line is added back (modules_to_save=["embed_tokens", "lm_head"]), the error is raised (see screenshots).

So shouldn't Run 1 also have failed, if the LoRA weights inherit float16 from the model dtype?

hengjiUSTC avatar Jan 10 '24 02:01 hengjiUSTC

Thanks for providing an example. I tried it (using opt) and it crashed even with modules_to_save=None. Checking the dtypes of the learnable parameters, they are fp16, so the crash is expected. Not sure what the source of the difference is, but either way, I think it's safe to say that when loading in fp16, it's best to cast the trainable weights to fp32. PR #1318 will introduce a convenience function cast_non_trainable_to_dtype to do this quickly.

BenjaminBossan avatar Jan 10 '24 14:01 BenjaminBossan

Got it, thanks for checking.

hengjiUSTC avatar Jan 10 '24 15:01 hengjiUSTC