train_text_to_image_lora.py raise ValueError("Attempting to unscale FP16 gradients.")

billvsme opened this issue 1 year ago • 10 comments

Describe the bug

While going through the examples/text_to_image documentation, I ran train_text_to_image_lora.py following the example command it gives, but the run fails with raise ValueError("Attempting to unscale FP16 gradients.").

I found that the cause of the error may be related to the code below. It uses args.mixed_precision to decide whether to convert the LoRA parameters to float32, but args.mixed_precision defaults to None. Following the example in the README, mixed_precision is set on accelerate launch but not as a script argument, so args.mixed_precision stays None, the upcast is skipped, and the "Attempting to unscale FP16 gradients." error is raised. https://github.com/huggingface/diffusers/blob/1fff527702399165f09dd880be43cfd8b8bae472/examples/text_to_image/train_text_to_image_lora.py#L468-L472

It might be a better choice to use accelerator.mixed_precision here instead.
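
For context, the error itself comes from PyTorch's GradScaler, which refuses to unscale fp16 gradients. A minimal self-contained sketch of the failure mode (hypothetical toy parameter, CUDA required; not the training script itself):

import torch

# When the fp32 upcast is skipped, the trainable (LoRA) weights stay in fp16,
# so their gradients are fp16 too, and GradScaler refuses to unscale them.
param = torch.nn.Parameter(torch.zeros(4, device="cuda", dtype=torch.float16))
opt = torch.optim.SGD([param], lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

loss = (param * 2.0).sum()
scaler.scale(loss).backward()
scaler.unscale_(opt)  # ValueError: Attempting to unscale FP16 gradients.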

Reproduction

cd diffusers/examples/text_to_image/

accelerate launch --mixed_precision="fp16" train_text_to_image_lora.py \
  --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
  --dataset_name="lambdalabs/pokemon-blip-captions" --caption_column="text" \
  --resolution=512 --random_flip \
  --train_batch_size=1 \
  --num_train_epochs=100 --checkpointing_steps=5000 \
  --learning_rate=1e-04 --lr_scheduler="constant" --lr_warmup_steps=0 \
  --seed=42 \
  --output_dir="sd-pokemon-model-lora" \
  --validation_prompt="cute dragon creature"

Logs

Steps:   0%|                                          | 0/20900 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/content/diffusers/examples/text_to_image/train_text_to_image_lora.py", line 945, in <module>
    main()
  File "/content/diffusers/examples/text_to_image/train_text_to_image_lora.py", line 774, in main
    accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)
  File "/home/billvsme/venv/train/lib/python3.10/site-packages/accelerate/accelerator.py", line 2040, in clip_grad_norm_
    self.unscale_gradients()
  File "/home/billvsme/venv/train/lib/python3.10/site-packages/accelerate/accelerator.py", line 2003, in unscale_gradients
    self.scaler.unscale_(opt)
  File "/home/billvsme/venv/train/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 307, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(
  File "/home/billvsme/venv/train/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 229, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.

System Info

  • diffusers version: 0.25.0.dev0
  • Platform: Linux-6.2.0-39-generic-x86_64-with-glibc2.35
  • Python version: 3.10.13
  • PyTorch version (GPU?): 2.1.2+cu121 (True)
  • Huggingface_hub version: 0.19.4
  • Transformers version: 4.36.2
  • Accelerate version: 0.25.0
  • xFormers version: 0.0.22.post7
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

@sayakpaul

billvsme avatar Dec 27 '23 19:12 billvsme

A better way would be to assign args.mixed_precision from accelerator.mixed_precision.

However, when you initialize an Accelerator object you pass the value from args.mixed_precision itself:

https://github.com/huggingface/diffusers/blob/1fff527702399165f09dd880be43cfd8b8bae472/examples/text_to_image/train_text_to_image_lora.py#L385

So, passing mixed_precision to your CLI args is recommended.
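
To illustrate the ordering, a paraphrased sketch of the relevant lines (an approximation, not the verbatim source; args stands for the parsed CLI namespace):

from accelerate import Accelerator

# args.mixed_precision defaults to None when precision is set only on the launcher.
accelerator = Accelerator(mixed_precision=args.mixed_precision)

# After construction, accelerator.mixed_precision reflects the effective setting
# (e.g. "fp16" from `accelerate launch --mixed_precision="fp16"`), so syncing
# the arg back makes the later fp32-upcast gate see the right value:
args.mixed_precision = accelerator.mixed_precision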

sayakpaul avatar Dec 28 '23 01:12 sayakpaul

@sayakpaul 👌, thanks.

But I found a difference between train_text_to_image.py and train_text_to_image_lora.py: the LoRA script doesn't reassign args.mixed_precision. As a result, if you pass --mixed_precision="fp16" to accelerate launch, you also have to add the same --mixed_precision="fp16" to the script's own CLI args, or the error occurs. Like this:

accelerate launch --mixed_precision="fp16" train_text_to_image_lora.py \
  --mixed_precision="fp16" \
  ......

train_text_to_image.py: https://github.com/huggingface/diffusers/blob/1fff527702399165f09dd880be43cfd8b8bae472/examples/text_to_image/train_text_to_image.py#L811-L816

train_text_to_image_lora.py: https://github.com/huggingface/diffusers/blob/1fff527702399165f09dd880be43cfd8b8bae472/examples/text_to_image/train_text_to_image_lora.py#L444-L448
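
For reference, a paraphrased sketch of the difference at those two links (an approximation, not the verbatim source):

# train_text_to_image.py (sketch): syncs the CLI arg back from the accelerator
if accelerator.mixed_precision == "fp16":
    weight_dtype = torch.float16
    args.mixed_precision = accelerator.mixed_precision  # reassignment happens here

# train_text_to_image_lora.py (sketch): sets only the dtype, so
# args.mixed_precision keeps its default (None) when the precision
# comes from the launcher flag alone
if accelerator.mixed_precision == "fp16":
    weight_dtype = torch.float16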

billvsme avatar Dec 28 '23 02:12 billvsme

Maybe the example in the docs needs to be updated:

https://github.com/huggingface/diffusers/tree/main/examples/text_to_image

[Screenshot: 2023-12-28 10:24]

billvsme avatar Dec 28 '23 02:12 billvsme

Should be fixed with: https://github.com/huggingface/diffusers/issues/6388. Could you pull the changes and try again? :)

sayakpaul avatar Jan 05 '24 02:01 sayakpaul

Hi @sayakpaul, the problem with running train_text_to_image_lora.py still persists for me. I have pulled the latest changes from the GitHub repo.

AfrinaVT avatar Jan 29 '24 14:01 AfrinaVT

Could you maybe refer to https://github.com/huggingface/diffusers/issues/6552 and open a PR?

sayakpaul avatar Jan 29 '24 14:01 sayakpaul

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Feb 22 '24 15:02 github-actions[bot]

can we close this one now?

yiyixuxu avatar Feb 23 '24 21:02 yiyixuxu

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Mar 27 '24 15:03 github-actions[bot]

I encountered the same issue on diffusers==0.30.0.dev0. The additional CLI arg works on this version as well.

blueclowd avatar Jun 30 '24 07:06 blueclowd

Just encountered this issue. Not stale.

lino-levan avatar Jul 10 '24 01:07 lino-levan

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Oct 09 '24 15:10 github-actions[bot]

Assuming it's a duplicate of https://github.com/huggingface/diffusers/issues/8871 I am closing this.

sayakpaul avatar Oct 09 '24 15:10 sayakpaul

Hi, I also encountered the same issue. The workaround is either to specify --mixed_precision="fp16" on the command line or to add args.mixed_precision = accelerator.mixed_precision to the dtype-selection block:

if accelerator.mixed_precision == "fp16":
    weight_dtype = torch.float16
    args.mixed_precision = accelerator.mixed_precision
elif accelerator.mixed_precision == "bf16":
    weight_dtype = torch.bfloat16
    args.mixed_precision = accelerator.mixed_precision

I hope scripts such as train_text_to_image.py and train_text_to_image_lora.py can be kept in sync; that would be much friendlier for newcomers.

YuyangXueEd avatar Oct 11 '24 10:10 YuyangXueEd

Encountered it even with the flag and the new code, but after resuming from a checkpoint it works. Please help.

KreakxX avatar Apr 11 '25 14:04 KreakxX

This is the script I am running:

accelerate launch --mixed_precision="fp16" train_dreambooth_lora.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-2-1" \
  --instance_data_dir="C:/Users/Henri/logo-generator/Dataset/images" \
  --output_dir="C:/Users/Henri/logo-lora-output" \
  --resolution=512 \
  --train_batch_size=1 \
  --mixed_precision="fp16" \
  --gradient_accumulation_steps=1 \
  --learning_rate=3e-5 --lr_scheduler="constant" --lr_warmup_steps=0 \
  --max_train_steps=2500 --checkpointing_steps=750 \
  --validation_prompt="a henrikstyle logo for a tech company, minimal flat design, white background" \
  --instance_prompt="a logo in henrikstyle" \
  --seed=42 \
  --gradient_checkpointing

KreakxX avatar Apr 11 '25 14:04 KreakxX