train_text_to_image_lora.py raise ValueError("Attempting to unscale FP16 gradients.")
Describe the bug
While going through the examples/text_to_image documentation, I experimented with train_text_to_image_lora.py following the examples there. However, the run failed with a raise ValueError("Attempting to unscale FP16 gradients.") error.
I found that the cause of the error may be the code below. It uses args.mixed_precision to decide whether to cast the LoRA parameters to float32, but args.mixed_precision defaults to None. Following the README example, mixed precision is set only through the accelerate launch flag and not through the script's own --mixed_precision argument, so the cast is skipped and the run fails with "Attempting to unscale FP16 gradients."
https://github.com/huggingface/diffusers/blob/1fff527702399165f09dd880be43cfd8b8bae472/examples/text_to_image/train_text_to_image_lora.py#L468-L472
It might be a better choice to change this check to use accelerator.mixed_precision instead.
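For illustration, here is a rough sketch of what I mean, based on my reading of the linked lines (args, accelerator, and unet are the script's existing objects; the exact code may differ slightly):

```python
import torch

# Current check (roughly): relies on the script's own --mixed_precision flag,
# which defaults to None, so the LoRA parameters are never upcast to fp32.
if args.mixed_precision == "fp16":
    for param in unet.parameters():
        # only upcast trainable (LoRA) parameters into fp32
        if param.requires_grad:
            param.data = param.to(torch.float32)

# Suggested check: use the value the Accelerator actually resolved, which also
# reflects `accelerate launch --mixed_precision="fp16"`.
if accelerator.mixed_precision == "fp16":
    for param in unet.parameters():
        if param.requires_grad:
            param.data = param.to(torch.float32)
```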
Reproduction
cd diffusers/examples/text_to_image/
accelerate launch --mixed_precision="fp16" train_text_to_image_lora.py \
--pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
--dataset_name="lambdalabs/pokemon-blip-captions" --caption_column="text" \
--resolution=512 --random_flip \
--train_batch_size=1 \
--num_train_epochs=100 --checkpointing_steps=5000 \
--learning_rate=1e-04 --lr_scheduler="constant" --lr_warmup_steps=0 \
--seed=42 \
--output_dir="sd-pokemon-model-lora" \
--validation_prompt="cute dragon creature"
Logs
Steps:   0%|          | 0/20900 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/content/diffusers/examples/text_to_image/train_text_to_image_lora.py", line 945, in <module>
    main()
  File "/content/diffusers/examples/text_to_image/train_text_to_image_lora.py", line 774, in main
    accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)
  File "/home/billvsme/venv/train/lib/python3.10/site-packages/accelerate/accelerator.py", line 2040, in clip_grad_norm_
    self.unscale_gradients()
  File "/home/billvsme/venv/train/lib/python3.10/site-packages/accelerate/accelerator.py", line 2003, in unscale_gradients
    self.scaler.unscale_(opt)
  File "/home/billvsme/venv/train/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 307, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(
  File "/home/billvsme/venv/train/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 229, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
(the same traceback is printed, interleaved, by each of the parallel processes)
System Info
- diffusers version: 0.25.0.dev0
- Platform: Linux-6.2.0-39-generic-x86_64-with-glibc2.35
- Python version: 3.10.13
- PyTorch version (GPU?): 2.1.2+cu121 (True)
- Huggingface_hub version: 0.19.4
- Transformers version: 4.36.2
- Accelerate version: 0.25.0
- xFormers version: 0.0.22.post7
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Who can help?
@sayakpaul
A better way would be to assign args.mixed_precision from accelerator.mixed_precision.
However, when you initialize an Accelerator object you pass the value from args.mixed_precision itself:
https://github.com/huggingface/diffusers/blob/1fff527702399165f09dd880be43cfd8b8bae472/examples/text_to_image/train_text_to_image_lora.py#L385
So, passing mixed_precision to your CLI args is recommended.
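To illustrate the circular dependency, here is a minimal, self-contained sketch (the argument handling is a simplified stand-in for the script's actual parser):

```python
import argparse
from accelerate import Accelerator

# Hypothetical, minimal reproduction of the script's argument handling.
parser = argparse.ArgumentParser()
parser.add_argument("--mixed_precision", default=None, choices=["no", "fp16", "bf16"])
args = parser.parse_args([])  # no --mixed_precision passed, as in the README example

# The Accelerator is built from the script's own flag ...
accelerator = Accelerator(mixed_precision=args.mixed_precision)

# ... but `accelerate launch --mixed_precision="fp16"` configures fp16 through the
# environment, so the two values can disagree:
print(args.mixed_precision)         # None
print(accelerator.mixed_precision)  # "fp16" when launched that way, "no" otherwise
```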
@sayakpaul 👌, thanks.
However, I noticed a difference between train_text_to_image.py and train_text_to_image_lora.py: the LoRA script does not reassign args.mixed_precision from the accelerator. Because of that, if you pass --mixed_precision="fp16" to accelerate launch, you also have to pass the same --mixed_precision="fp16" to the script's CLI args; only then does the error go away. For example:
accelerate launch --mixed_precision="fp16" train_text_to_image_lora.py \
--mixed_precision="fp16" \
......
train_text_to_image.py: https://github.com/huggingface/diffusers/blob/1fff527702399165f09dd880be43cfd8b8bae472/examples/text_to_image/train_text_to_image.py#L811-L816
train_text_to_image_lora.py: https://github.com/huggingface/diffusers/blob/1fff527702399165f09dd880be43cfd8b8bae472/examples/text_to_image/train_text_to_image_lora.py#L444-L448
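For comparison, the two linked snippets look roughly like this (paraphrased from that commit; accelerator and args are the scripts' existing objects, and exact lines may differ):

```python
import torch

# train_text_to_image.py: keeps args.mixed_precision in sync with the accelerator
weight_dtype = torch.float32
if accelerator.mixed_precision == "fp16":
    weight_dtype = torch.float16
    args.mixed_precision = accelerator.mixed_precision
elif accelerator.mixed_precision == "bf16":
    weight_dtype = torch.bfloat16
    args.mixed_precision = accelerator.mixed_precision

# train_text_to_image_lora.py: only sets weight_dtype, so args.mixed_precision
# stays at its default (None) unless it is also passed on the command line
weight_dtype = torch.float32
if accelerator.mixed_precision == "fp16":
    weight_dtype = torch.float16
elif accelerator.mixed_precision == "bf16":
    weight_dtype = torch.bfloat16
```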
Maybe the example in the docs needs to be updated:
https://github.com/huggingface/diffusers/tree/main/examples/text_to_image
Should be fixed with: https://github.com/huggingface/diffusers/issues/6388. Could you pull the changes and try again? :)
Hi @sayakpaul, the problem with running train_text_to_image_lora.py still persists for me. I have pulled the latest changes from the GitHub repo.
Could you maybe refer to https://github.com/huggingface/diffusers/issues/6552 and open a PR?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
can we close this one now?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I encountered the same issue on diffusers==0.30.0.dev0. The workaround of passing the additional CLI arg works on this version as well.
Just encountered this issue. Not stale.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Assuming it's a duplicate of https://github.com/huggingface/diffusers/issues/8871 I am closing this.
Hi, I also encountered the same issue. The workaround is either to specify --mixed_precision="fp16" on the command line or to add args.mixed_precision = accelerator.mixed_precision to the block that sets weight_dtype:
weight_dtype = torch.float32
if accelerator.mixed_precision == "fp16":
    weight_dtype = torch.float16
    args.mixed_precision = accelerator.mixed_precision
elif accelerator.mixed_precision == "bf16":
    weight_dtype = torch.bfloat16
    args.mixed_precision = accelerator.mixed_precision
I hope scripts such as train_text_to_image.py and train_text_to_image_lora.py can be kept in sync; that would be much friendlier for newcomers.
Encountered it too, even with the --mixed_precision flag and the new code, but after resuming from a checkpoint it works. Please help.
accelerate launch --mixed_precision="fp16" train_dreambooth_lora.py --pretrained_model_name_or_path="stabilityai/stable-diffusion-2-1" --instance_data_dir="C:/Users/Henri/logo-generator/Dataset/images" --output_dir="C:/Users/Henri/logo-lora-output" --resolution=512 --train_batch_size=1 --mixed_precision="fp16" --gradient_accumulation_steps=1 --learning_rate=3e-5 --lr_scheduler="constant" --lr_warmup_steps=0 --max_train_steps=2500 --checkpointing_steps=750 --validation_prompt="a henrikstyle logo for a tech company, minimal flat design, white background" --instance_prompt="a logo in henrikstyle" --seed=42 --gradient_checkpointing This is my Script that iam running