[Dreambooth Example] Attempting to unscale FP16 gradients.
Describe the bug
I had the training script working fine but then I updated diffusers to 0.7.2 and now I get the following error:
Traceback (most recent call last):
File "/tmp/pycharm_project_990/train_dreambooth.py", line 938, in <module>
main(args)
File "/tmp/pycharm_project_990/train_dreambooth.py", line 876, in main
optimizer.step()
File "/opt/conda/envs/dreambooth/lib/python3.7/site-packages/accelerate/optimizer.py", line 134, in step
self.scaler.step(self.optimizer, closure)
File "/opt/conda/envs/dreambooth/lib/python3.7/site-packages/torch/cuda/amp/grad_scaler.py", line 337, in step
self.unscale_(optimizer)
File "/opt/conda/envs/dreambooth/lib/python3.7/site-packages/torch/cuda/amp/grad_scaler.py", line 282, in unscale_
optimizer_state["found_inf_per_device"] = self._unscale_grads_(optimizer, inv_scale, found_inf, False)
File "/opt/conda/envs/dreambooth/lib/python3.7/site-packages/torch/cuda/amp/grad_scaler.py", line 210, in _unscale_grads_
raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
Steps: 0%| | 0/800 [00:18<?, ?it/s]
Any ideas, or do I need to downgrade?
Reproduction
No response
Logs
No response
System Info
diffusers 0.7.2, Python 3.7.12, accelerate 0.14.0
A bit more info: this happens on a fresh install when I set --mixed_precision fp16 and --revision fp16.
Same here, but it's my first time trying to train.
@jpiabrantes I guess it will help if one of us can bisect this (i.e. find the exact commit where the bug was introduced) :sweat_smile: What were you using before 0.7.2? 0.7.1?
Same bug. I tried the fix here by changing the PyTorch source directly and setting allow_fp16 = True; the training went through, but the model only output black images:
https://github.com/facebookresearch/fairscale/issues/834
As per the linked issue above, the actual issue seems to be in PyTorch. In PyTorch/torch/cuda/amp/grad_scaler.py#L279 we have:
def unscale_(self, optimizer):
    # ...
    # The final False here is the allow_fp16=False argument to _unscale_grads_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(optimizer, inv_scale, found_inf, False)
    # ...
Here are some relevant parts from _unscale_grads_:
def _unscale_grads_(self, optimizer, inv_scale, found_inf, allow_fp16):
    # ...
    if (not allow_fp16) and param.grad.dtype == torch.float16:
        raise ValueError("Attempting to unscale FP16 gradients.")
    # ...
In https://github.com/pytorch/pytorch/issues/74739 it's questioned why fp16 is disallowed.
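To see the PyTorch-level behaviour in isolation, here's a minimal sketch that triggers the same error (just an illustration of the guard above, not the Dreambooth code path; any tiny module will do, and it needs a CUDA GPU):

import torch

# fp16 weights produce fp16 gradients, which GradScaler refuses to unscale by default
model = torch.nn.Linear(4, 4).cuda().half()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

with torch.cuda.amp.autocast():
    loss = model(torch.randn(2, 4, device="cuda")).float().mean()

scaler.scale(loss).backward()
scaler.step(optimizer)  # raises ValueError: Attempting to unscale FP16 gradients.
scaler.update()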
@jpiabrantes (OP), since you had this working before, could you confirm not just the diffusers version you were using, but also PyTorch, and anything else you think might have changed? It would be really helpful if you could provide a known working configuration.
@cian0 thanks for mentioning that you already tried changing this value manually and it didn't work, saving us all some time! :pray:
To add, I also tried reverting the diffusers version. I am using Shivam's repo, which is a fork of diffusers; he merged the latest version of diffusers, which caused the issues.
When I revert back to a commit from 15 days ago (before he had merged) and downgrade my diffusers via pip, everything works again. The PyTorch version never changed (v1.13.0). I also thought it was PyTorch and was going to downgrade it if the revert didn't work, but apparently I didn't have to.
I was using Shivam's repo as well with the fork of diffusers.
Edit: the below can all be disregarded; there's much more progress in my next comment.
Ok thanks @jpiabrantes. I'm able to reproduce consistently in pure diffusers. It's been excessively difficult to track down... I admit I'm quite new to Python but this all just seems crazy. I'll mention my findings so far, and maybe @patrickvonplaten can share some insight from his experience.
It seems the minute we switch from the 0.7.0.dev0 label to 0.7.0 we have this issue (and only changing the label, literally the commit that changes the label with no other code change), and then there's no going back again (you have to uninstall and reinstall a bunch of unrelated packages, otherwise commits that worked fine before no longer work). This will be clearer in these steps:
1. Create a fresh starting point
# Uninstalling all of these is the only way I can reliably "reset" the broken state,
# even though all of the versions stayed the same.
pip uninstall -r examples/dreambooth/requirements.txt
pip install accelerate torchvision ftfy
# Need this as a good starting point, otherwise other weird stuff is broken
pip install git+https://github.com/huggingface/diffusers@v0.6.0
2. Install the last commit that works (118c5be - "Docs: Do not require PyTorch nightlies")
$ pip install --no-cache git+https://github.com/huggingface/diffusers@118c5be94a2b8eb90fa41a2ceb59b3a8de9e0218
$ ./train # works!
pip log
$ pip install --no-cache git+https://github.com/huggingface/diffusers@118c5be94a2b8eb90fa41a2ceb59b3a8de9e0218 Defaulting to user installation because normal site-packages is not writeable Collecting git+https://github.com/huggingface/diffusers@118c5be94a2b8eb90fa41a2ceb59b3a8de9e0218 Cloning https://github.com/huggingface/diffusers (to revision 118c5be94a2b8eb90fa41a2ceb59b3a8de9e0218) to /tmp/pip-req-build-dg_2r41v Running command git clone --filter=blob:none --quiet https://github.com/huggingface/diffusers /tmp/pip-req-build-dg_2r41v Running command git rev-parse -q --verify 'sha^118c5be94a2b8eb90fa41a2ceb59b3a8de9e0218' Running command git fetch -q https://github.com/huggingface/diffusers 118c5be94a2b8eb90fa41a2ceb59b3a8de9e0218 Running command git checkout -q 118c5be94a2b8eb90fa41a2ceb59b3a8de9e0218 Resolved https://github.com/huggingface/diffusers to commit 118c5be94a2b8eb90fa41a2ceb59b3a8de9e0218 Installing build dependencies ... done Getting requirements to build wheel ... done Preparing metadata (pyproject.toml) ... done Requirement already satisfied: regex!=2019.12.17 in /home/dragon/.local/lib/python3.10/site-packages (from diffusers==0.7.0.dev0) (2022.8.17) Requirement already satisfied: filelock in /home/dragon/.local/lib/python3.10/site-packages (from diffusers==0.7.0.dev0) (3.8.0) Requirement already satisfied: numpy in /usr/lib/python3.10/site-packages (from diffusers==0.7.0.dev0) (1.23.4) Requirement already satisfied: huggingface-hub>=0.10.0 in /home/dragon/.local/lib/python3.10/site-packages (from diffusers==0.7.0.dev0) (0.11.0) Requirement already satisfied: Pillow=5.1 in /usr/lib/python3.10/site-packages (from huggingface-hub>=0.10.0->diffusers==0.7.0.dev0) (6.0) Requirement already satisfied: tqdm in /usr/lib/python3.10/site-packages (from huggingface-hub>=0.10.0->diffusers==0.7.0.dev0) (4.64.1) Requirement already satisfied: packaging>=20.9 in /usr/lib/python3.10/site-packages (from huggingface-hub>=0.10.0->diffusers==0.7.0.dev0) (21.3) Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/lib/python3.10/site-packages (from huggingface-hub>=0.10.0->diffusers==0.7.0.dev0) (4.4.0) Requirement already satisfied: zipp>=0.5 in /usr/lib/python3.10/site-packages (from importlib-metadata->diffusers==0.7.0.dev0) (3.10.0) Requirement already satisfied: certifi>=2017.4.17 in /usr/lib/python3.10/site-packages (from requests->diffusers==0.7.0.dev0) (2022.9.24) Requirement already satisfied: idna=2.5 in /home/dragon/.local/lib/python3.10/site-packages (from requests->diffusers==0.7.0.dev0) (2.10) Requirement already satisfied: chardet=3.0.2 in /home/dragon/.local/lib/python3.10/site-packages (from requests->diffusers==0.7.0.dev0) (4.0.0) Requirement already satisfied: urllib3=1.21.1 in /usr/lib/python3.10/site-packages (from requests->diffusers==0.7.0.dev0) (1.26.12) Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/lib/python3.10/site-packages (from packaging>=20.9->huggingface-hub>=0.10.0->diffusers==0.7.0.dev0) (3.0.9)$ ./train Steps: 0%| | 1/400 [00:02<14:02, 2.11s/it, loss=0.135, lr=5e-6]â•â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ Traceback (most recent call last) ──────────────────────╮
This should actually be considered WORKING because training starts
OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 15.75 GiB total capacity; 14.26 GiB already allocated; 222.56 MiB free; 14.49 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF CalledProcessError: Command '['/usr/bin/python', 'train_dreambooth.py', '--pretrained_model_name_or_path=CompVis/stable-diffusion-v1-4', '--instance_data_dir=instance_images', '--output_dir=output_dir', '--instance_prompt=a photo of sks dog', '--resolution=512', '--train_batch_size=1', '--gradient_accumulation_steps=1', '--learning_rate=5e-6', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--max_train_steps=400', '--revision=fp16', '--mixed_precision=fp16']' returned non-zero exit status 1.
3. Install the first commit that breaks (1578679 - "Release: v0.7.0")
$ pip install --no-cache git+https://github.com/huggingface/diffusers@1578679ff4a4ff8157214081438aa7d78f13b4fc
$ ./train # ValueError: Attempting to unscale FP16 gradients.
pip log
$ pip install --no-cache git+https://github.com/huggingface/diffusers@1578679ff4a4ff8157214081438aa7d78f13b4fc Defaulting to user installation because normal site-packages is not writeable Collecting git+https://github.com/huggingface/diffusers@1578679ff4a4ff8157214081438aa7d78f13b4fc Cloning https://github.com/huggingface/diffusers (to revision 1578679ff4a4ff8157214081438aa7d78f13b4fc) to /tmp/pip-req-build-ce_8p_6r Running command git clone --filter=blob:none --quiet https://github.com/huggingface/diffusers /tmp/pip-req-build-ce_8p_6r Running command git rev-parse -q --verify 'sha^1578679ff4a4ff8157214081438aa7d78f13b4fc' Running command git fetch -q https://github.com/huggingface/diffusers 1578679ff4a4ff8157214081438aa7d78f13b4fc Running command git checkout -q 1578679ff4a4ff8157214081438aa7d78f13b4fc Resolved https://github.com/huggingface/diffusers to commit 1578679ff4a4ff8157214081438aa7d78f13b4fc Installing build dependencies ... done Getting requirements to build wheel ... done Preparing metadata (pyproject.toml) ... done Requirement already satisfied: huggingface-hub>=0.10.0 in /home/dragon/.local/lib/python3.10/site-packages (from diffusers==0.7.0) (0.11.0) Requirement already satisfied: regex!=2019.12.17 in /home/dragon/.local/lib/python3.10/site-packages (from diffusers==0.7.0) (2022.8.17) Requirement already satisfied: filelock in /home/dragon/.local/lib/python3.10/site-packages (from diffusers==0.7.0) (3.8.0) Requirement already satisfied: Pillow=20.9 in /usr/lib/python3.10/site-packages (from huggingface-hub>=0.10.0->diffusers==0.7.0) (21.3) Requirement already satisfied: tqdm in /usr/lib/python3.10/site-packages (from huggingface-hub>=0.10.0->diffusers==0.7.0) (4.64.1) Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/lib/python3.10/site-packages (from huggingface-hub>=0.10.0->diffusers==0.7.0) (4.4.0) Requirement already satisfied: pyyaml>=5.1 in /usr/lib/python3.10/site-packages (from huggingface-hub>=0.10.0->diffusers==0.7.0) (6.0) Requirement already satisfied: zipp>=0.5 in /usr/lib/python3.10/site-packages (from importlib-metadata->diffusers==0.7.0) (3.10.0) Requirement already satisfied: chardet=3.0.2 in /home/dragon/.local/lib/python3.10/site-packages (from requests->diffusers==0.7.0) (4.0.0) Requirement already satisfied: idna=2.5 in /home/dragon/.local/lib/python3.10/site-packages (from requests->diffusers==0.7.0) (2.10) Requirement already satisfied: certifi>=2017.4.17 in /usr/lib/python3.10/site-packages (from requests->diffusers==0.7.0) (2022.9.24) Requirement already satisfied: urllib3=1.21.1 in /usr/lib/python3.10/site-packages (from requests->diffusers==0.7.0) (1.26.12) Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/lib/python3.10/site-packages (from packaging>=20.9->huggingface-hub>=0.10.0->diffusers==0.7.0) (3.0.9) Building wheels for collected packages: diffusers Building wheel for diffusers (pyproject.toml) ... 
done Created wheel for diffusers: filename=diffusers-0.7.0-py3-none-any.whl size=305127 sha256=6e9b12cc0ee68b4250af92ced689ef3d44aadbbc5fca57acf3355797da765f91 Stored in directory: /tmp/pip-ephem-wheel-cache-lg6be8e4/wheels/d5/ad/09/71a9b17f6282e5cc00f53be606e4e230db6962308ae661308f Successfully built diffusers Installing collected packages: diffusers Attempting uninstall: diffusers Found existing installation: diffusers 0.7.0.dev0 Uninstalling diffusers-0.7.0.dev0: Successfully uninstalled diffusers-0.7.0.dev0 Successfully installed diffusers-0.7.0 (base) [dragon@dragon d2]$ ./train /home/dragon/.local/lib/python3.10/site-packages/accelerate/accelerator.py:205: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed. warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.") {'weight_dtype': torch.float16} Steps: 0%| | 0/400 [00:00, ?it/s] ValueError: Attempting to unscale FP16 gradients. Steps: 0%| | 0/400 [00:01, ?it/s] CalledProcessError: Command '['/usr/bin/python', 'train_dreambooth.py', '--pretrained_model_name_or_path=CompVis/stable-diffusion-v1-4', '--instance_data_dir=instance_images', '--output_dir=output_dir', '--instance_prompt=a photo of sks dog', '--resolution=512', '--train_batch_size=1', '--gradient_accumulation_steps=1', '--learning_rate=5e-6', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--max_train_steps=400', '--revision=fp16', '--mixed_precision=fp16']' returned non-zero exit status 1.
After this (it breaking), you can literally go back to any previous commit that worked before - which all have the version 0.7.0.dev0 - and we still get the ValueError: Attempting to unscale FP16 gradients. Going back to v0.6.0 works! But then going back to later commits with 0.7.0.dev0 still fails. Except if you do it in a weird funky order that I'm still figuring out, then it works (the previous working commit fails, even earlier commits fail, but go back to 0.6.0, and now if you go back to one of the earlier commits, it works, and back to the previous working commit, which is now working again too :man_shrugging:).
I thought maybe it was a deps issue, but pip list before and after the problem arises remains identical (except for that first move from 0.7.0.dev0 to 0.7.0; but move back again and the deps match and the problem remains).
There really must be some weird Python thing going on here that's just beyond me... I could guess some file is being overwritten somewhere, and the fact that a lot of the different commits still have the same 0.7.0.dev0 version is confusing things... but pip --no-cache doesn't make any difference, even when it says it's creating a new wheel.
I hope I don't sound crazy, because working through all this definitely makes me feel crazy. I hope I've given enough info to hint to someone who knows the Python ecosystem more intimately to have an idea of what's going on. Happy to help out however else I can.
@patil-suraj could you please take a look here?
Ok, I'm not sure what's up with my setup, but I had much better luck inside a docker container using git checkout directly. My sanity has returned and I've bisected the issue to this commit:
https://github.com/huggingface/diffusers/commit/7482178162b779506a54538f2cf2565c8b88c597 ("default fast model loading")
:tada:
And indeed, passing fast_load=False to the unet loader (and only the unet) is enough to get this working (with slower loads, of course). On the above commit, at least; on the latest main, I get the error again. Back to bisecting :D
Ok, so even with fast_load=False on the unet, this breaks again in:
https://github.com/huggingface/diffusers/commit/42bb459457d77d6185f74cbc32f2a08b08876af5 ("[Low cpu memory] Correct naming and improve default usage")
Setting low_cpu_mem_usage=False on the unet fixes this one too, all the way up to the most recent commit on main.
So basically, it's currently possible to work around this issue with:
unet = UNet2DConditionModel.from_pretrained(
    args.pretrained_model_name_or_path,
    subfolder="unet",
    revision=args.revision,
    # Add these two lines to work around the issue
    fast_load=False,
    low_cpu_mem_usage=False,
)
Ok, that's it from me for the day... sorry for all the traffic. But I think this will be very helpful for @patil-suraj :sweat_smile: And it has been a great make-up experience for me personally after my last attempt :sweat_smile: :sweat_smile:
Thanks for the detailed issue, taking a look now.
Okay, think I know where the issue is coming from.
The issue is that we are using fp16 weights to do mixed-precision training. When we set mixed_precision="fp16", accelerate uses torch.cuda.amp.autocast to do mixed-precision training; note that this is not full fp16 training.
From the torch.cuda.amp.autocast docs:
When entering an autocast-enabled region, Tensors may be any type. You should not call half() or bfloat16() on your model(s) or inputs when using autocasting.
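(For context, the usual amp recipe keeps the master weights in fp32 and lets autocast pick the per-op precision. A toy sketch of that pattern, purely illustrative and not the accelerate internals, requiring CUDA:)

import torch

model = torch.nn.Linear(4, 4).cuda()  # weights stay fp32
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

for _ in range(3):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():  # forward math runs in fp16 where safe
        loss = model(torch.randn(2, 4, device="cuda")).float().mean()
    scaler.scale(loss).backward()    # grads stay fp32, so unscale_ works
    scaler.step(optimizer)
    scaler.update()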
So what's happening is:
- When the weights are loaded using the fast method, the dtype (unless specified with the dtype arg) is that of the saved params, i.e. in the case of revision=fp16 it is fp16.
- When the weights are loaded using the slow method, the weights are always fp32 (unless specified with the dtype arg).
That's why we get the above error with revision=fp16 and mixed_precision="fp16".
To verify:
from diffusers import UNet2DConditionModel
unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet", revision="fp16", low_cpu_mem_usage=False)
unet2 = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet", revision="fp16", low_cpu_mem_usage=True)
print(unet.dtype) # torch.float32
print(unet2.dtype) # torch.float16
@patrickvonplaten I think this should be fixed in modeling_utils; should we make sure that for both methods the weights will have a similar dtype?
Also, @gadicc @jpiabrantes, fast_load is not a valid arg name; it was changed to low_cpu_mem_usage, so we should not set fast_load. Setting low_cpu_mem_usage=False would be a good temporary solution.
Also, since we are doing mixed-precision training here, I would not recommend using the fp16 weights for training.
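So the earlier workaround snippet reduces to the following (same variables as in train_dreambooth.py above; just a sketch of the temporary fix, not a recommendation over using fp32 weights):

from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    args.pretrained_model_name_or_path,  # as in train_dreambooth.py
    subfolder="unet",
    revision=args.revision,
    # Temporary workaround: force the slow loading path, which currently up-casts to fp32
    low_cpu_mem_usage=False,
)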
Ah brilliant, @patil-suraj, thanks so much! All makes total sense (and I guess I should have read that second commit a bit more carefully :sweat_smile:, so thanks for clarifying). I'll leave the final call on which training weights to use to my users, but I think it will boil down to memory / speed / xformers (I got an xformers internal error trying to train with fp32, haven't tried yet on fp16 but it looked related. I'll have a chance to look into that properly tomorrow). Thanks so much for the quick turnaround! And noted that we can use fp32 weights to train with fp16 mixed precision, thanks!
I got an xformers internal error trying to train with fp32
What was the error? I've been using xformers a lot for training, and only with fp32, and it works perfectly in my setup.
Thanks a lot for the nice repo @patil-suraj ! Let's fix this indeed :-)
Here a PR to fix it: https://github.com/huggingface/diffusers/pull/1449
ran into this also!
Is this fixed now after #1449?
Hi all, sorry for the radio silence... some time-sensitive matters snuck up on me. I hope one of the other contributors to this issue can confirm the fix; otherwise I hope to have a chance to try this out on Sunday and promise to report back after.
Thank you both @patil-suraj and @patrickvonplaten for your amazing and quick work here! (And patil-suraj, thanks, I indeed got dreambooth working with fp32 too, it kind of fixed itself but I think I had been loading one of the components with an incompatible model).
:pray:
No worries! If you could confirm this would be nice, but no problem at all if you don't find the time!
@patrickvonplaten thanks for the understanding and patience :pray:
Ok finally had a chance to try this out.
Unfortunately I'm still getting the same error :sweat_smile:
Interestingly enough, with 20ce68f945de7860f9854cd7ee680debf4a07fe5 (Fix dtype model loading #1449) applied, the low_cpu_mem_usage=False workaround stops working too.
This is how I'm launching:
#!/bin/sh
export MODEL_NAME="CompVis/stable-diffusion-v1-4" # <--
export INSTANCE_DIR="instance_images"
export OUTPUT_DIR="output_dir"
accelerate launch ./train_dreambooth.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --instance_prompt="a photo of sks dog" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=1 \
  --revision="fp16" \
  --mixed_precision="fp16" \
  --use_8bit_adam  # <-- fp16 revision + fp16 mixed precision
@gadicc #1449 won't really fix the issue that you are having; with #1449 the dtype of the loaded model will be similar to its saved dtype no matter the loading method. So if we are loading the weights from the fp16 branch, then after loading the model the weights will still be fp16, and that doesn't play well with mixed-precision training.
For training I would recommend to always use full-precision weights.
@patrickvonplaten To actually fix this, should we always cast the weights of trainable models to fp32 before starting training, or is it good to let it fail? IMO fp16 weights can create instability issues during training, especially for large training runs.
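For reference, a rough sketch of what that cast could look like in a training script (the model ids and the choice to keep the frozen VAE in half precision are illustrative assumptions, not the proposed diffusers change):

import torch
from diffusers import AutoencoderKL, UNet2DConditionModel

device = "cuda"
weight_dtype = torch.float16  # what mixed_precision="fp16" uses for the frozen parts

# Trainable model: keep (or cast back to) fp32 so GradScaler can unscale its grads
unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet"
).to(device, dtype=torch.float32)

# Frozen models can stay in half precision to save memory
vae = AutoencoderKL.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="vae"
).to(device, dtype=weight_dtype)
vae.requires_grad_(False)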
Thanks, all. With that in mind, I'm going to abandon this option and advise my users accordingly. I think fp32 weights with fp16 mixed_precision training is fine. Big thanks for all the clarifications and especially for sharing your (negative) experience when training with fp16 weights.
Depending on which route you take, I'd suggest:
- If casting to fp32 (which is possibly better than failing), also show a warning explaining why this is ill-advised.
- If not, rather than failing with the current error, show a more helpful error saying this use-case is not supported (and suggesting instead to use fp32 weights with fp16 mixed_precision).
Happy to do a PR for option 2 by the end of the week if that's the chosen direction. Not confident enough (yet) to help with option 1 :)
From my side, we should just let the user have total control over the training example and follow our usual PyTorch-like API/logic.
This means:
- By default, we should use the highest precision, least optimized training options
- We allow all kinds of optimized training options (mixed precision, xformers, ....) and let the script fail if something is not done correctly
Also cc @pcuenca @williamberman
- By default, we should use the highest precision, least optimized training options
- We allow all kinds of optimized training options (mixed precision, xformers, ....) and let the script fail if something is not done correctly
@patrickvonplaten when you say allow the script to fail, do you mean we throw an error when training with fp16 + amp, or we let it train and just have bad outputs? I think letting the script train makes sense, but I would like it if we logged a warning :)
Yeah, by "let it fail" I mean to throw a nice error to the user so that the user has instant feedback that something wasn't done correctly :-)
Think it's never a good idea to "let the user train and have bad outputs"
Ok, I think I follow everything
tl;dr: training/fine-tuning shouldn't be done with fp16 weights[^1]; fp16 inputs are ok with amp + gradient scaling. fp16 weights throw an error when used with amp + gradient scaling. We should check the dtype of the loaded model and throw an informative error before training begins.
I can put up a PR for this in the morning
[^1]: There are precision issues when adding small gradient updates to fp16 weights. That's why training with amp recommends keeping the weights in fp32 for gradient updates and making a copy in half precision for the forward and backward passes.
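A rough sketch of the kind of guard being described (function name and error message are illustrative, not the merged change):

import torch

def assert_trainable_dtype(model, name):
    """Fail fast with an informative error if a model we intend to train was loaded in half precision."""
    dtype = next(model.parameters()).dtype
    if dtype in (torch.float16, torch.bfloat16):
        raise ValueError(
            f"The loaded {name} has dtype {dtype}, which is not supported for training with "
            "mixed precision. Load the model in fp32 (e.g. drop revision='fp16') and use "
            "--mixed_precision for memory/speed savings instead."
        )

# e.g. in the training script, before the training loop:
# assert_trainable_dtype(unet, "unet")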
Since we merged the guard on loading low precision weights, going to close this issue :)
I got a similar error when training LLMs. I am using float16 and loaded a LLaMA model, but training simply hit this error.