
[Dreambooth Example] Attempting to unscale FP16 gradients.

Open jpiabrantes opened this issue 2 years ago • 18 comments

Describe the bug

I had the training script working fine, but then I updated diffusers to 0.7.2 and now I get the following error:

Traceback (most recent call last):
  File "/tmp/pycharm_project_990/train_dreambooth.py", line 938, in <module>
    main(args)
  File "/tmp/pycharm_project_990/train_dreambooth.py", line 876, in main
    optimizer.step()
  File "/opt/conda/envs/dreambooth/lib/python3.7/site-packages/accelerate/optimizer.py", line 134, in step
    self.scaler.step(self.optimizer, closure)
  File "/opt/conda/envs/dreambooth/lib/python3.7/site-packages/torch/cuda/amp/grad_scaler.py", line 337, in step
    self.unscale_(optimizer)
  File "/opt/conda/envs/dreambooth/lib/python3.7/site-packages/torch/cuda/amp/grad_scaler.py", line 282, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(optimizer, inv_scale, found_inf, False)
  File "/opt/conda/envs/dreambooth/lib/python3.7/site-packages/torch/cuda/amp/grad_scaler.py", line 210, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
Steps:   0%|          | 0/800 [00:18<?, ?it/s]

Any ideas, or do I need to downgrade?

Reproduction

No response

Logs

No response

System Info

diffusers 0.7.2, Python 3.7.12, accelerate 0.14.0

jpiabrantes avatar Nov 10 '22 16:11 jpiabrantes

A bit more info: this happens on a fresh install when I set --mixed_precision fp16 and --revision fp16.

jpiabrantes avatar Nov 11 '22 16:11 jpiabrantes

Same here, but it's my first time trying to train.

@jpiabrantes I guess it will help if one of us can bisect this (i.e. find the exact commit where the bug was introduced) :sweat_smile: What were you using before 0.7.2? 0.7.1?

gadicc avatar Nov 13 '22 11:11 gadicc

Same bug. I tried the fix here by changing the PyTorch source directly and setting allow_fp16 = True; the training went through, but the model only output black images:

https://github.com/facebookresearch/fairscale/issues/834

cian0 avatar Nov 16 '22 08:11 cian0

As per the linked issue above, I think the actual issue seems to be in PyTorch?

In PyTorch/torch/cuda/amp/grad_scaler.py#L279 we have:

    def unscale_(self, optimizer):
        # ...
        # The final False here is the allow_fp16=False argument to _unscale_grads_
        optimizer_state["found_inf_per_device"] = self._unscale_grads_(optimizer, inv_scale, found_inf, False)
        # ...

Here are the relevant parts from _unscale_grads_:

    def _unscale_grads_(self, optimizer, inv_scale, found_inf, allow_fp16):
        # ...
                    if (not allow_fp16) and param.grad.dtype == torch.float16:
                        raise ValueError("Attempting to unscale FP16 gradients.")
        # ...

In https://github.com/pytorch/pytorch/issues/74739 it's questioned why fp16 is disallowed.
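
To make the failure mode concrete, here is a minimal sketch (my own, assuming a CUDA device is available) showing that GradScaler.step() only trips this check when the model parameters themselves are fp16; with fp32 parameters under autocast it runs fine:

import torch

device = "cuda"  # GradScaler's unscale path operates on CUDA gradients
model = torch.nn.Linear(4, 4).to(device)   # fp32 params: scaler.step() succeeds
# model = model.half()                     # fp16 params: raises "Attempting to unscale FP16 gradients."
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(8, 4, device=device)
with torch.cuda.amp.autocast():            # forward pass runs in fp16 where safe
    loss = model(x).float().pow(2).mean()

scaler.scale(loss).backward()
scaler.step(optimizer)                     # calls unscale_(), which checks param.grad.dtype
scaler.update()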

@jpiabrantes (OP), since you had this working before, could you confirm not just the diffusers version you were using, but also PyTorch, and anything else you think might have changed? It would be really helpful if you could provide a known working configuration.

@cian0 thanks for mentioning that you already tried changing this value manually and it didn't work, saving us all some time! :pray:

gadicc avatar Nov 16 '22 15:11 gadicc

To add: I also tried reverting the diffusers version. I'm using Shivam's repo, which is a fork of diffusers; he merged the latest version of diffusers, which caused the issues.

When I revert to a commit from 15 days ago (before he had merged) and downgrade my diffusers via pip, everything works again. The PyTorch version never changed (v1.13.0). I also thought it was PyTorch and was going to downgrade it if the revert didn't work, but apparently I didn't have to.

cian0 avatar Nov 17 '22 03:11 cian0

I was using Shivam's repo as well with the fork of diffusers.

jpiabrantes avatar Nov 17 '22 10:11 jpiabrantes

Edit: the below can all be disregarded; there's much more progress in my next comment.

Ok thanks @jpiabrantes. I'm able to reproduce consistently in pure diffusers. It's been excessively difficult to track down... I admit I'm quite new to Python but this all just seems crazy. I'll mention my findings so far, and maybe @patrickvonplaten can share some insight from his experience.

It seems that the minute we switch from the 0.7.0.dev0 label to 0.7.0 we hit this issue (and only by changing the label - literally the commit that changes the version label, with no other code change), and then there's no going back (you have to uninstall and reinstall a bunch of unrelated packages, otherwise commits that worked fine before no longer work). This will be clearer in the steps below:

1. Create a fresh starting point

# Uninstalling all of these is the only way I can reliably "reset" the broken state
# Even though all of the versions stayed the same.
pip uninstall -r examples/dreambooth/requirements.txt
pip install accelerate torchvision ftfy

# Need this as a good starting point, otherwise other weird stuff is broken
pip install git+https://github.com/huggingface/[email protected]

2. Install the last commit that works (118c5be - "Docs: Do not require PyTorch nightlies")

$ pip install --no-cache git+https://github.com/huggingface/diffusers@118c5be94a2b8eb90fa41a2ceb59b3a8de9e0218
$ ./train # works!
pip log
$ pip install --no-cache git+https://github.com/huggingface/diffusers@118c5be94a2b8eb90fa41a2ceb59b3a8de9e0218
Defaulting to user installation because normal site-packages is not writeable
Collecting git+https://github.com/huggingface/diffusers@118c5be94a2b8eb90fa41a2ceb59b3a8de9e0218
  Cloning https://github.com/huggingface/diffusers (to revision 118c5be94a2b8eb90fa41a2ceb59b3a8de9e0218) to /tmp/pip-req-build-dg_2r41v
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/diffusers /tmp/pip-req-build-dg_2r41v
  Running command git rev-parse -q --verify 'sha^118c5be94a2b8eb90fa41a2ceb59b3a8de9e0218'
  Running command git fetch -q https://github.com/huggingface/diffusers 118c5be94a2b8eb90fa41a2ceb59b3a8de9e0218
  Running command git checkout -q 118c5be94a2b8eb90fa41a2ceb59b3a8de9e0218
  Resolved https://github.com/huggingface/diffusers to commit 118c5be94a2b8eb90fa41a2ceb59b3a8de9e0218
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: regex!=2019.12.17 in /home/dragon/.local/lib/python3.10/site-packages (from diffusers==0.7.0.dev0) (2022.8.17)
Requirement already satisfied: filelock in /home/dragon/.local/lib/python3.10/site-packages (from diffusers==0.7.0.dev0) (3.8.0)
Requirement already satisfied: numpy in /usr/lib/python3.10/site-packages (from diffusers==0.7.0.dev0) (1.23.4)
Requirement already satisfied: huggingface-hub>=0.10.0 in /home/dragon/.local/lib/python3.10/site-packages (from diffusers==0.7.0.dev0) (0.11.0)
Requirement already satisfied: Pillow=5.1 in /usr/lib/python3.10/site-packages (from huggingface-hub>=0.10.0->diffusers==0.7.0.dev0) (6.0)
Requirement already satisfied: tqdm in /usr/lib/python3.10/site-packages (from huggingface-hub>=0.10.0->diffusers==0.7.0.dev0) (4.64.1)
Requirement already satisfied: packaging>=20.9 in /usr/lib/python3.10/site-packages (from huggingface-hub>=0.10.0->diffusers==0.7.0.dev0) (21.3)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/lib/python3.10/site-packages (from huggingface-hub>=0.10.0->diffusers==0.7.0.dev0) (4.4.0)
Requirement already satisfied: zipp>=0.5 in /usr/lib/python3.10/site-packages (from importlib-metadata->diffusers==0.7.0.dev0) (3.10.0)
Requirement already satisfied: certifi>=2017.4.17 in /usr/lib/python3.10/site-packages (from requests->diffusers==0.7.0.dev0) (2022.9.24)
Requirement already satisfied: idna=2.5 in /home/dragon/.local/lib/python3.10/site-packages (from requests->diffusers==0.7.0.dev0) (2.10)
Requirement already satisfied: chardet=3.0.2 in /home/dragon/.local/lib/python3.10/site-packages (from requests->diffusers==0.7.0.dev0) (4.0.0)
Requirement already satisfied: urllib3=1.21.1 in /usr/lib/python3.10/site-packages (from requests->diffusers==0.7.0.dev0) (1.26.12)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/lib/python3.10/site-packages (from packaging>=20.9->huggingface-hub>=0.10.0->diffusers==0.7.0.dev0) (3.0.9)

$ ./train Steps: 0%| | 1/400 [00:02<14:02, 2.11s/it, loss=0.135, lr=5e-6]╭───────────────────── Traceback (most recent call last) ──────────────────────╮

This should actually be considered WORKING because training starts

OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 15.75 GiB total capacity; 14.26 GiB already allocated; 222.56 MiB free; 14.49 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF CalledProcessError: Command '['/usr/bin/python', 'train_dreambooth.py', '--pretrained_model_name_or_path=CompVis/stable-diffusion-v1-4', '--instance_data_dir=instance_images', '--output_dir=output_dir', '--instance_prompt=a photo of sks dog', '--resolution=512', '--train_batch_size=1', '--gradient_accumulation_steps=1', '--learning_rate=5e-6', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--max_train_steps=400', '--revision=fp16', '--mixed_precision=fp16']' returned non-zero exit status 1.

3. Install the first commit that breaks (1578679 - "Release: v0.7.0")

$ pip install --no-cache git+https://github.com/huggingface/diffusers@1578679ff4a4ff8157214081438aa7d78f13b4fc
$ ./train # ValueError: Attempting to unscale FP16 gradients.
pip log
$ pip install --no-cache git+https://github.com/huggingface/diffusers@1578679ff4a4ff8157214081438aa7d78f13b4fc
Defaulting to user installation because normal site-packages is not writeable
Collecting git+https://github.com/huggingface/diffusers@1578679ff4a4ff8157214081438aa7d78f13b4fc
  Cloning https://github.com/huggingface/diffusers (to revision 1578679ff4a4ff8157214081438aa7d78f13b4fc) to /tmp/pip-req-build-ce_8p_6r
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/diffusers /tmp/pip-req-build-ce_8p_6r
  Running command git rev-parse -q --verify 'sha^1578679ff4a4ff8157214081438aa7d78f13b4fc'
  Running command git fetch -q https://github.com/huggingface/diffusers 1578679ff4a4ff8157214081438aa7d78f13b4fc
  Running command git checkout -q 1578679ff4a4ff8157214081438aa7d78f13b4fc
  Resolved https://github.com/huggingface/diffusers to commit 1578679ff4a4ff8157214081438aa7d78f13b4fc
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: huggingface-hub>=0.10.0 in /home/dragon/.local/lib/python3.10/site-packages (from diffusers==0.7.0) (0.11.0)
Requirement already satisfied: regex!=2019.12.17 in /home/dragon/.local/lib/python3.10/site-packages (from diffusers==0.7.0) (2022.8.17)
Requirement already satisfied: filelock in /home/dragon/.local/lib/python3.10/site-packages (from diffusers==0.7.0) (3.8.0)
Requirement already satisfied: Pillow=20.9 in /usr/lib/python3.10/site-packages (from huggingface-hub>=0.10.0->diffusers==0.7.0) (21.3)
Requirement already satisfied: tqdm in /usr/lib/python3.10/site-packages (from huggingface-hub>=0.10.0->diffusers==0.7.0) (4.64.1)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/lib/python3.10/site-packages (from huggingface-hub>=0.10.0->diffusers==0.7.0) (4.4.0)
Requirement already satisfied: pyyaml>=5.1 in /usr/lib/python3.10/site-packages (from huggingface-hub>=0.10.0->diffusers==0.7.0) (6.0)
Requirement already satisfied: zipp>=0.5 in /usr/lib/python3.10/site-packages (from importlib-metadata->diffusers==0.7.0) (3.10.0)
Requirement already satisfied: chardet=3.0.2 in /home/dragon/.local/lib/python3.10/site-packages (from requests->diffusers==0.7.0) (4.0.0)
Requirement already satisfied: idna=2.5 in /home/dragon/.local/lib/python3.10/site-packages (from requests->diffusers==0.7.0) (2.10)
Requirement already satisfied: certifi>=2017.4.17 in /usr/lib/python3.10/site-packages (from requests->diffusers==0.7.0) (2022.9.24)
Requirement already satisfied: urllib3=1.21.1 in /usr/lib/python3.10/site-packages (from requests->diffusers==0.7.0) (1.26.12)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/lib/python3.10/site-packages (from packaging>=20.9->huggingface-hub>=0.10.0->diffusers==0.7.0) (3.0.9)
Building wheels for collected packages: diffusers
  Building wheel for diffusers (pyproject.toml) ... done
  Created wheel for diffusers: filename=diffusers-0.7.0-py3-none-any.whl size=305127 sha256=6e9b12cc0ee68b4250af92ced689ef3d44aadbbc5fca57acf3355797da765f91
  Stored in directory: /tmp/pip-ephem-wheel-cache-lg6be8e4/wheels/d5/ad/09/71a9b17f6282e5cc00f53be606e4e230db6962308ae661308f
Successfully built diffusers
Installing collected packages: diffusers
  Attempting uninstall: diffusers
    Found existing installation: diffusers 0.7.0.dev0
    Uninstalling diffusers-0.7.0.dev0:
      Successfully uninstalled diffusers-0.7.0.dev0
Successfully installed diffusers-0.7.0
(base) [dragon@dragon d2]$ ./train
/home/dragon/.local/lib/python3.10/site-packages/accelerate/accelerator.py:205: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
  warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
{'weight_dtype': torch.float16}
Steps:   0%|                                            | 0/400 [00:00, ?it/s]
ValueError: Attempting to unscale FP16 gradients.
Steps:   0%|                                            | 0/400 [00:01, ?it/s]
CalledProcessError: Command '['/usr/bin/python', 'train_dreambooth.py', 
'--pretrained_model_name_or_path=CompVis/stable-diffusion-v1-4', 
'--instance_data_dir=instance_images', '--output_dir=output_dir', 
'--instance_prompt=a photo of sks dog', '--resolution=512', 
'--train_batch_size=1', '--gradient_accumulation_steps=1', 
'--learning_rate=5e-6', '--lr_scheduler=constant', '--lr_warmup_steps=0', 
'--max_train_steps=400', '--revision=fp16', '--mixed_precision=fp16']' returned 
non-zero exit status 1.

After this (it breaking), you can go back to any previous commit that worked before - they all have the version 0.7.0.dev0 - and still get the ValueError: Attempting to unscale FP16 gradients. Going back to v0.6.0 works! But then going back to later commits with 0.7.0.dev0 still fails. Except if you do it in a weird, funky order that I'm still figuring out, then it works (the previously working commit fails, even earlier commits fail, but go back to 0.6.0, then to one of the earlier commits, and it works - and the previously working commit now works again too :man_shrugging:).

I thought maybe it was a deps issue, but pip list before and after the problem arises remains identical (except for that first move from 0.7.0.dev0 to 0.7.0; move back again and the deps match, yet the problem remains).

There really must be some weird Python thing going on here that's just beyond me... I could guess some file is being overwritten somewhere, and the fact that a lot of the different commits still have the same 0.7.0.dev0 version is confusing things... but pip --no-cache doesn't make any difference, even when it says it's creating a new wheel.

I hope I don't sound crazy, because working through all this definitely makes me feel crazy. I hope I've given enough info to hint to someone who knows the Python ecosystem more intimately to have an idea of what's going on. Happy to help out however else I can.

gadicc avatar Nov 17 '22 13:11 gadicc

@patil-suraj could you please take a look here?

patrickvonplaten avatar Nov 20 '22 18:11 patrickvonplaten

Ok I'm not sure what's up with my setup, but I had much better luck inside a docker container using git checkout directly. My sanity has returned and I've bisected the issue to this commit:

https://github.com/huggingface/diffusers/commit/7482178162b779506a54538f2cf2565c8b88c597 default fast model loading

:tada:

gadicc avatar Nov 21 '22 13:11 gadicc

And indeed, passing fast_load=False to the unet loader (and only the unet) is enough to get this working (with slower loads, of course) - on the above commit, at least. On the latest main, I get the error again. Back to bisecting :D

gadicc avatar Nov 21 '22 14:11 gadicc

Ok, so, even with fast_load=False on the unet, this breaks again in:

https://github.com/huggingface/diffusers/commit/42bb459457d77d6185f74cbc32f2a08b08876af5 [Low cpu memory] Correct naming and improve default usage

Setting low_cpu_mem_usage=False on the unet fixes this one too, all the way up to the most recent commit on main.

So basically, it's currently possible to work around this issue with:

    unet = UNet2DConditionModel.from_pretrained(
        args.pretrained_model_name_or_path,
        subfolder="unet",
        revision=args.revision,
        # Add the two args below to work around the issue
        fast_load=False,
        low_cpu_mem_usage=False,
    )

Ok, that's it from me for the day... sorry for all the traffic. But I think this will be very helpful for @patil-suraj :sweat_smile: And has been a great make up experience for me personally after my last attempt :sweat_smile: :sweat_smile:

gadicc avatar Nov 21 '22 14:11 gadicc

Thanks for the detailed issue, taking a look now.

patil-suraj avatar Nov 21 '22 14:11 patil-suraj

Okay, I think I know where the issue is coming from.

The issue is that we are using fp16 weights to do mixed-precision training. When we set mixed_precision="fp16", accelerate uses torch.cuda.amp.autocast to do mixed-precision training; note that this is not full fp16 training.

From the torch.cuda.amp.autocast docs:

When entering an autocast-enabled region, Tensors may be any type. You should not call half() or bfloat16() on your model(s) or inputs when using autocasting.

So what's happening is:

  • When the weights are loaded using the fast method, the dtype (unless specified with the dtype arg) is that of the saved params, i.e. in the case of revision=fp16 it is fp16.
  • When the weights are loaded using the slow method, the weights are always fp32 (unless specified with the dtype arg).

That's why we get the above error with revision=fp16 and mixed_precision="fp16".

To verify:

from diffusers import UNet2DConditionModel
unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet", revision="fp16", low_cpu_mem_usage=False)
unet2 = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet", revision="fp16", low_cpu_mem_usage=True)
print(unet.dtype) # torch.float32
print(unet2.dtype) # torch.float16

@patrickvonplaten I think this should be fixed in modeling_utils; should we make sure that, for both loading methods, the weights end up with the same dtype?

Also, @gadicc @jpiabrantes, fast_load is not a valid arg name; it has been changed to low_cpu_mem_usage, so we should not set fast_load. Setting low_cpu_mem_usage=False would be a good temporary solution.

Also, since we are doing mixed-precision training here, I would not recommend using the fp16 weights for training.
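
Concretely, a minimal sketch of the recommended setup (using the same checkpoint as above) is to load the trainable model from the full-precision weights and let mixed_precision="fp16" / autocast handle the half-precision compute:

from diffusers import UNet2DConditionModel

# Load the trainable model in full precision (no revision="fp16");
# autocast then takes care of the fp16 compute during training.
unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet"
)
print(unet.dtype)  # torch.float32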

patil-suraj avatar Nov 21 '22 15:11 patil-suraj

Ah brilliant, @patil-suraj, thanks so much! It all makes total sense (and I guess I should have read that second commit a bit more carefully :sweat_smile:, so thanks for clarifying). I'll leave the final call on which training weights to use to my users, but I think it will boil down to memory / speed / xformers (I got an xformers internal error trying to train with fp32; I haven't tried with fp16 yet, but it looked related. I'll have a chance to look into that properly tomorrow). Thanks so much for the quick turnaround! And noted that we can use fp32 weights to train with fp16 mixed precision, thanks!

gadicc avatar Nov 21 '22 15:11 gadicc

I got an xformers internal error trying to train with fp32

What was the error? I've been using xformers a lot for training, only with fp32, and it works perfectly in my setup.

patil-suraj avatar Nov 21 '22 16:11 patil-suraj

Thanks a lot for the nice repo, @patil-suraj! Let's fix this indeed :-)

patrickvonplaten avatar Nov 28 '22 11:11 patrickvonplaten

Here a PR to fix it: https://github.com/huggingface/diffusers/pull/1449

patrickvonplaten avatar Nov 28 '22 11:11 patrickvonplaten

ran into this also!

88stacks avatar Nov 28 '22 18:11 88stacks

Is this fixed now after #1449?

patrickvonplaten avatar Dec 01 '22 15:12 patrickvonplaten

Hi all, sorry for the radio silence... some time sensitive matters snuck up on me. I hope one of the other contributors to this issue can confirm the fix, otherwise I hope to have a chance to try this out on Sunday and promise to report back after.

Thank you both @patil-suraj and @patrickvonplaten for your amazing and quick work here! (And @patil-suraj, thanks - I did indeed get dreambooth working with fp32 too; it kind of fixed itself, but I think I had been loading one of the components with an incompatible model.)

:pray:

gadicc avatar Dec 01 '22 16:12 gadicc

No worries! If you could confirm, that would be nice, but no problem at all if you don't find the time!

patrickvonplaten avatar Dec 02 '22 16:12 patrickvonplaten

@patrickvonplaten thanks for the understanding and patience :pray:

Ok finally had a chance to try this out.

Unfortunately I'm still getting the same error :sweat_smile:

Interestingly enough, with

20ce68f945de7860f9854cd7ee680debf4a07fe5 Fix dtype model loading #1449

applied, the low_cpu_mem_usage=False workaround stops working too.

This is how I'm launching:

#!/bin/sh

export MODEL_NAME="CompVis/stable-diffusion-v1-4" # <--
export INSTANCE_DIR="instance_images"
export OUTPUT_DIR="output_dir"

accelerate launch ./train_dreambooth.py \
  --pretrained_model_name_or_path=$MODEL_NAME  \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --instance_prompt="a photo of sks dog" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=1 \
  --revision="fp16" \
  --mixed_precision="fp16" \
  --use_8bit_adam  # <-- the fp16 revision and fp16 mixed precision flags above are the relevant ones

gadicc avatar Dec 04 '22 07:12 gadicc

@gadicc #1449 won't really fix the issue you are having: with #1449 the dtype of the loaded model will match its saved dtype no matter the loading method. So if we load the weights from the fp16 branch, the weights will still be fp16 after loading, and that doesn't play well with mixed-precision training.

For training, I would recommend always using full-precision weights.

@patrickvonplaten To actually fix this, should we always cast the weights of trainable models to fp32 before starting training, or is it better to let it fail? IMO fp16 weights can create instability issues during training, especially for large training runs.
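
For illustration, the "always cast" option would roughly look like this (a sketch of the idea, not merged code):

import torch
from diffusers import UNet2DConditionModel

# Sketch: even if the fp16 revision was loaded, upcast the trainable model
# to fp32 before creating the optimizer so AMP gets fp32 master weights.
unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet", revision="fp16"
)
if unet.dtype != torch.float32:
    unet = unet.to(torch.float32)
print(unet.dtype)  # torch.float32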

patil-suraj avatar Dec 05 '22 15:12 patil-suraj

Thanks, all. With that in mind, I'm going to abandon this option and advise my users accordingly. I think fp32 weights with fp16 mixed_precision training is fine. Big thanks for all the clarifications and especially for sharing your (negative) experience when training with fp16 weights.

Depending on which route you take, I'd suggest:

  • If casting to fp32 (which is possibly better than failing), also show a warning explaining why this is ill-advised.
  • If not, rather than failing with the current error, show a more helpful error saying this use-case is not supported (and suggesting instead to use fp32 weights with fp16 mixed_precision).

Happy to do a PR for option 2 by the end of the week if that's the chosen direction. Not confident enough (yet) to help with option 1 :)

gadicc avatar Dec 05 '22 16:12 gadicc

From my side, we should just let the user have total control over the training example and follow our usual PyTorch-like API/logic.

This means:

  • By default, we should use the highest precision, least optimized training options
  • We allow all kinds of optimized training options (mixed precision, xformers, ....) and let the script fail if something is not done correctly

Also cc @pcuenca @williamberman

patrickvonplaten avatar Dec 12 '22 08:12 patrickvonplaten

  • By default, we should use the highest precision, least optimized training options
  • We allow all kinds of optimized training options (mixed precision, xformers, ....) and let the script fail if something is not done correctly

@patrickvonplaten when you say allow the script to fail, do you mean we throw an error when training with fp16 + amp, or we let it train and just have bad outputs? I think letting the script train makes sense, but I'd like it if we logged a warning :)

williamberman avatar Dec 20 '22 19:12 williamberman

Yeah, by "let it fail" I mean to throw a nice error to the user so that the user has instant feedback that something wasn't done correctly :-)

Think it's never a good idea to "let the user train and have bad outputs"

patrickvonplaten avatar Jan 02 '23 13:01 patrickvonplaten

Ok, I think I follow everything.

tl;dr: training/fine-tuning shouldn't be done with fp16 weights[^1]; fp16 inputs are OK with amp + gradient scaling. fp16 weights throw an error when used with amp + gradient scaling. We should check the dtype of the loaded model and throw an informative error before training begins.

I can put up a PR for this in the morning

[^1]: Precision issues arise when adding small gradient updates to fp16 weights. This is why amp training recommends keeping the weights in fp32 for gradient updates and using a half-precision copy for the forward and backward passes.
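
A rough sketch of the kind of guard I have in mind (names and wording are illustrative, not the merged code):

import torch

def check_trainable_dtype(model, mixed_precision: str):
    # Illustrative guard: mixed-precision training expects fp32 master weights,
    # so refuse to start if the trainable model was loaded in half precision.
    if mixed_precision == "fp16" and model.dtype == torch.float16:
        raise ValueError(
            "Trainable model weights are fp16. Load the model in fp32 (e.g. do not "
            "pass revision='fp16') and use --mixed_precision=fp16 for AMP instead."
        )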

williamberman avatar Jan 03 '23 05:01 williamberman

Since we merged the guard on loading low-precision weights, I'm going to close this issue :)

williamberman avatar Jan 13 '23 10:01 williamberman

I got a similar error when training LLMs. I am using float16 and loaded a LLaMA model, but training simply hit this error.

lucasjinreal avatar May 24 '23 03:05 lucasjinreal