
ValueError: Query/Key/Value should either all have the same dtype when training LoRA

Open noskill opened this issue 1 year ago • 11 comments

Describe the bug

LoRA training doesn't work with mixed precision enabled:

File "/home/imgen/miniconda3/envs/py32/lib/python3.11/site-packages/xformers/ops/fmha/init.py", line 348, in _memory_efficient_at tention_forward_requires_grad inp.validate_inputs() File "/home/imgen/miniconda3/envs/py32/lib/python3.11/site-packages/xformers/ops/fmha/common.py", line 121, in validate_inputs raise ValueError( ValueError: Query/Key/Value should either all have the same dtype, or (in the quantized case) Key/Value should have dtype torch.int32 query.dtype: torch.float32 key.dtype : torch.float16 value.dtype: torch.float16

Full stack trace in the attachment.
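For reference, a minimal standalone snippet that hits the same validate_inputs check outside the training script. The shapes are made up and only the dtype mismatch matters; it assumes a CUDA machine with xformers installed:

import torch
import xformers.ops as xops

# Made-up shapes [batch, seq_len, heads, head_dim]; only the dtypes matter here.
q = torch.randn(1, 1024, 8, 64, device="cuda", dtype=torch.float32)  # e.g. an fp32 LoRA query projection
k = torch.randn(1, 1024, 8, 64, device="cuda", dtype=torch.float16)  # base weights under --mixed_precision=fp16
v = torch.randn(1, 1024, 8, 64, device="cuda", dtype=torch.float16)

try:
    xops.memory_efficient_attention(q, k, v)  # raises the ValueError shown above
except ValueError as e:
    print(e)

# Casting the query to the key/value dtype lets the same call go through.
out = xops.memory_efficient_attention(q.half(), k, v)
print(out.dtype)  # torch.float16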

Reproduction

export MODEL_NAME=/home/imgen/models/SDXL/juggernautXL_v8Rundiffusion/
export OUTPUT_DIR=`pwd`/poke11-lora
export DATASET_NAME="lambdalabs/pokemon-blip-captions"

accelerate launch examples/text_to_image/train_text_to_image_lora_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_NAME \
  --dataloader_num_workers=8 \
  --resolution=1024 \
  --center_crop \
  --random_flip \
  --enable_xformers_memory_efficient_attention \
  --mixed_precision=fp16 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --max_train_steps=15000 \
  --learning_rate=1e-04 \
  --max_grad_norm=1 \
  --lr_scheduler="cosine" \
  --lr_warmup_steps=0 \
  --output_dir=${OUTPUT_DIR} \
  --gradient_checkpointing \
  --use_8bit_adam \
  --checkpointing_steps=500 \
  --validation_prompt="A pokemon with red nose." \
  --seed=1337

Logs

log.txt

System Info

pytorch 2.1.2+cu118
diffusers 0.26.0.dev0
accelerate 0.26.1

Who can help?

@sayakpaul @patrickvonplaten

noskill avatar Feb 01 '24 14:02 noskill

It's a known problem when using xformers. I recommend building xformers from source to fix it.

sayakpaul avatar Feb 03 '24 04:02 sayakpaul

@sayakpaul I built xformers from source, but the issue is still present.

noskill avatar Feb 08 '24 19:02 noskill

Does it work with PyTorch 2.0.0?

sayakpaul avatar Feb 09 '24 01:02 sayakpaul

Looks the same with PyTorch 2.0.0 and xformers installed with pip install xformers==0.0.19:


  File "/home/imgen/miniconda3/envs/py31/lib/python3.10/site-packages/xformers/ops/fmha/__init__.py", line 317, in _memory_efficient_attention_forward_requires_grad
    inp.validate_inputs()
  File "/home/imgen/miniconda3/envs/py31/lib/python3.10/site-packages/xformers/ops/fmha/common.py", line 73, in validate_inputs
    raise ValueError(
ValueError: Query/Key/Value should all have the same dtype

Steps:   0%|                                                        | 0/15000 [00:04<?, ?it/s]
Traceback (most recent call last):
  File "/home/imgen/miniconda3/envs/py31/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/imgen/miniconda3/envs/py31/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/home/imgen/miniconda3/envs/py31/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1023, in launch_command
    simple_launcher(args)
  File "/home/imgen/miniconda3/envs/py31/lib/python3.10/site-packages/accelerate/commands/launch.py", line 643, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/imgen/miniconda3/envs/py31/bin/python', 'examples/text_to_image/train_text_to_image_lora_sdxl.py', '--pretrained_model_name_or_path=/home/imgen/models/SDXL/juggernautXL_v8Rundiffusion/', '--dataset_name=lambdalabs/pokemon-blip-captions', '--dataloader_num_workers=8', '--resolution=1024', '--center_crop', '--random_flip', '--enable_xformers_memory_efficient_attention', '--mixed_precision=fp16', '--train_batch_size=1', '--gradient_accumulation_steps=4', '--max_train_steps=15000', '--learning_rate=1e-04', '--max_grad_norm=1', '--lr_scheduler=cosine', '--lr_warmup_steps=0', '--output_dir=/home/imgen/projects/diffusers/poke11-lora', '--gradient_checkpointing', '--use_8bit_adam', '--checkpointing_steps=500', '--validation_prompt=A pokemon with red nose.', '--seed=1337']' returned non-zero exit status 1.
(py31) imgen@k6:~/projects/diffusers$ python3
Python 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'2.0.0+cu118'
>>> import xformers
>>> xformers.__version__
'0.0.19'
>>> 

noskill avatar Feb 09 '24 10:02 noskill

How about PyTorch 1.13?

I have seen people reporting the same error and resolving it with xformers built from source. Example: https://github.com/huggingface/accelerate/issues/2182#issuecomment-1864127640. If that doesn't solve the problem, it could very well be a recent PyTorch / xformers training incompatibility issue. Sadly, we don't have time to look into that right now.

sayakpaul avatar Feb 09 '24 10:02 sayakpaul

Do you happen to have a time frame for when you can look into that @sayakpaul?

JakobLS avatar Feb 11 '24 13:02 JakobLS

Sadly, no.

Multiple folks have concluded that it is a PyTorch/xFormers version/build issue that arises even outside of diffusers.

So, we need to be cognizant of that.

sayakpaul avatar Feb 11 '24 13:02 sayakpaul

Facing the same issue, using the following versions:

>>> torch.__version__
'2.2.0+cu121'
>>> import xformers
>>> xformers.__version__
'0.0.24'

This occurs at 200 steps, when the validation block runs.

rohit901 avatar Feb 13 '24 17:02 rohit901

Does it not happen when you run without xformers? Also, you're on the latest diffusers, training with peft, yeah?

sayakpaul avatar Feb 14 '24 01:02 sayakpaul

It doesn't happen without xformers. I also tried using pre-compiled binaries built against the same CUDA 11.8 for both PyTorch and xformers, but the issue persisted. Removing the xformers flag from the launch command did help. Yes, I'm using the latest diffusers and training with peft (the LoRA script given in the LCM/consistency distillation example).
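In case it helps others, one possible (untested) workaround sketch would be to switch the UNet off xformers only around the validation pass instead of dropping the flag entirely. Roughly something like the following; this is not the exact code from the script and assumes the script's unet variable and diffusers' AttnProcessor2_0:

from diffusers.models.attention_processor import AttnProcessor2_0

# Before the validation/inference block: fall back to the default PyTorch
# attention processor (the error does not occur without xformers, per above).
unet.set_attn_processor(AttnProcessor2_0())

# ... run the validation pipeline and log sample images here ...

# Afterwards: re-enable xformers for the remaining training steps.
unet.enable_xformers_memory_efficient_attention()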

rohit901 avatar Feb 14 '24 02:02 rohit901

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Mar 09 '24 15:03 github-actions[bot]

not stale

noskill avatar Mar 12 '24 18:03 noskill

any updates on this?

rohit901 avatar Mar 12 '24 18:03 rohit901

https://github.com/facebookresearch/xformers/issues/934 is where we're at. We cannot do much, sadly.

sayakpaul avatar Mar 13 '24 01:03 sayakpaul

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Apr 06 '24 15:04 github-actions[bot]