
ValueError: Query/Key/Value should either all have the same dtype when training LoRA

Open noskill opened this issue 1 year ago • 11 comments

Describe the bug

LoRA training doesn't work with mixed precision enabled:

File "/home/imgen/miniconda3/envs/py32/lib/python3.11/site-packages/xformers/ops/fmha/init.py", line 348, in _memory_efficient_at tention_forward_requires_grad inp.validate_inputs() File "/home/imgen/miniconda3/envs/py32/lib/python3.11/site-packages/xformers/ops/fmha/common.py", line 121, in validate_inputs raise ValueError( ValueError: Query/Key/Value should either all have the same dtype, or (in the quantized case) Key/Value should have dtype torch.int32 query.dtype: torch.float32 key.dtype : torch.float16 value.dtype: torch.float16

Full stack trace in the attachment.
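For reference, a minimal standalone snippet that hits the same validate_inputs check outside the training script. The shapes are made up and only the dtype mismatch matters; it assumes a CUDA machine with xformers installed:

import torch
import xformers.ops as xops

# Made-up shapes [batch, seq_len, heads, head_dim]; only the dtypes matter here.
q = torch.randn(1, 1024, 8, 64, device="cuda", dtype=torch.float32)  # e.g. an fp32 LoRA query projection
k = torch.randn(1, 1024, 8, 64, device="cuda", dtype=torch.float16)  # base weights under --mixed_precision=fp16
v = torch.randn(1, 1024, 8, 64, device="cuda", dtype=torch.float16)

try:
    xops.memory_efficient_attention(q, k, v)  # raises the ValueError shown above
except ValueError as e:
    print(e)

# Casting the query to the key/value dtype lets the same call go through.
out = xops.memory_efficient_attention(q.half(), k, v)
print(out.dtype)  # torch.float16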

Reproduction

export MODEL_NAME=/home/imgen/models/SDXL/juggernautXL_v8Rundiffusion/
export OUTPUT_DIR=`pwd`/poke11-lora
export DATASET_NAME="lambdalabs/pokemon-blip-captions"

accelerate launch examples/text_to_image/train_text_to_image_lora_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_NAME \
  --dataloader_num_workers=8 \
  --resolution=1024 \
  --center_crop \
  --random_flip \
  --enable_xformers_memory_efficient_attention \
  --mixed_precision=fp16 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --max_train_steps=15000 \
  --learning_rate=1e-04 \
  --max_grad_norm=1 \
  --lr_scheduler="cosine" \
  --lr_warmup_steps=0 \
  --output_dir=${OUTPUT_DIR} \
  --gradient_checkpointing \
  --use_8bit_adam \
  --checkpointing_steps=500 \
  --validation_prompt="A pokemon with red nose." \
  --seed=1337

Logs

log.txt

System Info

pytorch 2.1.2+cu118
diffusers 0.26.0.dev0
accelerate 0.26.1

Who can help?

@sayakpaul @patrickvonplaten

noskill avatar Feb 01 '24 14:02 noskill

It's a known problem when using xformers. I recommend building xformers from source to fix it.

sayakpaul avatar Feb 03 '24 04:02 sayakpaul

@sayakpaul I built xformers from source, but the issue is still present.

noskill avatar Feb 08 '24 19:02 noskill

Does it work with PyTorch 2.0.0?

sayakpaul avatar Feb 09 '24 01:02 sayakpaul

Looks the same with PyTorch 2.0.0 and xformers installed with pip install xformers==0.0.19:


  File "/home/imgen/miniconda3/envs/py31/lib/python3.10/site-packages/xformers/ops/fmha/__init__.py", line 317, in _memory_efficient_attention_forward_requires_grad
    inp.validate_inputs()
  File "/home/imgen/miniconda3/envs/py31/lib/python3.10/site-packages/xformers/ops/fmha/common.py", line 73, in validate_inputs
    raise ValueError(
ValueError: Query/Key/Value should all have the same dtype

Steps:   0%|                                                        | 0/15000 [00:04<?, ?it/s]
Traceback (most recent call last):
  File "/home/imgen/miniconda3/envs/py31/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/imgen/miniconda3/envs/py31/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/home/imgen/miniconda3/envs/py31/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1023, in launch_command
    simple_launcher(args)
  File "/home/imgen/miniconda3/envs/py31/lib/python3.10/site-packages/accelerate/commands/launch.py", line 643, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/imgen/miniconda3/envs/py31/bin/python', 'examples/text_to_image/train_text_to_image_lora_sdxl.py', '--pretrained_model_name_or_path=/home/imgen/models/SDXL/juggernautXL_v8Rundiffusion/', '--dataset_name=lambdalabs/pokemon-blip-captions', '--dataloader_num_workers=8', '--resolution=1024', '--center_crop', '--random_flip', '--enable_xformers_memory_efficient_attention', '--mixed_precision=fp16', '--train_batch_size=1', '--gradient_accumulation_steps=4', '--max_train_steps=15000', '--learning_rate=1e-04', '--max_grad_norm=1', '--lr_scheduler=cosine', '--lr_warmup_steps=0', '--output_dir=/home/imgen/projects/diffusers/poke11-lora', '--gradient_checkpointing', '--use_8bit_adam', '--checkpointing_steps=500', '--validation_prompt=A pokemon with red nose.', '--seed=1337']' returned non-zero exit status 1.
(py31) imgen@k6:~/projects/diffusers$ python3
Python 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'2.0.0+cu118'
>>> import xformers
>>> xformers.__version__
'0.0.19'
>>> 

noskill avatar Feb 09 '24 10:02 noskill

How about PyTorch 1.13?

I have seen people reporting the same error and resolving it with xformers built from source. Example: https://github.com/huggingface/accelerate/issues/2182#issuecomment-1864127640. If that doesn't solve the problem, it could very well be a recent PyTorch / xformers training incompatibility issue. Sadly, we don't have time to look into that right now.

sayakpaul avatar Feb 09 '24 10:02 sayakpaul

Do you happen to have a time frame for when you can look into that @sayakpaul?

JakobLS avatar Feb 11 '24 13:02 JakobLS

Sadly, no.

Multiple folks have concluded that it is a PyTorch/xFormers version/build issue that arises even outside of diffusers.

So, we need to be cognizant of that.

sayakpaul avatar Feb 11 '24 13:02 sayakpaul

Facing the same issue, using the following versions:

>>> torch.__version__
'2.2.0+cu121'
>>> import xformers
>>> xformers.__version__
'0.0.24'

This occurs at 200 steps, when the validation block runs.

rohit901 avatar Feb 13 '24 17:02 rohit901

Does it not happen when you run without xformers? Also, you're on the latest diffusers, training with peft, yeah?

sayakpaul avatar Feb 14 '24 01:02 sayakpaul

It doesn't happen without xformers. I also tried using pre-compiled binaries built against the same CUDA 11.8 for both PyTorch and xformers, but the issue persisted. Removing the xformers flag from the launch command did help. Yes, I'm using the latest diffusers and training with peft (the LoRA script given in the LCM/consistency distillation example).
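In case it helps others, one possible (untested) workaround sketch would be to switch the UNet off xformers only around the validation pass instead of dropping the flag entirely. Roughly something like the following; this is not the exact code from the script and assumes the script's unet variable and diffusers' AttnProcessor2_0:

from diffusers.models.attention_processor import AttnProcessor2_0

# Before the validation/inference block: fall back to the default PyTorch
# attention processor (the error does not occur without xformers, per above).
unet.set_attn_processor(AttnProcessor2_0())

# ... run the validation pipeline and log sample images here ...

# Afterwards: re-enable xformers for the remaining training steps.
unet.enable_xformers_memory_efficient_attention()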

rohit901 avatar Feb 14 '24 02:02 rohit901

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Mar 09 '24 15:03 github-actions[bot]

not stale

noskill avatar Mar 12 '24 18:03 noskill

any updates on this?

rohit901 avatar Mar 12 '24 18:03 rohit901

https://github.com/facebookresearch/xformers/issues/934 is where we're at. We cannot do much, sadly.

sayakpaul avatar Mar 13 '24 01:03 sayakpaul

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Apr 06 '24 15:04 github-actions[bot]