
Training LoRA with 2.1 possible on a 24 GB GPU?

Open Marcophono2 opened this issue 2 years ago • 2 comments

Hello!

I really would like to train LoRA on my 24 GB GPU, but I always get OOM. Is there any parameter or setting I missed that could help reduce the required VRAM?

accelerate launch --mixed_precision="bf16" training.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_NAME --caption_column="text" \
  --resolution=768 --random_flip \
  --train_batch_size=1 \
  --num_train_epochs=100 --checkpointing_steps=5000 \
  --learning_rate=1e-04 --lr_scheduler="constant" --lr_warmup_steps=0 \
  --seed=42 \
  --output_dir="/home/marc/Schreibtisch/AI/SD/marcLORA" \
  --validation_prompt="picture of marcophono playing chess" --report_to="wandb" \
  --num_validation_images=4 \
  --enable_xformers_memory_efficient_attention \
  --use_8bit_adam

Marc

Marcophono2 avatar Feb 16 '23 03:02 Marcophono2

Hey @Marcophono2,

Could you specify what exactly training.py refers to? Do you mean the DreamBooth LoRA training script?

patrickvonplaten avatar Feb 16 '23 14:02 patrickvonplaten

Sorry, @patrickvonplaten, I should have pointed this out. It's simply this file:

https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_lora.py

Marcophono2 avatar Feb 16 '23 20:02 Marcophono2

I see! I'm wondering whether your code OOMs during training or during inference.

Could you maybe try to remove this line:

--validation_prompt="picture of marcophono playing chess"

so that no evaluation is run, just so we can check whether this is the reason for the OOM.
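
For reference, that would just be the command from above with the validation arguments removed:

accelerate launch --mixed_precision="bf16" training.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_NAME --caption_column="text" \
  --resolution=768 --random_flip \
  --train_batch_size=1 \
  --num_train_epochs=100 --checkpointing_steps=5000 \
  --learning_rate=1e-04 --lr_scheduler="constant" --lr_warmup_steps=0 \
  --seed=42 \
  --output_dir="/home/marc/Schreibtisch/AI/SD/marcLORA" \
  --report_to="wandb" \
  --enable_xformers_memory_efficient_attention \
  --use_8bit_adam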

Also cc @sayakpaul here - it might be a good idea to see if we can get LoRA text-to-image to work for SD 2.x on a 24GB machine.

patrickvonplaten avatar Mar 06 '23 10:03 patrickvonplaten

Hi, I have the same problem. I cannot get LoRA to run on a 24GB GPU. The problem is not inference; it happens the first time "model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample" is run (actually, it happens in the attention processor).

I'm a bit confused: the number of parameters introduced by LoRA is not that large. Why does the method consume so much memory? Also, when I run inference (without training) the memory footprint is very low. So it seems that only the training routine with LoRA consumes this much memory.

Running the whole setup with 1.4 and a 568x568 image size works, but even then it consumes around 18GB of memory.

kaibioinfo avatar Mar 11 '23 23:03 kaibioinfo

I'm currently playing around a bit to narrow down the error. What I can say is:

  • I can run full fine-tuning (DreamBooth) on my 24GB GPU even when using float32. LoRA cannot be trained even when using float16.
  • The number of gradients that have to be stored in memory should be orders of magnitude smaller with LoRA than with DreamBooth. I checked that only the LoRA layers have gradients, and this is the case (a minimal way to check this is sketched after this list). So the huge memory consumption somehow comes from the LoRA cross-attention processor during training.
  • Using Stable Diffusion 1.4, the memory consumption is only around 4GB (in my last post I wrote 18GB, but I cannot reproduce that).
  • The LoRA layers in Stable Diffusion 1.4 and 2.1 are almost identical; only the 4x512 layers become 4x768 in SD 2.1, which makes total sense.
  • The total number of trainable parameters is 797184 in SD 1.4 and 829952 in SD 2.1.
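
A check like this can be done with plain PyTorch; a minimal sketch, assuming `unet` is the module used in the training script, with the base weights frozen and the LoRA processors attached:

# Count trainable vs. frozen parameters to confirm that only the
# LoRA layers carry gradients (plain PyTorch attribute check).
trainable, frozen = 0, 0
for name, param in unet.named_parameters():
    if param.requires_grad:
        trainable += param.numel()
    else:
        frozen += param.numel()
print(f"trainable params: {trainable:,} | frozen params: {frozen:,}")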

So there HAS TO BE a bug in training LoRA with SD 2.1.

Edit: When I use 2.1 with an input size of 512, it runs with around 6GB. Going from 512 -> 768 ends up in OOM (so more than 24GB).
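
One way to reproduce this kind of measurement is to read PyTorch's peak-memory counters around a single training step; a minimal sketch (the `step_fn` callable here is a hypothetical stand-in for one forward/backward pass of the training loop):

import torch

def report_peak_vram(step_fn, device="cuda"):
    # Reset the counter, run one training step, and report the peak
    # allocated VRAM in GB (torch.cuda book-keeping only).
    torch.cuda.reset_peak_memory_stats(device)
    step_fn()
    peak_gb = torch.cuda.max_memory_allocated(device) / 1024**3
    print(f"peak allocated VRAM: {peak_gb:.2f} GB")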

kaibioinfo avatar Mar 12 '23 11:03 kaibioinfo

I think I solved the problem. The script is not using the xformers variant of the LoRA cross-attention processor. I will try to make a fix.

kaibioinfo avatar Mar 12 '23 11:03 kaibioinfo

Here is the pull request: https://github.com/huggingface/diffusers/pull/2648
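
For readers who just want the gist: the idea is to pick the xformers variant of the LoRA attention processor when building the processor dict, roughly as in the sketch below. The class names (`LoRACrossAttnProcessor`, `LoRAXFormersCrossAttnProcessor` in `diffusers.models.cross_attention`) assume the diffusers release current at the time; see the PR above for the exact change.

from diffusers.models.cross_attention import (
    LoRACrossAttnProcessor,
    LoRAXFormersCrossAttnProcessor,
)

def build_lora_attn_procs(unet, rank=4, use_xformers=True):
    # Use the xformers LoRA processor when memory-efficient attention
    # is enabled, otherwise fall back to the vanilla LoRA processor.
    lora_attn_procs = {}
    for name in unet.attn_processors.keys():
        # Self-attention layers ("attn1") have no cross-attention dimension.
        cross_attention_dim = (
            None if name.endswith("attn1.processor") else unet.config.cross_attention_dim
        )
        if name.startswith("mid_block"):
            hidden_size = unet.config.block_out_channels[-1]
        elif name.startswith("up_blocks"):
            block_id = int(name[len("up_blocks.")])
            hidden_size = list(reversed(unet.config.block_out_channels))[block_id]
        else:  # down_blocks
            block_id = int(name[len("down_blocks.")])
            hidden_size = unet.config.block_out_channels[block_id]

        processor_cls = (
            LoRAXFormersCrossAttnProcessor if use_xformers else LoRACrossAttnProcessor
        )
        lora_attn_procs[name] = processor_cls(
            hidden_size=hidden_size, cross_attention_dim=cross_attention_dim, rank=rank
        )
    return lora_attn_procs

# unet.set_attn_processor(build_lora_attn_procs(unet, use_xformers=True))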

kaibioinfo avatar Mar 12 '23 12:03 kaibioinfo

Thanks a lot! Reviewed your PR.

sayakpaul avatar Mar 13 '23 06:03 sayakpaul

Hi, it seems that the same problem also occurs in the train_text_to_image.py script. I am using a SLURM cluster with 4 K80s with 48GB of RAM in a single node.

  File "train_text_to_image.py", line 729, in main
    model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
  File "/home/mpozzi/miniconda3/envs/diffusion/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
[...]
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 30.00 MiB (GPU 1; 11.17 GiB total capacity; 10.63 GiB already allocated; 12.1>

I have also tried reducing the resolution of my images a lot, but it doesn't help. Here is my training script:

accelerate launch --mixed_precision="fp16" --num_processes 4  --num_machines 1 --multi_gpu  --gpu_ids $CUDA_VISIBLE_DEVICES\
   train_text_to_image.py\
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir=$TRAIN_DIR \
  --use_ema \
  --resolution=56 --center_crop --random_flip \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --max_train_steps=15000 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --lr_scheduler="constant" --lr_warmup_steps=0 \
  --output_dir="prova"

Should I use the LoRA script?

Mat-Po avatar Mar 16 '23 11:03 Mat-Po

You could try the LoRA script. However, I am not sure the K80 is a good fit here, since we cannot take advantage of FP16 computation (the K80 doesn't have tensor cores to speed up computation further).

Maybe using DeepSpeed CPU offloading would be a better choice if LoRA is out of the question. @williamberman do you have any suggestions here?

sayakpaul avatar Mar 16 '23 11:03 sayakpaul

@Mat-Po I'm not familiar with the memory requirements for our LoRA training scripts, but I would recommend looking at the documented memory requirements for the other training scripts for combinations of flags to enable. For a 12 GB GPU, I would at a minimum recommend additionally enabling xformers and the 8-bit optimizer. Leaving EMA enabled is probably fine.
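
For illustration, that would mean adding the corresponding flags to your launch command (flag names taken from the commands earlier in this thread; not a tested configuration):

accelerate launch --mixed_precision="fp16" --num_processes 4 --num_machines 1 --multi_gpu --gpu_ids $CUDA_VISIBLE_DEVICES \
  train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir=$TRAIN_DIR \
  --use_ema \
  --resolution=56 --center_crop --random_flip \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --enable_xformers_memory_efficient_attention \
  --use_8bit_adam \
  --max_train_steps=15000 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --lr_scheduler="constant" --lr_warmup_steps=0 \
  --output_dir="prova"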

Also note that the 4-GPU case doesn't actually give you 48 GB of total RAM for a forward pass of a single training sample. Accelerate's multi-GPU mode runs data parallel, so the whole model is replicated on every GPU. For a single forward pass with a batch size of one, you are still limited by the maximum available memory on a single GPU.
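
To make that concrete: each data-parallel replica must fit the full model on its own card, so the number that matters is per-device memory, not the sum across devices. A quick way to check it with plain PyTorch:

import torch

# Print per-GPU memory: under data parallelism each replica is limited
# by its own device, not by the total across all devices.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GB")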

It would not surprise me if 12GB of RAM is not enough with the given configuration, which looks like it only has gradient checkpointing enabled. For context, ControlNet training with just gradient accumulation still requires 20GB of VRAM for a batch size of 1.

williamberman avatar Mar 21 '23 02:03 williamberman

I'm going to close the issue, as it sounds like the xformers PR fixed the original memory issue. Anyone, feel free to re-open or open a new issue for memory issues with LoRA training.

williamberman avatar Mar 21 '23 02:03 williamberman

Thanks @williamberman and @sayakpaul. An update here: I can confirm that using the LoRA script worked out just fine. I managed to make it run on two 1080s with 8GB of RAM, so, considering what William said, 8 GB is enough for LoRA while definitely not enough for the train_text_to_image.py script.

Mat-Po avatar Mar 21 '23 09:03 Mat-Po