
CUDA out of memory when i want to train dreambooth

Open loboere opened this issue 3 years ago • 6 comments

Describe the bug

I'm using a T4 on free Colab. When I start training it throws a CUDA out-of-memory error; it happens when I activate prior_preservation.


Run training

Launching training on one GPU.
Steps: 0% | 1/450 [00:10<1:20:12, 10.72s/it, loss=0.0338]
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-2-c6e3ce5f5a40> in <module>
      1 #@title Run training
      2 import accelerate
----> 3 accelerate.notebook_launcher(training_function, args=(text_encoder, vae, unet))
      4 with torch.no_grad():
      5     torch.cuda.empty_cache()

/usr/local/lib/python3.7/dist-packages/accelerate/launchers.py in notebook_launcher(function, args, num_processes, use_fp16, mixed_precision, use_port)
     81             else:
     82                 print("Launching training on one CPU.")
---> 83             function(*args)
     84 
     85     else:

<ipython-input-1-d9553ec566fc> in training_function(text_encoder, vae, unet)
    364                     loss = F.mse_loss(noise_pred, noise, reduction="none").mean([1, 2, 3]).mean()
    365 
--> 366                 accelerator.backward(loss)
    367                 accelerator.clip_grad_norm_(unet.parameters(), args.max_grad_norm)
    368                 optimizer.step()

/usr/local/lib/python3.7/dist-packages/accelerate/accelerator.py in backward(self, loss, **kwargs)
    882             self.scaler.scale(loss).backward(**kwargs)
    883         else:
--> 884             loss.backward(**kwargs)
    885 
    886     def unscale_gradients(self, optimizer=None):

/usr/local/lib/python3.7/dist-packages/torch/_tensor.py in backward(self, gradient, retain_graph, create_graph, inputs)
    394                 create_graph=create_graph,
    395                 inputs=inputs)
--> 396         torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
    397 
    398     def register_hook(self, hook):

/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    173     Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    174         tensors, grad_tensors_, retain_graph, create_graph, inputs,
--> 175         allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
    176 
    177 def grad(

/usr/local/lib/python3.7/dist-packages/torch/autograd/function.py in apply(self, *args)
    251                                "of them.")
    252         user_fn = vjp_fn if vjp_fn is not Function.vjp else backward_fn
--> 253         return user_fn(self, *args)
    254 
    255     def apply_jvp(self, *args):

/usr/local/lib/python3.7/dist-packages/torch/utils/checkpoint.py in backward(ctx, *args)
    144                 "none of output has requires_grad=True,"
    145                 " this checkpoint() is not necessary")
--> 146         torch.autograd.backward(outputs_with_grad, args_with_grad)
    147         grads = tuple(inp.grad if isinstance(inp, torch.Tensor) else None
    148                       for inp in detached_inputs)

/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    173     Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    174         tensors, grad_tensors_, retain_graph, create_graph, inputs,
--> 175         allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
    176 
    177 def grad(

RuntimeError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 0; 14.76 GiB total capacity; 12.24 GiB already allocated; 877.75 MiB free; 12.79 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Reproduction

No response

Logs

No response

System Info

T4 with colab free

loboere avatar Oct 02 '22 16:10 loboere

I am getting the same error on an RTX 3090 (24 GB) using the example script with diffusers.git@688031c592a08832387761971f1e6ca504a5900b:

RuntimeError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 0; 23.70 GiB total capacity; 20.93 GiB already allocated; 449.69 MiB free; 21.65 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

skirsten avatar Oct 03 '22 15:10 skirsten

cc @patil-suraj

patrickvonplaten avatar Oct 04 '22 12:10 patrickvonplaten

Hi @loboere, thanks for the issue! The T4 GPU has less than 16GB VRAM, so it does not fit DreamBooth training with prior preservation. It should fit without prior preservation. This example should work fine on a P100/V100 Colab; try resetting the Colab and see if you get a P100.

@skirsten Could you post the training arguments that you are using? You should enable gradient_checkpointing and use_8bit_adam to be able to train on a 24GB GPU.
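
For reference, those two flags just get appended to the usual launch command, e.g. (a minimal sketch; the model name, prompt, and paths are placeholders):

# Minimal sketch; model name, prompt, and paths are placeholders.
# --use_8bit_adam requires the bitsandbytes package.
accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="CompVis/stable-diffusion-v1-4" \
  --instance_data_dir="./instance_images" \
  --output_dir="./dreambooth-model" \
  --instance_prompt="a photo of sks dog" \
  --resolution=512 \
  --train_batch_size=1 \
  --learning_rate=5e-6 \
  --max_train_steps=400 \
  --gradient_checkpointing \
  --use_8bit_adam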

patil-suraj avatar Oct 05 '22 10:10 patil-suraj

I played around with some of the settings and they indeed fix the "CUDA out of memory" problem on the RTX 3090 (24 GB):

args | performance
--- | ---
--gradient_accumulation_steps=1 | CUDA out of memory
--gradient_accumulation_steps=1 --gradient_checkpointing | 0.97 steps/s @ 21.9 GB
--gradient_accumulation_steps=1 --use_8bit_adam | 1.20 steps/s @ 23.4 GB :top:
--gradient_accumulation_steps=1 --gradient_checkpointing --use_8bit_adam | 1.06 steps/s @ 15.8 GB
--gradient_accumulation_steps=2 | CUDA out of memory
--gradient_accumulation_steps=2 --gradient_checkpointing | 0.53 steps/s @ 21.9 GB
--gradient_accumulation_steps=2 --use_8bit_adam | 0.63 steps/s @ 23.4 GB
--gradient_accumulation_steps=2 --gradient_checkpointing --use_8bit_adam | 0.55 steps/s @ 15.8 GB

Using fp16:

args | performance
--- | ---
--mixed_precision=fp16 --gradient_accumulation_steps=1 | CUDA out of memory
--mixed_precision=fp16 --gradient_accumulation_steps=1 --gradient_checkpointing | 1.16 steps/s @ 22.0 GB
--mixed_precision=fp16 --gradient_accumulation_steps=1 --use_8bit_adam | CUDA out of memory
--mixed_precision=fp16 --gradient_accumulation_steps=1 --gradient_checkpointing --use_8bit_adam | 1.30 steps/s @ 16.9 GB :top:
--mixed_precision=fp16 --gradient_accumulation_steps=2 | CUDA out of memory
--mixed_precision=fp16 --gradient_accumulation_steps=2 --gradient_checkpointing | 0.64 steps/s @ 21.9 GB
--mixed_precision=fp16 --gradient_accumulation_steps=2 --use_8bit_adam | CUDA out of memory
--mixed_precision=fp16 --gradient_accumulation_steps=2 --gradient_checkpointing --use_8bit_adam | 0.68 steps/s @ 16.9 GB
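
Spelled out, the best fp16 row corresponds to a launch command roughly like this (a sketch only; model name, prompts, and paths are placeholders, not the exact values I ran):

# Sketch of the 1.30 steps/s row above; requires bitsandbytes for --use_8bit_adam.
# Model name, prompts, and paths are placeholders.
accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="CompVis/stable-diffusion-v1-4" \
  --instance_data_dir="./instance_images" \
  --class_data_dir="./class_images" \
  --with_prior_preservation \
  --instance_prompt="a photo of sks dog" \
  --class_prompt="a photo of dog" \
  --resolution=512 \
  --train_batch_size=1 \
  --mixed_precision=fp16 \
  --gradient_accumulation_steps=1 \
  --gradient_checkpointing \
  --use_8bit_adam \
  --output_dir="./dreambooth-model"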

skirsten avatar Oct 06 '22 02:10 skirsten


Thanks for this very nice table @skirsten!

patrickvonplaten avatar Oct 07 '22 13:10 patrickvonplaten

Thanks a lot for the table @skirsten, this is very useful! Also, with #735, mixed_precision should get a decent speed-up, and it should also allow you to use full Adam with a bigger batch size on a 3090!
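
For example, with that change something along these lines should fit on a 24 GB card (an untested sketch; the model name, prompt, paths, and batch size are placeholder assumptions):

# Untested sketch: fp16 + gradient checkpointing with regular AdamW (no --use_8bit_adam)
# and a larger batch size. Model name, prompt, and paths are placeholders.
accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="CompVis/stable-diffusion-v1-4" \
  --instance_data_dir="./instance_images" \
  --output_dir="./dreambooth-model" \
  --instance_prompt="a photo of sks dog" \
  --resolution=512 \
  --mixed_precision=fp16 \
  --gradient_checkpointing \
  --train_batch_size=2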

patil-suraj avatar Oct 10 '22 12:10 patil-suraj

I've been trying to keep my training from running out of memory, and I am unable to make mixed precision work on DreamBooth.

/home/tcapelle/mambaforge/envs/dream/lib/python3.10/site-packages/accelerate/accelerator.py:179: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
  warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
Steps:   0%|                                                                                                                                                                                                                        | 0/4000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/tcapelle/Apps/diffusers/examples/dreambooth/train_dreambooth.py", line 592, in <module>
    main()
  File "/home/tcapelle/Apps/diffusers/examples/dreambooth/train_dreambooth.py", line 557, in main
    accelerator.backward(loss)
  File "/home/tcapelle/mambaforge/envs/dream/lib/python3.10/site-packages/accelerate/accelerator.py", line 1005, in backward
    self.scaler.scale(loss).backward(**kwargs)
  File "/home/tcapelle/mambaforge/envs/dream/lib/python3.10/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/tcapelle/mambaforge/envs/dream/lib/python3.10/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Found dtype Half but expected Float

tcapelle avatar Oct 12 '22 15:10 tcapelle

Hey @tcapelle this is now fixed in #826

patil-suraj avatar Oct 20 '22 12:10 patil-suraj

it is!

tcapelle avatar Oct 20 '22 12:10 tcapelle

Update on this issue: I was getting the same error on free Colab; however, when I updated diffusers to 0.10.0 as recommended here, it worked with prior_preservation and no memory error.

Just change pip install git+https://github.com/huggingface/diffusers.git to pip install diffusers==0.10.0.
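
In other words, in the notebook's install cell (a sketch; your cell may pin other packages too):

# Before: installing from the git repository
pip install git+https://github.com/huggingface/diffusers.git
# After: pin the release that worked for me on free Colab
pip install diffusers==0.10.0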

FelippeChemello avatar Dec 27 '22 02:12 FelippeChemello

You can try selecting "Use LoRA" in the general settings panel.

jarrowkidd avatar Mar 13 '23 13:03 jarrowkidd


@patrickvonplaten why did I still encounter a CUDA OOM error following the settings in the table above? Is it because I was training SDXL? I just used the code from the official diffusers examples to train an SDXL DreamBooth LoRA, and my GPU is a 3090.

joey0922 avatar Apr 19 '24 06:04 joey0922