CUDA out of memory when I want to train DreamBooth
Describe the bug
I'm using a T4 on the free Colab tier. When I start training I get a CUDA out of memory error; it happens when I activate prior_preservation.

Run training
Launching training on one GPU.
Steps: 0%| 1/450 [00:10<1:20:12, 10.72s/it, loss=0.0338]
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-2-c6e3ce5f5a40> in <module>
1 #@title Run training
2 import accelerate
----> 3 accelerate.notebook_launcher(training_function, args=(text_encoder, vae, unet))
4 with torch.no_grad():
5 torch.cuda.empty_cache()
7 frames
/usr/local/lib/python3.7/dist-packages/accelerate/launchers.py in notebook_launcher(function, args, num_processes, use_fp16, mixed_precision, use_port)
81 else:
82 print("Launching training on one CPU.")
---> 83 function(*args)
84
85 else:
<ipython-input-1-d9553ec566fc> in training_function(text_encoder, vae, unet)
364 loss = F.mse_loss(noise_pred, noise, reduction="none").mean([1, 2, 3]).mean()
365
--> 366 accelerator.backward(loss)
367 accelerator.clip_grad_norm_(unet.parameters(), args.max_grad_norm)
368 optimizer.step()
/usr/local/lib/python3.7/dist-packages/accelerate/accelerator.py in backward(self, loss, **kwargs)
882 self.scaler.scale(loss).backward(**kwargs)
883 else:
--> 884 loss.backward(**kwargs)
885
886 def unscale_gradients(self, optimizer=None):
/usr/local/lib/python3.7/dist-packages/torch/_tensor.py in backward(self, gradient, retain_graph, create_graph, inputs)
394 create_graph=create_graph,
395 inputs=inputs)
--> 396 torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
397
398 def register_hook(self, hook):
/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
173 Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
174 tensors, grad_tensors_, retain_graph, create_graph, inputs,
--> 175 allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
176
177 def grad(
/usr/local/lib/python3.7/dist-packages/torch/autograd/function.py in apply(self, *args)
251 "of them.")
252 user_fn = vjp_fn if vjp_fn is not Function.vjp else backward_fn
--> 253 return user_fn(self, *args)
254
255 def apply_jvp(self, *args):
/usr/local/lib/python3.7/dist-packages/torch/utils/checkpoint.py in backward(ctx, *args)
144 "none of output has requires_grad=True,"
145 " this checkpoint() is not necessary")
--> 146 torch.autograd.backward(outputs_with_grad, args_with_grad)
147 grads = tuple(inp.grad if isinstance(inp, torch.Tensor) else None
148 for inp in detached_inputs)
/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
173 Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
174 tensors, grad_tensors_, retain_graph, create_graph, inputs,
--> 175 allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
176
177 def grad(
RuntimeError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 0; 14.76 GiB total capacity; 12.24 GiB already allocated; 877.75 MiB free; 12.79 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
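As an aside, the allocator hint at the end of the error message refers to the `PYTORCH_CUDA_ALLOC_CONF` environment variable. Setting it before launching is a generic mitigation for fragmentation, not a fix for a model that simply does not fit; the value below is only an example:

```bash
# Optional: ask PyTorch's caching allocator to cap block splits to reduce fragmentation.
# 512 MiB is an illustrative value, not a recommendation from this thread.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
```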
Reproduction
No response
Logs
No response
System Info
T4 with colab free
I am getting the same error on an RTX 3090 (24GB) using the example script with diffusers.git@688031c592a08832387761971f1e6ca504a5900b:
RuntimeError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 0; 23.70 GiB total capacity; 20.93 GiB already allocated; 449.69 MiB free; 21.65 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
cc @patil-suraj
Hi @loboere, thanks for the issue! The T4 GPU has less than 16GB of VRAM, so it does not fit DreamBooth training with prior preservation; it should fit without prior preservation. This example should work fine on a P100/V100 Colab, so try resetting the Colab runtime and see if you get a P100.
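To quickly check which GPU (and how much memory) a Colab runtime was assigned before starting a run, something like this works (prefix with `!` in a notebook cell):

```bash
# Show the GPU model and total memory of the current runtime
nvidia-smi --query-gpu=name,memory.total --format=csv
```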
@skirsten Could you post the training arguments that you are using?
You should enable gradient_checkpointing and use_8bit_adam to be able to train on a 24GB GPU.
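For reference, a minimal launch command with both flags might look like the sketch below. Paths, prompts, and the model name are placeholders, and the exact argument set of `train_dreambooth.py` may differ between diffusers versions:

```bash
# Sketch only: placeholder paths/prompts; --use_8bit_adam requires bitsandbytes to be installed
accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="CompVis/stable-diffusion-v1-4" \
  --instance_data_dir="./instance_images" \
  --instance_prompt="a photo of sks dog" \
  --output_dir="./dreambooth-output" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --gradient_checkpointing \
  --use_8bit_adam \
  --learning_rate=5e-6 \
  --max_train_steps=400
```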
I played around with some of the settings and they indeed fix the "CUDA out of memory" problem on the RTX 3090 (24 GB):
| args | performance |
|---|---|
| `--gradient_accumulation_steps=1` | CUDA out of memory |
| `--gradient_accumulation_steps=1 --gradient_checkpointing` | 0.97 steps/s @ 21.9 GB |
| `--gradient_accumulation_steps=1 --use_8bit_adam` | 1.20 steps/s @ 23.4 GB :top: |
| `--gradient_accumulation_steps=1 --gradient_checkpointing --use_8bit_adam` | 1.06 steps/s @ 15.8 GB |
| `--gradient_accumulation_steps=2` | CUDA out of memory |
| `--gradient_accumulation_steps=2 --gradient_checkpointing` | 0.53 steps/s @ 21.9 GB |
| `--gradient_accumulation_steps=2 --use_8bit_adam` | 0.63 steps/s @ 23.4 GB |
| `--gradient_accumulation_steps=2 --gradient_checkpointing --use_8bit_adam` | 0.55 steps/s @ 15.8 GB |
| **using fp16** | |
| `--mixed_precision=fp16 --gradient_accumulation_steps=1` | CUDA out of memory |
| `--mixed_precision=fp16 --gradient_accumulation_steps=1 --gradient_checkpointing` | 1.16 steps/s @ 22.0 GB |
| `--mixed_precision=fp16 --gradient_accumulation_steps=1 --use_8bit_adam` | CUDA out of memory |
| `--mixed_precision=fp16 --gradient_accumulation_steps=1 --gradient_checkpointing --use_8bit_adam` | 1.30 steps/s @ 16.9 GB :top: |
| `--mixed_precision=fp16 --gradient_accumulation_steps=2` | CUDA out of memory |
| `--mixed_precision=fp16 --gradient_accumulation_steps=2 --gradient_checkpointing` | 0.64 steps/s @ 21.9 GB |
| `--mixed_precision=fp16 --gradient_accumulation_steps=2 --use_8bit_adam` | CUDA out of memory |
| `--mixed_precision=fp16 --gradient_accumulation_steps=2 --gradient_checkpointing --use_8bit_adam` | 0.68 steps/s @ 16.9 GB |
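For reference, the fastest configuration in the table above (marked :top:, roughly 1.30 steps/s at ~16.9 GB) corresponds to appending these flags to the launch command:

```bash
--mixed_precision=fp16 --gradient_accumulation_steps=1 --gradient_checkpointing --use_8bit_adam
```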
Thanks for this very nice table @skirsten!
Thanks a lot for the table @skirsten, this is very useful!
Also, with #735, mixed_precision should get a decent speed-up, and it should also allow you to use full Adam with a bigger batch size on the 3090!
I've been trying to keep my training from OOMing, and I am unable to make mixed precision work on DreamBooth.
/home/tcapelle/mambaforge/envs/dream/lib/python3.10/site-packages/accelerate/accelerator.py:179: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
Steps: 0%| | 0/4000 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/tcapelle/Apps/diffusers/examples/dreambooth/train_dreambooth.py", line 592, in <module>
main()
File "/home/tcapelle/Apps/diffusers/examples/dreambooth/train_dreambooth.py", line 557, in main
accelerator.backward(loss)
File "/home/tcapelle/mambaforge/envs/dream/lib/python3.10/site-packages/accelerate/accelerator.py", line 1005, in backward
self.scaler.scale(loss).backward(**kwargs)
File "/home/tcapelle/mambaforge/envs/dream/lib/python3.10/site-packages/torch/_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/tcapelle/mambaforge/envs/dream/lib/python3.10/site-packages/torch/autograd/__init__.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: Found dtype Half but expected Float
Hey @tcapelle, this is now fixed in #826.
It is!
Update on this issue...
I was getting the same error on free Colab; however, when I updated diffusers to 0.10.0 as recommended here, it worked with prior_preservation and no memory error.
Just change `pip install https://github.com/huggingface/diffusers.git` to `pip install diffusers==0.10.0`.
You can try selecting 'use LoRA' in the general settings panel.
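For those using the diffusers example scripts rather than a GUI, there is also a LoRA variant of the DreamBooth script (`train_dreambooth_lora.py` under `examples/dreambooth`), which needs far less memory. A minimal sketch, assuming its arguments mirror the main script; paths, prompts, and the learning rate are placeholders:

```bash
# Sketch only: LoRA DreamBooth fine-tuning with placeholder paths/prompts
accelerate launch train_dreambooth_lora.py \
  --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
  --instance_data_dir="./instance_images" \
  --instance_prompt="a photo of sks dog" \
  --output_dir="./dreambooth-lora-output" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=1e-4 \
  --max_train_steps=500
```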
@patrickvonplaten why did I still encounter a CUDA OOM error following your settings? Is it because I was training SDXL? I just used the code from the official diffusers examples to train SDXL LoRA DreamBooth, and my GPU is a 3090.