
Following the examples/dreambooth/README_flux.md guide for setup and training, got CUDA OOM with a 3090 Ti 24GB

Open · riflemanl opened this issue 1 year ago · 1 comment

Describe the bug

Followed the examples/dreambooth/README_flux.md guide for setup and training, and got a CUDA OOM on a 3090 Ti with 24 GB of VRAM.

Reproduction

- 256 GB system RAM
- 3090 Ti, 24 GB VRAM
- torch 2.4.1 + CUDA 12.1
- `export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`
- accelerate==1.0.1
- transformers==4.45.2

Logs

2024-10-21 22:23:00.221007: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-10-21 22:23:00.231181: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-10-21 22:23:00.243106: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-10-21 22:23:00.246839: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-10-21 22:23:00.256022: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-10-21 22:23:01.042086: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
10/21/2024 22:23:01 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: bf16

You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type t5 to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
Downloading shards: 100%|██████████| 2/2 [00:00<00:00, 11602.50it/s]
Loading checkpoint shards: 100%|██████████| 2/2 [00:06<00:00,  3.39s/it]
Fetching 3 files: 100%|██████████| 3/3 [00:00<00:00, 7227.40it/s]
{'axes_dims_rope'} was not found in config. Values will be initialized to default values.
Traceback (most recent call last):
  File "/mnt/sat/ai/diffusers-train/diffusers/examples/dreambooth/train_dreambooth_lora_flux.py", line 1892, in <module>
    main(args)
  File "/mnt/sat/ai/diffusers-train/diffusers/examples/dreambooth/train_dreambooth_lora_flux.py", line 1182, in main
    text_encoder_two.to(accelerator.device, dtype=weight_dtype)
  File "/mnt/sat/ai/diffusers-train/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2958, in to
    return super().to(*args, **kwargs)
  File "/mnt/sat/ai/diffusers-train/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1174, in to
    return self._apply(convert)
  File "/mnt/sat/ai/diffusers-train/lib/python3.10/site-packages/torch/nn/modules/module.py", line 780, in _apply
    module._apply(fn)
  File "/mnt/sat/ai/diffusers-train/lib/python3.10/site-packages/torch/nn/modules/module.py", line 780, in _apply
    module._apply(fn)
  File "/mnt/sat/ai/diffusers-train/lib/python3.10/site-packages/torch/nn/modules/module.py", line 780, in _apply
    module._apply(fn)
  [Previous line repeated 4 more times]
  File "/mnt/sat/ai/diffusers-train/lib/python3.10/site-packages/torch/nn/modules/module.py", line 805, in _apply
    param_applied = fn(param)
  File "/mnt/sat/ai/diffusers-train/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1160, in convert
    return t.to(
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 23.69 GiB of which 13.25 MiB is free. Including non-PyTorch memory, this process has 23.62 GiB memory in use. Of the allocated memory 23.36 GiB is allocated by PyTorch, and 16.20 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):
  File "/mnt/sat/ai/diffusers-train/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/mnt/sat/ai/diffusers-train/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/mnt/sat/ai/diffusers-train/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1168, in launch_command
    simple_launcher(args)
  File "/mnt/sat/ai/diffusers-train/lib/python3.10/site-packages/accelerate/commands/launch.py", line 763, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/mnt/sat/ai/diffusers-train/bin/python', 'train_dreambooth_lora_flux.py', '--pretrained_model_name_or_path=black-forest-labs/FLUX.1-dev', '--instance_data_dir=../../../../SD-Downloads/AnnieOnly1024', '--output_dir=lora-flux', '--mixed_precision=bf16', '--instance_prompt=sk3anni3', '--resolution=1024', '--train_batch_size=1', '--guidance_scale=1', '--gradient_accumulation_steps=4', '--optimizer=prodigy', '--learning_rate=1e-1', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--max_train_steps=500', '--validation_prompt=sk3anni3 in apartment', '--validation_epochs=25', '--seed=0']' returned non-zero exit status 1.

System Info

The diffusers version is the latest main-branch code as of today, 2024-10-21, because the previous release tag does not yet support DreamBooth Flux LoRA training.

Who can help?

No response

riflemanl avatar Oct 21 '24 14:10 riflemanl

I don't think that the Flux DreamBooth training scripts are memory-optimized out of the box. You could try running them with DeepSpeed and enabling gradient checkpointing, which should lower the memory requirements by a lot. For serious training experiments, we recommend something like SimpleTuner, which uses diffusers as a backend, supports many important training-related components, and is memory-efficient.
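For reference, a rough sketch of what that could look like, reusing the command from the logs above. The `--gradient_checkpointing` flag is an assumption based on the other diffusers DreamBooth scripts, so please verify it against the script's `--help` first:

```shell
# Same training run as reported above, with gradient checkpointing turned on.
# The --gradient_checkpointing flag is assumed to exist in
# train_dreambooth_lora_flux.py, as in the other DreamBooth scripts.
accelerate launch train_dreambooth_lora_flux.py \
  --pretrained_model_name_or_path="black-forest-labs/FLUX.1-dev" \
  --instance_data_dir="$INSTANCE_DIR" \
  --output_dir="$OUTPUT_DIR" \
  --mixed_precision="bf16" \
  --instance_prompt="sk3anni3" \
  --resolution=1024 \
  --train_batch_size=1 \
  --guidance_scale=1 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --optimizer="prodigy" \
  --learning_rate=1e-1 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=500 \
  --validation_prompt="sk3anni3 in apartment" \
  --validation_epochs=25 \
  --seed="0"
```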

a-r-r-o-w avatar Oct 24 '24 03:10 a-r-r-o-w

You could also give our quantization example a try and let us know how it goes: https://github.com/huggingface/diffusers/tree/main/examples/research_projects/flux_lora_quantization

sayakpaul avatar Nov 01 '24 03:11 sayakpaul

Can you try #9829? I have saved memory by implementing this :)

leisuzz avatar Nov 01 '24 05:11 leisuzz

> @a-r-r-o-w wrote:
>
> I don't think that the Flux DreamBooth training scripts are memory-optimized out of the box. You could try running them with DeepSpeed and enabling gradient checkpointing, which should lower the memory requirements by a lot. For serious training experiments, we recommend something like SimpleTuner, which uses diffusers as a backend, supports many important training-related components, and is memory-efficient.

SimpleTuner easily got stuck when running `$ poetry install` ... so I'm giving up on it.

riflemanl avatar Nov 02 '24 12:11 riflemanl

> Can you try #9829? I have saved memory by implementing this :)

Just checked the changes, looks awesome! It should be helpful. But I already trained with ai-toolkit on a single GPU a few weeks ago, and I'm now struggling with diffusers inference for Flux + LoRA + ControlNet openpose, which also hits OOM; I probably need fp8 or schnell for that, and I'll create another ticket for that issue soon. I'll be back to test and verify this in a few days.

riflemanl avatar Nov 02 '24 13:11 riflemanl

@riflemanl I used bf16 with DeepSpeed and accelerate; it should work. Another factor is that FLUX is a 12B-parameter model, so it costs a lot of memory.
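For rough context on the numbers: 12 billion parameters in bf16 is about 12e9 × 2 bytes ≈ 24 GB for the transformer weights alone, before the two text encoders, the VAE, activations, gradients, and optimizer state, so a single 24 GB card has essentially no headroom without offloading, checkpointing, or quantization.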

leisuzz avatar Nov 03 '24 09:11 leisuzz

@leisuzz: Oh!? I just tried the patch and it still hits OOM. I see that you set `images = None` and `del pipeline` at the end, but I get the OOM at the beginning of training, while VRAM is being allocated: watching nvidia-smi, memory usage goes from 1 GB to 24 GB within about 5 seconds and then it crashes. My accelerate launch parameters are as follows:

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
accelerate launch train_dreambooth_lora_flux.py \
  --pretrained_model_name_or_path="black-forest-labs/FLUX.1-dev" \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --mixed_precision="bf16" \
  --instance_prompt="p3r5on" \
  --resolution=1024 \
  --train_batch_size=1 \
  --guidance_scale=1 \
  --gradient_accumulation_steps=4 \
  --optimizer="prodigy" \
  --learning_rate=1e-2 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=2000 \
  --validation_prompt="p3r5on in apartment" \
  --validation_epochs=25 \
  --use_8bit_adam \
  --seed="0"

riflemanl avatar Nov 08 '24 12:11 riflemanl

24 GB is not enough; what's your hardware setup? Try reducing the batch size and resolution.

leisuzz avatar Nov 08 '24 14:11 leisuzz

> 24 GB is not enough; what's your hardware setup? Try reducing the batch size and resolution.

@leisuzz: I gave my hardware details at the beginning. I've tried reducing the resolution to 512, and even 256, but the OOM stays the same... I think I'll have to train in fp8, but the training code doesn't support fp8 yet.

riflemanl avatar Nov 08 '24 14:11 riflemanl

Batch size 1 with resolution 256 costs around 40 GB on my 8-GPU setup. I think you should try offloading to CPU.

leisuzz avatar Nov 08 '24 14:11 leisuzz

Where should I add the CPU offload code?

riflemanl avatar Nov 08 '24 15:11 riflemanl

Try with DeepSpeed, but I don't think one GPU is enough.
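If it helps, CPU offload with DeepSpeed does not require changing the training code; it can be configured through accelerate's launcher flags. A rough, untested sketch (the ZeRO stage and offload target are illustrative values, not a verified recipe; ZeRO-3 would additionally allow `--offload_param_device cpu`):

```shell
# DeepSpeed ZeRO-2 with optimizer state offloaded to CPU, configured entirely
# via accelerate's launcher flags (illustrative values, not a tested recipe).
accelerate launch \
  --use_deepspeed \
  --zero_stage 2 \
  --offload_optimizer_device cpu \
  --mixed_precision bf16 \
  train_dreambooth_lora_flux.py \
  --pretrained_model_name_or_path="black-forest-labs/FLUX.1-dev" \
  --instance_data_dir="$INSTANCE_DIR" \
  --output_dir="$OUTPUT_DIR" \
  --mixed_precision="bf16" \
  --instance_prompt="p3r5on" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --max_train_steps=2000 \
  --seed="0"
```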

leisuzz avatar Nov 08 '24 15:11 leisuzz

But ai-toolkit can train a 1024 Flux.1-dev LoRA without problems; it just cannot use my 2 GPUs to speed things up, which is why I'm trying diffusers + accelerate here. If diffusers can't manage it, or can only manage 256, I'll just have to give up.

riflemanl avatar Nov 08 '24 15:11 riflemanl

It also depends on the size of the dataset

leisuzz avatar Nov 08 '24 15:11 leisuzz

This should easily fit within a 24GB GPU: https://github.com/huggingface/diffusers/tree/main/examples/research_projects/flux_lora_quantization

I currently don't have the bandwidth to debug this further, but if you don't want to use quantization, you can consider other trainers like https://github.com/ostris/ai-toolkit/.

sayakpaul avatar Nov 08 '24 23:11 sayakpaul

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Dec 03 '24 15:12 github-actions[bot]

Closing this due to inactivity.

sayakpaul avatar Dec 03 '24 16:12 sayakpaul

You can change `--num_validation_images` to 1.

gityihang avatar Feb 21 '25 09:02 gityihang