sd-scripts
the code of new version is using more VRAM
Hey, I am encountering the same problem today!!
I have two clones of sd-scripts. One was cloned in December 2023, and the other was downloaded on February 28, 2024.
But I found that the new code always reports "out of memory" when using the same configuration as follows:
--pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0 \
--vae=madebyollin/sdxl-vae-fp16-fix \
--dataset_config=/home/lyh/sdvs/sd-scripts/config/finetune.toml \
--output_dir=/home/lyh/sd-scripts/output/finetune_15W \
--output_name=finetune_15W \
--save_model_as=safetensors \
--save_every_n_epochs=1 \
--save_precision="fp16" \
--max_token_length=225 \
--min_timestep=0 \
--max_timestep=1000 \
--max_train_epochs=2000 \
--learning_rate=4e-6 \
--lr_scheduler="constant" \
--optimizer_type="AdamW8bit" \
--xformers \
--gradient_checkpointing \
--gradient_accumulation_steps=128 \
--mem_eff_attn \
--mixed_precision="fp16" \
--logging_dir=logs \
The weird thing is:
The VRAM usage with the new code:
The VRAM usage with the old code:
Why? What is different?
PR #989 was merged into the main branch on Dec 21, 2023. I think it may cause this issue. Please add --ddp_gradient_as_bucket_view and --ddp_static_graph to reduce VRAM usage with multi-GPU training.
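For reference, here is a minimal sketch of how these two flags typically map onto PyTorch DDP options through Accelerate's kwargs handler. This is an assumption about the mechanism, not the exact sd-scripts code:

from accelerate import Accelerator
from accelerate.utils import DistributedDataParallelKwargs

# --ddp_gradient_as_bucket_view -> gradients share storage with DDP's
# communication buckets instead of being copied, saving one gradient-sized buffer.
# --ddp_static_graph -> tells DDP the graph is identical every iteration,
# allowing it to skip some bookkeeping and reuse internal buffers.
ddp_kwargs = DistributedDataParallelKwargs(
    gradient_as_bucket_view=True,
    static_graph=True,
)
accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])
# The model, optimizer, and dataloader are then wrapped as usual with accelerator.prepare(...).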
@kohya-ss these flags didn't help me with multi-GPU training... I have 3x4090 onboard. This is how I run sdxl_train.py with a single GPU (I set it up for accelerate as well):
accelerate launch --num_cpu_threads_per_process=2 "/home/storuky/ml/train/kohya/sd-scripts/sdxl_train.py" --cache_text_encoder_outputs --bucket_no_upscale --bucket_reso_steps=64 --cache_latents --cache_latents_to_disk --caption_extension=".txt" --enable_bucket --min_bucket_reso=256 --max_bucket_reso=1024 --gradient_checkpointing --learning_rate="2e-06" --logging_dir="/home/storuky/ml/train/data_dir/log" --lr_scheduler="constant" --max_data_loader_n_workers="0" --resolution="1024,1024" --max_train_steps="3200" --mixed_precision="bf16" --optimizer_args scale_parameter=False relative_step=False warmup_init=False --optimizer_type="Adafactor" --output_dir="/home/storuky/ml/train/data_dir/model" --output_name="OutModel" --pretrained_model_name_or_path="/home/storuky/ml/sd/stable-diffusion-webui-forge/models/Stable-diffusion/Training2-000005.safetensors" --reg_data_dir="/home/storuky/ml/train/data_dir/reg" --save_every_n_epochs="1" --save_model_as=safetensors --save_precision="bf16" --train_batch_size="1" --train_data_dir="/home/storuky/ml/train/data_dir/img" --xformers
It's going well.
But when I configure accelerate to use 3 GPUs and run this command with --multi_gpu and --num_processes=3:
accelerate launch --num_processes=3 --multi_gpu --num_cpu_threads_per_process=2 "/home/storuky/ml/train/kohya/sd-scripts/sdxl_train.py" --cache_text_encoder_outputs --bucket_no_upscale --bucket_reso_steps=64 --cache_latents --cache_latents_to_disk --caption_extension=".txt" --enable_bucket --min_bucket_reso=256 --max_bucket_reso=1024 --gradient_checkpointing --learning_rate="2e-06" --logging_dir="/home/storuky/ml/train/data_dir/log" --lr_scheduler="constant" --max_data_loader_n_workers="0" --resolution="1024,1024" --max_train_steps="3200" --mixed_precision="bf16" --optimizer_args scale_parameter=False relative_step=False warmup_init=False --optimizer_type="Adafactor" --output_dir="/home/storuky/ml/train/data_dir/model" --output_name="OutModel" --pretrained_model_name_or_path="/home/storuky/ml/sd/stable-diffusion-webui-forge/models/Stable-diffusion/Training2-000005.safetensors" --reg_data_dir="/home/storuky/ml/train/data_dir/reg" --save_every_n_epochs="1" --save_model_as=safetensors --save_precision="bf16" --train_batch_size="1" --train_data_dir="/home/storuky/ml/train/data_dir/img" --xformers
I'm getting an OOM error. I tried adding --ddp_gradient_as_bucket_view and --ddp_static_graph as you suggested, but I'm still getting OOM.
I reverted this PR locally and now it uses less VRAM.
PR #989 fixes gradient synchronization. If #989 is reverted, the gradients are not synchronized, so it is similar to single-GPU training in my understanding.
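For context, a rough conceptual sketch of what gradient synchronization amounts to after backward() on each rank; this is my simplified illustration, not the actual sd-scripts or DDP internals. The bucketed, overlapped version DDP uses needs extra communication buffers, which is where the additional VRAM comes from:

import torch.distributed as dist

def sync_gradients(model, world_size):
    # Average each parameter's gradient across all ranks. DDP does this
    # automatically in a bucketed, overlapped fashion, at the cost of extra memory.
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad.div_(world_size)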
I'm not familiar with multi-GPU training, but could you try the training with the --full_bf16 option? If it works, there might be some overhead in multi-GPU training, and 24 GB may not be sufficient.
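As a rough back-of-the-envelope illustration of why --full_bf16 saves VRAM (the parameter count is approximate, the exact savings depend on the optimizer state, and this is not the sd-scripts implementation):

# As I understand it, mixed_precision="bf16" still keeps fp32 weights and runs
# compute in bf16, while --full_bf16 also stores the weights themselves in bf16
# (2 bytes/param instead of 4).
unet_params = 2_600_000_000  # ~2.6B parameters for the SDXL U-Net (approximate)
fp32_gib = unet_params * 4 / 2**30
bf16_gib = unet_params * 2 / 2**30
print(f"weights in fp32: {fp32_gib:.1f} GiB, weights in bf16: {bf16_gib:.1f} GiB")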
@kohya-ss yes, full_bf16 works well (in terms of VRAM usage), but it gives much worse results in terms of accuracy 🤷♂️ For example, hair sticks together as if dirty, small detailed objects turn into blots, etc. Probably full_bf16 needs a different optimizer/LR/scheduler setup... Do you have any handy notes on what we need to know about full_bf16?