sd-scripts
the code of new version is using more VRAM
Hey, I am encountering the same problem today!!
I have two clones of sd-scripts. One was cloned in December 2023, and the other was downloaded on February 28, 2024.
But I found that the new code always reports "out of memory" when using the same configuration as follows:
--pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0 \
--vae=madebyollin/sdxl-vae-fp16-fix \
--dataset_config=/home/lyh/sdvs/sd-scripts/config/finetune.toml \
--output_dir=/home/lyh/sd-scripts/output/finetune_15W \
--output_name=finetune_15W \
--save_model_as=safetensors \
--save_every_n_epochs=1 \
--save_precision="fp16" \
--max_token_length=225 \
--min_timestep=0 \
--max_timestep=1000 \
--max_train_epochs=2000 \
--learning_rate=4e-6 \
--lr_scheduler="constant" \
--optimizer_type="AdamW8bit" \
--xformers \
--gradient_checkpointing \
--gradient_accumulation_steps=128 \
--mem_eff_attn \
--mixed_precision="fp16" \
--logging_dir=logs \
The weird thing is:
The VRAM usage with the new code:
The VRAM usage with the old code:
Why? What is different?
PR #989 was merged into the main branch on Dec 21, 2023. I think it may cause this issue. Please add --ddp_gradient_as_bucket_view and --ddp_static_graph to reduce VRAM usage with multi-GPU training.
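For reference, here is a minimal sketch of how these two flags typically map onto PyTorch DDP options through Accelerate's kwargs handler. This is an assumption about the mechanism, not the exact sd-scripts code:

from accelerate import Accelerator
from accelerate.utils import DistributedDataParallelKwargs

# --ddp_gradient_as_bucket_view -> gradients share storage with DDP's
# communication buckets instead of being copied, saving one gradient-sized buffer.
# --ddp_static_graph -> tells DDP the graph is identical every iteration,
# allowing it to skip some bookkeeping and reuse internal buffers.
ddp_kwargs = DistributedDataParallelKwargs(
    gradient_as_bucket_view=True,
    static_graph=True,
)
accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])
# The model, optimizer, and dataloader are then wrapped as usual with accelerator.prepare(...).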
@kohya-ss these flags didn't help me with multi-GPU training... I have 3x4090 onboard. This is how I run sdxl_train.py with a single GPU (I set it up for accelerate as well):
accelerate launch --num_cpu_threads_per_process=2 "/home/storuky/ml/train/kohya/sd-scripts/sdxl_train.py" --cache_text_encoder_outputs --bucket_no_upscale --bucket_reso_steps=64 --cache_latents --cache_latents_to_disk --caption_extension=".txt" --enable_bucket --min_bucket_reso=256 --max_bucket_reso=1024 --gradient_checkpointing --learning_rate="2e-06" --logging_dir="/home/storuky/ml/train/data_dir/log" --lr_scheduler="constant" --max_data_loader_n_workers="0" --resolution="1024,1024" --max_train_steps="3200" --mixed_precision="bf16" --optimizer_args scale_parameter=False relative_step=False warmup_init=False --optimizer_type="Adafactor" --output_dir="/home/storuky/ml/train/data_dir/model" --output_name="OutModel" --pretrained_model_name_or_path="/home/storuky/ml/sd/stable-diffusion-webui-forge/models/Stable-diffusion/Training2-000005.safetensors" --reg_data_dir="/home/storuky/ml/train/data_dir/reg" --save_every_n_epochs="1" --save_model_as=safetensors --save_precision="bf16" --train_batch_size="1" --train_data_dir="/home/storuky/ml/train/data_dir/img" --xformers
It's going well.
But when I configure accelerate to use 3 GPUs and run this command with --multi_gpu and --num_processes=3:
accelerate launch --num_processes=3 --multi_gpu --num_cpu_threads_per_process=2 "/home/storuky/ml/train/kohya/sd-scripts/sdxl_train.py" --cache_text_encoder_outputs --bucket_no_upscale --bucket_reso_steps=64 --cache_latents --cache_latents_to_disk --caption_extension=".txt" --enable_bucket --min_bucket_reso=256 --max_bucket_reso=1024 --gradient_checkpointing --learning_rate="2e-06" --logging_dir="/home/storuky/ml/train/data_dir/log" --lr_scheduler="constant" --max_data_loader_n_workers="0" --resolution="1024,1024" --max_train_steps="3200" --mixed_precision="bf16" --optimizer_args scale_parameter=False relative_step=False warmup_init=False --optimizer_type="Adafactor" --output_dir="/home/storuky/ml/train/data_dir/model" --output_name="OutModel" --pretrained_model_name_or_path="/home/storuky/ml/sd/stable-diffusion-webui-forge/models/Stable-diffusion/Training2-000005.safetensors" --reg_data_dir="/home/storuky/ml/train/data_dir/reg" --save_every_n_epochs="1" --save_model_as=safetensors --save_precision="bf16" --train_batch_size="1" --train_data_dir="/home/storuky/ml/train/data_dir/img" --xformers
I'm getting an OOM error. I tried adding --ddp_gradient_as_bucket_view and --ddp_static_graph as you suggested, but I'm still getting OOM.
I reverted this PR locally and now it uses less VRAM.
PR #989 fixes gradient synchronization. If #989 is reverted, the gradients are not synchronized, so it is similar to single-GPU training in my understanding.
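For context, a rough conceptual sketch of what gradient synchronization amounts to after backward() on each rank; this is my simplified illustration, not the actual sd-scripts or DDP internals. The bucketed, overlapped version DDP uses needs extra communication buffers, which is where the additional VRAM comes from:

import torch.distributed as dist

def sync_gradients(model, world_size):
    # Average each parameter's gradient across all ranks. DDP does this
    # automatically in a bucketed, overlapped fashion, at the cost of extra memory.
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad.div_(world_size)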
I'm not familiar with multi-GPU training, but could you try the training with the --full_bf16 option? If it works, there might be some overhead in multi-GPU training, and 24 GB may not be sufficient.
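As a rough back-of-the-envelope illustration of why --full_bf16 saves VRAM (the parameter count is approximate, the exact savings depend on the optimizer state, and this is not the sd-scripts implementation):

# As I understand it, mixed_precision="bf16" still keeps fp32 weights and runs
# compute in bf16, while --full_bf16 also stores the weights themselves in bf16
# (2 bytes/param instead of 4).
unet_params = 2_600_000_000  # ~2.6B parameters for the SDXL U-Net (approximate)
fp32_gib = unet_params * 4 / 2**30
bf16_gib = unet_params * 2 / 2**30
print(f"weights in fp32: {fp32_gib:.1f} GiB, weights in bf16: {bf16_gib:.1f} GiB")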
@kohya-ss yes, full_bf16 works well (in terms of VRAM usage), but it gives much worse results in terms of accuracy 🤷♂️ For example, hair sticks together as if dirty, small detailed objects turn into blots, etc. Probably full_bf16 needs a different optimizer/LR/scheduler setup... Do you have any handy notes on what we need to know about full_bf16?