
fused_backward_pass in prodigy-plus-schedule-free [SOLVED VIA EXTERNAL SOLUTION ✅]

Open deGENERATIVE-SQUAD opened this issue 1 year ago • 7 comments

Hi, Kohya. I know fused_backward_pass is hardcoded to Adafactor, but prodigy-plus-schedule-free (https://github.com/LoganBooker/prodigy-plus-schedule-free) already has that feature built in, and we can't use it here. More precisely, we can pass the optimizer's built-in argument ourselves, but that breaks the training process. Could you add some more flexibility for this case, please?
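
For context on what the flag does: a fused backward pass applies the optimizer update to each parameter as soon as that parameter's gradient has been accumulated, then frees the gradient, instead of holding all gradients until a single optimizer.step() call. A minimal, generic PyTorch sketch of the idea (not the sd-scripts or prodigy-plus-schedule-free implementation; plain SGD stands in for the real per-parameter step):

```python
import torch

def attach_fused_step(params, lr=1e-4):
    # Requires PyTorch >= 2.1 for register_post_accumulate_grad_hook.
    def hook(param: torch.Tensor) -> None:
        # Fires right after this parameter's gradient is fully accumulated,
        # so we can update the weight and free the gradient before backward() finishes.
        param.data.add_(param.grad, alpha=-lr)
        param.grad = None
    for p in params:
        if p.requires_grad:
            p.register_post_accumulate_grad_hook(hook)

model = torch.nn.Linear(16, 4)
attach_fused_step(model.parameters())
loss = model(torch.randn(8, 16)).pow(2).mean()
loss.backward()  # weights are updated during backward; no separate optimizer.step()
```

Since gradients are freed as soon as they are consumed, peak gradient memory drops, which is what both the Adafactor path in sd-scripts and the optimizer's built-in option aim for.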

deGENERATIVE-SQUAD avatar Dec 13 '24 00:12 deGENERATIVE-SQUAD

#1866

michP247 avatar Jan 06 '25 03:01 michP247

#1866

I tried both the LoRA algorithm and the GLoRA+DoRA algorithm on SDXL: no noticeable decrease in VRAM usage, and the speed is the same with and without fused_backward_pass.

Example settings:

accelerate launch --num_cpu_threads_per_process 8 sdxl_train_network.py ^
--pretrained_model_name_or_path="model.safetensors" ^
--train_data_dir="LP3" ^
--output_dir="output_dir" ^
--output_name="LP3-glora-dora-1024p-batch1-reso2k-psteps500-biastrue-d1e4-lossl2-dcoef01-unetlr1-snr1-randcrop-bigasp-dropout01-conv32conv1-netdim32netalpha1-trainnorm-fbwp" ^
--network_args "algo=glora" "dropout=0.1" "conv_dim=32" "conv_alpha=1" "train_norm=True" "dora_wd=True" ^
--resolution="1024,1024" ^
--save_model_as="safetensors" ^
--network_module="lycoris.kohya" ^
--max_train_steps=1000 ^
--save_every_n_epochs=1 ^
--save_every_n_steps=100 ^
--save_state_on_train_end ^
--network_dim=32 ^
--network_alpha=1 ^
--train_batch_size=1 ^
--max_data_loader_n_workers=0 ^
--random_crop ^
--enable_bucket ^
--bucket_reso_steps=64 ^
--min_bucket_reso=768 ^
--max_bucket_reso=2048 ^
--mixed_precision="bf16" ^
--caption_extension=".txt" ^
--noise_offset=0.05 ^
--multires_noise_discount=0.2 ^
--multires_noise_iterations=7 ^
--gradient_checkpointing ^
--fused_backward_pass ^
--optimizer_type="prodigyplus.ProdigyPlusScheduleFree" ^
--optimizer_args "d0=1e-4" "prodigy_steps=500" "d_coef=0.1" "use_bias_correction=True" "use_adopt=False" "weight_decay_by_lr=True" ^
--loss_type="l2" ^
--unet_lr=1.0 ^
--network_train_unet_only ^
--min_snr_gamma=1 ^
--prior_loss_weight=1 ^
--seed=0 ^
--logging_dir="logs" ^

deGENERATIVE-SQUAD avatar Jan 08 '25 05:01 deGENERATIVE-SQUAD

Try it without my proposed changes; apparently it was already working if you set --fused_back_pass as an optimizer arg.
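
For reference, passing it through the optimizer rather than via Kohya's CLI flag would look roughly like this (a fragment with other arguments omitted; the key follows the optimizer's fused_back_pass option mentioned above):

--optimizer_type="prodigyplus.ProdigyPlusScheduleFree" --optimizer_args "d0=1e-4" "fused_back_pass=True"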

michP247 avatar Jan 08 '25 12:01 michP247

Try it without my proposed changes; apparently it was already working if you set --fused_back_pass as an optimizer arg.

Tested it earlier:

  1. fused_back_pass as an optimizer argument does not work in the dev branch: the LoRA simply does not work, with no changes in the resulting image after tons of steps/epochs. If I retrain without this argument, everything works as expected.
  2. fused_back_pass as an optimizer argument does not work in your branch either: the effect is the same as in option 1.

Training speed is about 30% faster than usual during the process. I don't remember the exact VRAM usage, but there was no significant decrease.

I also tried --fused_backward_pass together with the optimizer argument, but the resulting LoRA does not work, just like in the cases above.

By "LoRA not working" I mean it trains without errors or NaNs and TensorBoard shows normal-looking graphs, but when I apply the resulting LoRA to the model there is no change in the image, even at a weight of 1000.

deGENERATIVE-SQUAD avatar Jan 08 '25 13:01 deGENERATIVE-SQUAD

The problem is solved in the new version of https://github.com/LoganBooker/prodigy-plus-schedule-free. According to the issue thread https://github.com/LoganBooker/prodigy-plus-schedule-free/issues/7, it now works with both full finetunes and LoRAs (the problem was the lack of fused-pass support for LoRA in sd-scripts).

deGENERATIVE-SQUAD avatar Jan 10 '25 18:01 deGENERATIVE-SQUAD

@deGENERATIVE-SQUAD what optimizer parameters are required? I did several trainings and they all came out the same, with no learning :D

Are these the only things needed?

what is d0=1e-4 for?

--optimizer_args "d0=1e-4" "prodigy_steps=500" "d_coef=0.1" "use_bias_correction=True" "use_adopt=False"

FurkanGozukara avatar Feb 25 '25 09:02 FurkanGozukara

@FurkanGozukara

what optimizer parameters are required?

Basically: d0, eps, d_coef, use_stableadamw, and stochastic_rounding.
Optionally: use_bias_correction, factored, weight_decay, split_groups (for learning rate splitting between U-Net and TE).
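
Put together as sd-scripts optimizer args, that would look roughly like this (values are placeholders, not a recommendation; tune d0 and d_coef for your setup as described below):

--optimizer_args "d0=1e-4" "eps=1e-8" "d_coef=1" "use_stableadamw=True" "stochastic_rounding=True" "split_groups=True"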

i did trainings and they are all equal no learning :D

Please show your full config.

what is d0=1e-4 for?

It represents the basic learning rate "floor". d_coef is the multiplier, and the standard learning rates for TE and U-Net act as additional multipliers, which default to 1 in the base configuration.

--optimizer_args "d0=1e-4" "prodigy_steps=500" "d_coef=0.1" "use_bias_correction=True" "use_adopt=False"

d0 depends on your dataset, algorithm, dampening, etc. prodigy_steps is not required in the latest optimizer release. d_coef acts like a multiplier: with 0.1 you reduced the 1e-4 floor by 10x. It's fine to keep it within a range of 0.5 to 2, depending on your algorithm. Bias correction slows down the adaptation.
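
A back-of-the-envelope reading of those multipliers, using the values quoted above (illustrative arithmetic only, not the optimizer's internals; Prodigy then adapts the step size upward from this floor during training):

```python
d0 = 1e-4       # base learning-rate "floor" from --optimizer_args
d_coef = 0.1    # Prodigy multiplier from --optimizer_args
unet_lr = 1.0   # the sd-scripts U-Net learning rate acts as an extra multiplier
initial_effective_lr = d0 * d_coef * unet_lr
print(f"{initial_effective_lr:.0e}")  # 1e-05: the 1e-4 floor reduced 10x by d_coef=0.1
```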

deGENERATIVE-SQUAD avatar Mar 01 '25 20:03 deGENERATIVE-SQUAD