fused_backward_pass in prodigy-plus-schedule-free [SOLVED VIA EXTERNAL SOLUTION ✅]
Hi, Kohya. I know you hardcoded fused_backward_pass to Adafactor, but prodigy-plus-schedule-free (https://github.com/LoganBooker/prodigy-plus-schedule-free) already has that feature built in, yet we can't use it. To be precise, we can pass the built-in argument itself, but then it breaks the training process. Can you add more flexibility here, please?
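For context, a fused backward pass means each parameter is stepped as soon as its gradient has been accumulated, so the gradient can be freed immediately instead of being held until one big optimizer.step() at the end. Below is a minimal PyTorch sketch of the general pattern; the step_param name is only illustrative and is not necessarily the hook prodigy-plus-schedule-free actually exposes.

```python
import torch

def register_fused_backward(optimizer):
    # Assumption: `optimizer` exposes a per-parameter step, called
    # `step_param` here purely for illustration.
    for group in optimizer.param_groups:
        for param in group["params"]:
            if not param.requires_grad:
                continue

            def hook(p, group=group):
                # Step this parameter right away, then drop its gradient
                # so peak VRAM stays lower than with a normal step().
                optimizer.step_param(p, group)  # illustrative name
                p.grad = None

            # PyTorch >= 2.1: fires once the gradient for this leaf
            # tensor has been fully accumulated.
            param.register_post_accumulate_grad_hook(hook)
```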
#1866
I tried both the LoRA algorithm and the GLoRA+DoRA algorithm on SDXL: no noticeable decrease in VRAM usage. Speeds are also the same with and without fused_backward_pass.
Example settings:
accelerate launch --num_cpu_threads_per_process 8 sdxl_train_network.py ^
  --pretrained_model_name_or_path="model.safetensors" ^
  --train_data_dir="LP3" ^
  --output_dir="output_dir" ^
  --output_name="LP3-glora-dora-1024p-batch1-reso2k-psteps500-biastrue-d1e4-lossl2-dcoef01-unetlr1-snr1-randcrop-bigasp-dropout01-conv32conv1-netdim32netalpha1-trainnorm-fbwp" ^
  --network_args "algo=glora" "dropout=0.1" "conv_dim=32" "conv_alpha=1" "train_norm=True" "dora_wd=True" ^
  --resolution="1024,1024" ^
  --save_model_as="safetensors" ^
  --network_module="lycoris.kohya" ^
  --max_train_steps=1000 ^
  --save_every_n_epochs=1 ^
  --save_every_n_steps=100 ^
  --save_state_on_train_end ^
  --network_dim=32 ^
  --network_alpha=1 ^
  --train_batch_size=1 ^
  --max_data_loader_n_workers=0 ^
  --random_crop ^
  --enable_bucket ^
  --bucket_reso_steps=64 ^
  --min_bucket_reso=768 ^
  --max_bucket_reso=2048 ^
  --mixed_precision="bf16" ^
  --caption_extension=".txt" ^
  --noise_offset=0.05 ^
  --multires_noise_discount=0.2 ^
  --multires_noise_iterations=7 ^
  --gradient_checkpointing ^
  --fused_backward_pass ^
  --optimizer_type="prodigyplus.ProdigyPlusScheduleFree" ^
  --optimizer_args "d0=1e-4" "prodigy_steps=500" "d_coef=0.1" "use_bias_correction=True" "use_adopt=False" "weight_decay_by_lr=True" ^
  --loss_type="l2" ^
  --unet_lr=1.0 ^
  --network_train_unet_only ^
  --min_snr_gamma=1 ^
  --prior_loss_weight=1 ^
  --seed=0 ^
  --logging_dir="logs"
Try it without my proposed changes; apparently it was already working if you set fused_back_pass as an optimizer arg.
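For anyone following along, that would look roughly like this (fused_back_pass is the optimizer option mentioned above; the other values are just the ones from the config earlier in this thread):

```
--optimizer_type="prodigyplus.ProdigyPlusScheduleFree" --optimizer_args "fused_back_pass=True" "d0=1e-4" "d_coef=0.1"
```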
Tested it earlier:
- fused_back_pass as an optimizer argument is not working in the dev branch: the LoRA simply does not work (no changes in the resulting image after tons of steps/epochs). If I retrain without this argument, everything works as expected.
- fused_back_pass as an optimizer argument is not working in your branch either: the effect is the same as in point 1.
Training speed was about 30% faster than usual. I don’t remember the exact VRAM usage, but there was no significant decrease.
I also used --fused_backward_pass together with the optimizer argument, but the resulting LoRA is not working, just like in the cases above.
By 'LoRA not working' I mean it trains without errors or NaNs and TensorBoard shows normal graphs, but if I apply the resulting LoRA to the model, there is no change in the image even at a weight of 1000.
The problem is solved in the new version of https://github.com/LoganBooker/prodigy-plus-schedule-free. According to the issue thread https://github.com/LoganBooker/prodigy-plus-schedule-free/issues/7, it now works with full finetunes and LoRAs (the problem was the lack of fused support for LoRA in sd-scripts).
@deGENERATIVE-SQUAD what optimizer parameters are required? I did trainings and they are all equal, no learning :D
Are these the only ones?
what is d0=1e-4 for?
--optimizer_args "d0=1e-4" "prodigy_steps=500" "d_coef=0.1" "use_bias_correction=True" "use_adopt=False"
@FurkanGozukara
what optimizer parameters are required?
Basically: d0, eps, d_coef, use_stableadamw, and stochastic_rounding.
Optionally: use_bias_correction, factored, weight_decay, split_groups (for learning rate splitting between U-Net and TE).
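Purely as an illustration of how those would be passed (placeholder values, not recommendations):

```
--optimizer_args "d0=1e-4" "eps=1e-8" "d_coef=1.0" "use_stableadamw=True" "stochastic_rounding=True" "use_bias_correction=False" "factored=False" "weight_decay=0.01" "split_groups=True"
```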
i did trainings and they are all equal no learning :D
Please show your full config.
what is d0=1e-4 for?
It represents the basic learning rate "floor". d_coef is the multiplier, and the standard learning rates for TE and U-Net act as additional multipliers, which default to 1 in the base configuration.
--optimizer_args "d0=1e-4" "prodigy_steps=500" "d_coef=0.1" "use_bias_correction=True" "use_adopt=False"
d0 depends on your dataset, algorithm, dampening, etc. prodigy_steps is not required in the latest optimizer release. d_coef acts like a multiplier: with 0.1 you reduced the floor of 1e-4 by 10x. It’s fine to use it within a range of 0.5 to 2, depending on your algorithm. Bias correction slows down the adaptation.
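To make the arithmetic concrete (a rough sketch only; the actual step size also adapts over time via Prodigy's d estimate):

```python
d0 = 1e-4      # learning rate "floor"
d_coef = 0.1   # multiplier from --optimizer_args
unet_lr = 1.0  # the standard LR acts as an extra multiplier (default 1)

# Rough effective starting step size with the settings quoted above:
effective_lr = d0 * d_coef * unet_lr   # ~1e-5
```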