sd_dreambooth_extension

Lora training step time >2x increase after checkpoint

Open hdon96 opened this issue 2 years ago • 4 comments

Python 3.9.13 (tags/v3.9.13:6de2ca5, May 17 2022, 16:36:42) [MSC v.1929 64 bit (AMD64)]
Commit hash: 685f9631b56ff8bd43bce24ff5ce0f9a0e9af490
Installing requirements for Web UI
[auto-sd-paint-ext] Attempting auto-update...
[auto-sd-paint-ext] Fetch upstream.
[auto-sd-paint-ext] Pull upstream.

Installing requirements for scikit_learn

#######################################################################################################
Initializing Dreambooth
If submitting an issue on github, please provide the below text for debugging purposes:

Python revision: 3.9.13 (tags/v3.9.13:6de2ca5, May 17 2022, 16:36:42) [MSC v.1929 64 bit (AMD64)]
Dreambooth revision: f2ba1dbdd0d8eaeba9502d69355d74c3044fe432
SD-WebUI revision: 685f9631b56ff8bd43bce24ff5ce0f9a0e9af490

Checking Dreambooth requirements...
[+] bitsandbytes version 0.35.0 installed.
[+] diffusers version 0.10.2 installed.
[+] transformers version 4.25.1 installed.
[+] xformers version 0.0.15.dev0+c101579.d20221117 installed.
[+] torch version 1.12.1+cu116 installed.
[+] torchvision version 0.13.1+cu116 installed.
#######################################################################################################

Have you read the Readme? yes
Have you completely restarted the stable-diffusion-webUI, not just reloaded the UI? yes
Have you updated Dreambooth to the latest revision? yes
Have you updated the Stable-Diffusion-WebUI to the latest version? yes
No, really. Please save us both some trouble and update the SD-WebUI and Extension and restart before posting this. Reply 'OK' Below to acknowledge that you did this. OK

Describe the bug

Lora training begins taking twice as long per step after the first training checkpoint where the model is saved and sampled. Up to the checkpoint at step 500 it was running at ~8 s/it; after the save it jumps to ~17 s/it for no apparent reason.
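For pinning down exactly where the jump happens, a minimal diagnostic sketch (a hypothetical helper, not part of the extension) that logs wall-clock time per optimizer step could be dropped into the training loop:

import time

class StepTimer:
    """Log wall-clock time per training step so a before/after-checkpoint jump shows up clearly."""
    def __init__(self):
        self.last = None

    def tick(self, step, note=""):
        now = time.perf_counter()
        if self.last is not None:
            print(f"step {step}: {now - self.last:.2f} s/it {note}".rstrip())
        self.last = now

# Usage (hypothetical placement inside the training loop, once per optimizer step):
#     timer = StepTimer()
#     timer.tick(global_step)
# and again right after a checkpoint save / sample generation:
#     timer.tick(global_step, note="(post-checkpoint)")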

Provide logs

Compiling checkpoint for ...
Applying lora model...
Saving checkpoint to F:\SD_MODELS\<model>_250_lora.ckpt...
Generating samples: 100%|████████████████████████████████████████████████████████████████| 1/1 [00:21<00:00, 21.57s/it]
[*] Weights saved at C:\Users\<user>\stable-diffusion-webui\models\dreambooth\<model>
Steps:   1%|▎ | 500/42400 [1:01:59<91:42:28, 7.88s/it, loss=0.846, lr=1e-6, vram=5.9/9.6GB]
Saving lora weights at step 500
Allocated 7.5/9.3GB Reserved: 8.5/9.8GB

Compiling checkpoint for ...
Applying lora model...
Saving checkpoint to F:\SD_MODELS\<model>_500_lora.ckpt...
Generating samples: 100%|████████████████████████████████████████████████████████████████| 1/1 [00:22<00:00, 22.39s/it]
[*] Weights saved at C:\Users\<user>\stable-diffusion-webui\models\dreambooth\<model>
Steps:   1%|▍ | 552/42400 [1:17:41<198:33:31, 17.08s/it, loss=1.1, lr=1e-6, vram=5.9/10.2GB]
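One thing visible in these logs is that the VRAM figure grows across the checkpoint (vram=5.9/9.6GB before, 5.9/10.2GB after, with Reserved: 8.5/9.8GB reported at the save). A minimal sketch, using standard PyTorch CUDA APIs rather than the extension's own code, for checking whether sampling/saving leaves extra memory cached by the allocator and for releasing it afterwards:

import gc
import torch

def report_vram(tag):
    # Standard PyTorch counters; "reserved" is what the caching allocator is holding.
    alloc = torch.cuda.memory_allocated() / 1024 ** 3
    reserved = torch.cuda.memory_reserved() / 1024 ** 3
    print(f"[{tag}] allocated {alloc:.1f} GB, reserved {reserved:.1f} GB")

report_vram("before checkpoint")
# ... checkpoint save / sample generation would happen here ...
report_vram("after checkpoint")

# Drop stale Python references and return cached blocks to the driver; this does not
# free tensors still in use, but it can shrink the reserved figure after sampling.
gc.collect()
torch.cuda.empty_cache()
report_vram("after empty_cache")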

Environment

What OS? Windows
If Windows - WSL or native? native Windows
What GPU are you using? 1080 Ti (11gb)

Screenshots/Config
If the issue is specific to an error while training, please provide a screenshot of training parameters or the db_config.json file from /models/dreambooth/MODELNAME/db_config.json

{ "adam_beta1": 0.9, "adam_beta2": 0.999, "adam_epsilon": 1e-08, "adam_weight_decay": 0.01, "attention": "xformers", "center_crop": false, "concepts_path": "", "epoch_pause_frequency": 0.0, "epoch_pause_time": 60.0, "gradient_accumulation_steps": 1, "gradient_checkpointing": true, "half_model": false, "hflip": false, "learning_rate": 1e-06, "lr_scheduler": "constant", "lr_warmup_steps": 500, "max_grad_norm": 1, "max_token_length": 75, "max_train_steps": 0, "mixed_precision": "fp16", "model_dir": "C:\Users\\stable-diffusion-webui\models\dreambooth\", "model_name": "", "not_cache_latents": false, "num_train_epochs": 100, "pad_tokens": true, "pretrained_model_name_or_path": "C:\Users\\stable-diffusion-webui\models\dreambooth\\working", "pretrained_vae_name_or_path": null, "prior_loss_weight": 1, "resolution": 768, "revision": 500, "sample_batch_size": 1, "save_class_txt": true, "save_embedding_every": 250, "save_preview_every": 250, "save_use_global_counts": true, "save_use_epochs": true, "scale_lr": true, "src": "F:SD_MODELS\v2-1_768-ema-pruned.ckpt", "shuffle_tags": false, "train_batch_size": 1, "train_text_encoder": true, "use_8bit_adam": true, "use_concepts": false, "use_cpu": false, "use_ema": false, "use_lora": true, "scheduler": "ddim", "v2": true, "has_ema": "False", "concepts_list": [ { "max_steps": -1, "instance_data_dir": "<instance_path", "class_data_dir": "<class_path>", "instance_prompt": "<instance_prompt>", "class_prompt": "<class_prompt>", "save_sample_prompt": "", "save_sample_template": "", "instance_token": "", "class_token": "", "num_class_images": 2034, "class_negative_prompt": "", "class_guidance_scale": 9, "class_infer_steps": 30, "save_sample_negative_prompt": "", "n_save_sample": 1, "sample_seed": -1, "save_guidance_scale": 9, "save_infer_steps": 30 } ], "lifetime_revision": 500 }

hdon96 · Dec 12 '22 01:12

Somewhere after step 750 the speed went back to normal. Not sure why; there was no different activity on the computer.

hdon96 · Dec 12 '22 03:12

I think it was a background program causing the slowdown.

hdon96 · Dec 12 '22 04:12

Not a background program after all; I observed the same pattern of slowdown and recovery between checkpoints with no background processes running.

hdon96 · Dec 12 '22 17:12

I also gave up on a training session last night because I kept ending up at 7 s/it or 3 s/it when I usually train at ~1.6 s/it.

Is this what you are referring to?

james-things · Dec 12 '22 17:12

I also gave up on a training session last night because I kept ending up at 7 s/it or 3 s/it when I usually train at ~1.6 s/it.

Is this what you are referring to?

Yes, this exactly.

hdon96 · Dec 12 '22 18:12

Same here: went from 1-2 it/s down to 1-2 s/it after updating the webui and Dreambooth.

AndyLonly · Dec 12 '22 21:12

I may have made progress figuring this one out. Are you both using "Apply Horizontal Flip" by chance? Try without it if you are. It seems to have made the difference for me; I will look at it more after this training run.

Edit: never mind, the issue is cropping up again after saving a preview image.
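For what it's worth, a horizontal flip on its own is normally a cheap CPU-side transform, so the augmentation itself is an unlikely culprit. A quick sanity-check sketch using plain torchvision (not the extension's actual augmentation code) to time it at the 768px resolution from the config above:

import time

from PIL import Image
from torchvision import transforms

flip = transforms.RandomHorizontalFlip(p=1.0)  # p=1.0 so every call actually flips
img = Image.new("RGB", (768, 768))             # same resolution as the config above

start = time.perf_counter()
for _ in range(1000):
    flip(img)
elapsed_ms = (time.perf_counter() - start)  # total seconds for 1000 flips
print(f"~{elapsed_ms:.3f} ms per flip")      # seconds/1000 flips == ms per flip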

james-things · Dec 12 '22 22:12

[screenshot] The speed is correct again after the update.

AndyLonly · Dec 13 '22 02:12

Resolved in the new updates! Thanks!

hdon96 · Dec 13 '22 03:12