stable-diffusion-webui
Dreambooth
Add basic UI implementation and stuff to unpack a selected checkpoint and then use it with Dreambooth.
There's also code to re-merge the output with said selected checkpoint, but I can't currently test with my potato because I don't know how to incorporate the necessary "accelerate launch" command to make it only run on GPU.
@AUTOMATIC1111 - Need help with this bit. It's useless to me if I can't get the accelerate launch stuff to work so I can force it just to my GPU, unless you know some other magick to make it work with 8GB.
Also, @AUTOMATIC1111, if you could check your reddit, I sent you a PM.
Naive question… but what does this PR allow users to do? Have you found a way to separate the Dreambooth “changes” and apply them on top of another CKPT?
Or is this to create Dreambooth models via the webui?
It should do all the things. First, you point it at an existing checkpoint, even a custom one.
Then, it'll extract the diffusion models from that checkpoint and set up a working directory for training.
Once set up, you tell it where your training images are, your input prompt, and your "classification" prompt. Set the number of training steps, and let it rip.
I don't have the progress bar, "intermediary images", or "save a checkpoint every N steps" bits added yet, but in theory, it should work to train. I can get it to throw an OOM error, which is what I'd expect since I'm not forcing it to run on my CPU yet.
BUT, once done, it should then take the Dreambooth-generated files and merge them into the selected checkpoint, saving it alongside the others.
Since I'm getting OOM errors and can't use it yet, I can't verify I have the "build a new checkpoint" parts right, but if there is a bug/mistake there, it should be fairly trivial to fix.
Is this supporting the 12gb VRAM GPUs or restricted to 3090 and better? I have a 12GB GPU... this is why I am asking.
UPDATE:
I answered my own question... A 3060 with 12GB won't cut it.
But this looks like a nice PR for those with a 3090.
I'll try and see if I can get it working with a 3090 and get some of the missing features in. Will edit this comment just in case I don't get anywhere before Tues.
Notes for myself:
- save_data_every can be 0 (disabled)
- wrap_gradio_call func can return None (at least it is for me, will need to play in an ipython embed a bit)
  File "/home/unknown/Development/stable-diffusion-webui/modules/dreambooth/dreambooth.py", line 386, in train
    if not global_step % self.save_data_every:
ZeroDivisionError: integer division or modulo by zero
Traceback (most recent call last):
  File "/home/unknown/Development/stable-diffusion-webui/modules/ui.py", line 188, in f
    res = list(func(*args, **kwargs))
TypeError: 'NoneType' object is not iterable
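Both failures in the notes above can be avoided with small guards. The following is a minimal sketch, not the PR's actual code: `should_save` and `as_result_list` are hypothetical helper names, and only `global_step`, `save_data_every`, and the failing `list(func(...))` pattern come from the tracebacks.

```python
def should_save(global_step: int, save_data_every: int) -> bool:
    """Return True when intermediate data should be saved at this step.

    A value of 0 (or less) is treated as "disabled", so the modulo is
    never evaluated with a zero divisor.
    """
    if save_data_every <= 0:
        return False
    return global_step % save_data_every == 0


def as_result_list(result):
    """Convert a callback's return value to a list, treating None as empty."""
    if result is None:
        return []
    return list(result)
```

With guards like these, `should_save(step, 0)` is simply false on every step, and a wrapped function that returns nothing yields an empty result list instead of raising a TypeError.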
To work on a 12GB card such as the 3060, you need to use DeepSpeed.
accelerate launch --use_deepspeed --zero_stage=2 --gradient_accumulation_steps=1 --offload_param_device=cpu --offload_optimizer_device=cpu train_dreambooth.py \
--pretrained_model_name_or_path=$MODEL_NAME --use_auth_token \
--instance_data_dir=$INSTANCE_DIR \
--class_data_dir=$CLASS_DIR \
--output_dir=$OUTPUT_DIR \
--with_prior_preservation --prior_loss_weight=1.0 \
--instance_prompt="a photo of sks dog" \
--class_prompt="a photo of dog" \
--resolution=512 \
--train_batch_size=1 \
--gradient_accumulation_steps=1 \
--gradient_checkpointing \
--learning_rate=5e-6 \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--num_class_images=200 \
--max_train_steps=800 \
--sample_batch_size=2 \
--mixed_precision=fp16
This is from pinkred's comment on the diffusers patch: https://github.com/huggingface/diffusers/pull/735
Note that TTL had to also do explicit casts rather than relying on auto to ensure that everything stayed 16bit.
In hindsight it might be better to just have diffusers as an optional dependency in repositories/, like xformers is, instead of redistributing 2 py files from it in the repo.
I'm only using one file from the HF repo, and it's pretty heavily modified, so it's not really redistributed...
Yeah sorry but this doesn't work for a bunch of people, exactly why is uncertain but it's OOM on my 3080 10GB with 64GB of RAM. (The TTL implementation is supposed to run at 8gb per his account)
Perhaps you could integrate these changes… they apparently allow it to run on 8GB: https://www.reddit.com/r/StableDiffusion/comments/xzbc2h/guide_for_dreambooth_with_8gb_vram_under_windows/?utm_source=share&utm_medium=ios_app&utm_name=iossmf
Weird, it's almost like I mentioned in my initial commit that I currently can't get this version to run due to OOM errors, which is specifically why I'm asking for help with the accelerate launch commands needed to make it run under 8GB. :P
"accelerate config" is literally how I have the stand-alone version running, on windows, on 8GB right now. It's why I chose the base diffusers repo, and it's what I'm asking @AUTOMATIC1111 or anybody else for a bit of help with. ;)
I will try the manual method today, and if I can get things running manually I'll poke at it to see if I can figure something out. I have close to zero Python experience, so not much hope, but who knows.
OK... I see what you are talking about.. the issue is that the activation can't be done using the python script... and this is what is causing the issue. Just for a test... what if activation was done before starting webui? Would that solve this issue?
What do you mean by "activation"? It would be up to the user to run "accelerate config" to set the required params (or maybe we do it with a script, launch.py, etc.). The bit I need to understand is how I can run "accelerate launch" from within the UI, versus from the command line as it's documented. I think it's possible, but I haven't tested it yet.
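One possible approach, which nothing in this thread confirms was used, is to spawn the documented CLI as a child process from the UI code. The sketch below only assembles the command; `build_accelerate_command` and `run_training` are hypothetical names, and it assumes `accelerate` is on the PATH and "accelerate config" has already been run. The script name and flags are taken from the command quoted earlier in the thread.

```python
import shlex
import subprocess


def build_accelerate_command(script, script_args, accelerate_args=()):
    """Build the argv list for `accelerate launch [opts] <script> <args>`."""
    return ["accelerate", "launch", *accelerate_args, script, *script_args]


def run_training(cmd):
    """Run the command as a child process and return its exit code."""
    print("Running:", shlex.join(cmd))
    # check=False: let the caller decide what to do with a failure.
    return subprocess.run(cmd, check=False).returncode


cmd = build_accelerate_command(
    "train_dreambooth.py",
    ["--resolution=512", "--train_batch_size=1", "--mixed_precision=fp16"],
)
```

Running training in a child process also means an out-of-memory kill would take down only the trainer, not the web UI, at the cost of having to pass settings through command-line arguments.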
I see. On my side I am stuck trying to make it work manually... until I can do that even the UI won't work. I have done all the installation and config but when I try to run things I get:
[2022-10-10 09:51:23,004] [INFO] [utils.py:827:see_memory_usage] Before initializing optimizer states
[2022-10-10 09:51:23,005] [INFO] [utils.py:828:see_memory_usage] MA 1.66 GB Max_MA 1.66 GB CA 3.27 GB Max_CA 3 GB
[2022-10-10 09:51:23,005] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 7.72 GB, percent = 49.5%
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 9298) of binary: /home/bernard/anaconda3/envs/diffusers/bin/python
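For reading the last line of that log: a negative exit code conventionally means the process was terminated by the signal with that number, and signal 9 on Linux is SIGKILL. With memory this tight, the kernel's out-of-memory killer is a likely cause, though the log alone does not prove it. A minimal sketch for decoding such codes (`describe_exit_code` is a hypothetical helper):

```python
import signal


def describe_exit_code(code: int) -> str:
    """Describe a process exit code, naming the signal if it is negative."""
    if code < 0:
        try:
            name = signal.Signals(-code).name
        except ValueError:
            # Signal number not defined on this platform.
            name = f"signal {-code}"
        return f"terminated by {name}"
    return f"exited with status {code}"


print(describe_exit_code(-9))
```

On Linux this reports SIGKILL for -9. Checking the kernel log (for example with dmesg) is the usual way to confirm whether the OOM killer was responsible.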
Give the latest commit I just made a try. Be sure to set --medvram in the COMMANDLINE_ARGS of your launch script, or set it however you like. I wired in the "notebook_launcher" function from Accelerate, and then forced it to run only on CPU if medvram or lowvram is set.
Haven't verified that it trains myself, yet...but my indicator of early success has been how long the "caching latents" portion takes. If it goes fast, it's gonna OOM. If it's running slow (as it is now), then training will run after that call.
The good news:
I can make it work on an 8GB GPU now, from the UI.
The bad news: It's abysmally slow, seemingly more so than when I run it manually. I suspect there are other things that can be done to make it faster...but I'll need to futz with it more.
Also, still no progress bar, no way to interrupt/resume training, and no preview in the UI. But, hey, it will run. Progress!
The latest version fails as soon as I hit train with:
return torch.cuda.is_available()
[Previous line repeated 983 more times]
RecursionError: maximum recursion depth exceeded
Yeah, my bad. Dumb coding error. Fixed already, do another pull.
Hummm... when using --medvram I get:
NVIDIA GeForce RTX 3060 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA GeForce RTX 3060 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
I guess this is not supposed to be like that...
Hummm... to pass the --medvram I need to use python launch --medvram... and this is what gives the cuda error.
I usually just run bash webui.sh to launch webui but that one does not pass parameters...
Sorry, my earlier comment was supposed to be a reply to Thomas with the launch commands. For me it's OOM during the accelerate.prepare call in the code.
The proper way to set env args for our little project is like so:
OK, I sorted out my --medvram issue. Now, when I try training I get:
Starting Dreambooth training...
Launching training on CPU.
***** Running training *****
Num examples = 24
Num batches each epoch = 24
Num Epochs = 209
Instantaneous batch size per device = 1
Total train batch size (w. parallel, distributed & accumulation) = 1
Gradient Accumulation steps = 1
Total optimization steps = 5000
Steps: 0%| | 0/5000 [00:00<?, ?it/s]
Killed
How are you running Stable-Diffusion? You shouldn't need any "launch" or "accelerate" commands. Put the --medvram flag in your webui-user.sh file, run that.
OK... launching it like that worked for one step then killed itself:
Starting Dreambooth training...
Launching training on CPU.
***** Running training *****
Num examples = 24
Num batches each epoch = 24
Num Epochs = 209
Instantaneous batch size per device = 1
Total train batch size (w. parallel, distributed & accumulation) = 1
Gradient Accumulation steps = 1
Total optimization steps = 5000
Steps: 0%| | 1/5000 [00:57<79:56:12, 57.57s/it, loss=0.116, lr=5e-6]webui.sh: line 141: 11609 Killed "${python_cmd}" "${LAUNCH_SCRIPT}"
That is...odd. It looks like something via the web UI is killing the process?? IDK if there's some memory management thing implemented? It's working for me on windoze. I wonder if @AUTOMATIC1111 knows anything about this "Killed" message.
But even if this worked… at 54sec per step this would be a no go…
Hey, Rome wasn't built in a day, and something is better than nothing. Did I say it was fast? Nope. Did I say it was possible? Yep.
Am I hoping that people who know more about pytorch and all this jazz will be able to take my slow, basic implementation and further optimize it to be useable? Yes.
It's a proof-of-concept, and as it's currently the best solution I've found that doesn't require buying a new GPU or renting rack space somewhere, I'll take it.
Plus, in an earlier test done the manual way, I was able to train ~4000 steps in about 8 hours. Again, not fantastic, but it was faster than this. So, still a WIP.
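To put those numbers side by side, here is a small back-of-the-envelope calculation. It uses only figures quoted in this thread (57.57 s/it and 5000 total steps from the log, ~4000 steps in about 8 hours for the manual run); the function names are just for illustration.

```python
def seconds_per_step(total_hours: float, steps: int) -> float:
    """Average time per training step, in seconds."""
    return total_hours * 3600 / steps


def hours_for(steps: int, sec_per_step: float) -> float:
    """Total wall-clock hours for a run at a fixed per-step time."""
    return steps * sec_per_step / 3600


manual_rate = seconds_per_step(8, 4000)      # about 7.2 s per step
ui_hours = hours_for(5000, 57.57)            # about 80 hours
manual_hours = hours_for(5000, manual_rate)  # about 10 hours

print(f"manual: {manual_rate:.1f} s/step, {manual_hours:.0f} h for 5000 steps")
print(f"web UI: 57.57 s/step, {ui_hours:.0f} h for 5000 steps")
```

So the UI path as logged is roughly eight times slower per step than the manual run, which matches the roughly 80-hour estimate in the progress bar.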