stable-diffusion-webui

Dreambooth

Open d8ahazard opened this issue 1 year ago • 139 comments

Add basic UI implementation and stuff to unpack a selected checkpoint and then use it with Dreambooth.

There's also code to re-merge the output with said selected checkpoint, but I can't currently test with my potato because I don't know how to incorporate the necessary "accelerate launch" command to make it only run on GPU.

@AUTOMATIC1111 - Need help with this bit. It's useless to me if I can't get the accelerate launch stuff to work so I can force it just to my GPU, unless you know some other magick to make it work with 8GB.

d8ahazard avatar Oct 08 '22 21:10 d8ahazard

Also, @AUTOMATIC1111, if you could check your reddit, I sent you a PM.

d8ahazard avatar Oct 08 '22 21:10 d8ahazard

Naive question… but what does this PR allow users to do? Have you found a way to separate the Dreambooth “changes” and apply them on top of other CKPTs?

or is this to create dreambooth models via webui?

bmaltais avatar Oct 09 '22 12:10 bmaltais

Naive question… but what does this PR allow users to do? Have you found a way to separate the Dreambooth “changes” and apply them on top of other CKPTs?

or is this to create dreambooth models via webui?

It should do all the things. First, you point it at an existing checkpoint, even a custom one.

Then, it'll extract the diffusion models for that checkpoint and set up a working directory for training.

Once set up, you tell it where your training images are, your input prompt, and your "classification" prompt. Set the number of training steps, and let it rip.

I don't have the progress bar, "intermediary images", or "save a checkpoint every N steps" bits added yet, but in theory, it should work to train. I can get it to throw an OOM error, which is what I'd expect since I'm not forcing it to run on my CPU yet.

BUT, once done, it should then take the Dreambooth-generated files and merge them into the selected checkpoint, saving it alongside the others.

Since I'm getting OOM errors and can't use it yet, I can't verify I have the "build a new checkpoint" parts right, but if there is a bug/mistake there, it should be fairly trivial to fix.

d8ahazard avatar Oct 09 '22 13:10 d8ahazard

Is this supporting the 12gb VRAM GPUs or restricted to 3090 and better? I have a 12GB GPU... this is why I am asking.

UPDATE:

I answered my own question... A 3060 with 12GB won't cut it:

image

But this looks like a nice PR for those with a 3090.

bmaltais avatar Oct 09 '22 14:10 bmaltais

I'll try to see if I can get it working with a 3090 and add some of the missing features. I will edit this comment just in case I don't get anywhere before Tues.

Notes for myself:

  • save_data_every can be 0 (disabled)
  • wrap_gradio_call func can return None (at least it is for me, will need to play in an ipython embed a bit)
  File "/home/unknown/Development/stable-diffusion-webui/modules/dreambooth/dreambooth.py", line 386, in train
    if not global_step % self.save_data_every:
ZeroDivisionError: integer division or modulo by zero

Traceback (most recent call last):
  File "/home/unknown/Development/stable-diffusion-webui/modules/ui.py", line 188, in f
    res = list(func(*args, **kwargs))
TypeError: 'NoneType' object is not iterable
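
Both notes can be guarded against with a couple of small checks. The sketch below is only an illustration: `should_save` and `as_result_list` are hypothetical helper names, not functions from this PR. Only `save_data_every`, `global_step`, and the `list(func(...))` call come from the tracebacks above.

```python
def should_save(global_step: int, save_data_every: int) -> bool:
    """Return True when a save is due at this step.

    A value of 0 (or less) is treated as "saving disabled", so the modulo
    that raised ZeroDivisionError in the traceback is never reached.
    """
    if save_data_every <= 0:
        return False
    return global_step % save_data_every == 0


def as_result_list(result):
    """Normalize a callback's return value before the UI iterates over it.

    list(None) raises TypeError, so a None return becomes an empty list.
    """
    if result is None:
        return []
    return list(result)
```

With these in place, `should_save(step, 0)` is simply always false, and a training function that returns nothing no longer crashes the wrapper.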

mcd1992 avatar Oct 09 '22 19:10 mcd1992

To work on a GPU with 12GB you need to use DeepSpeed.

accelerate launch --use_deepspeed --zero_stage=2 --gradient_accumulation_steps=1 --offload_param_device=cpu --offload_optimizer_device=cpu train_dreambooth.py \
  --pretrained_model_name_or_path=$MODEL_NAME --use_auth_token \
  --instance_data_dir=$INSTANCE_DIR \
  --class_data_dir=$CLASS_DIR \
  --output_dir=$OUTPUT_DIR \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --instance_prompt="a photo of sks dog" \
  --class_prompt="a photo of dog" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --gradient_checkpointing \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --num_class_images=200 \
  --max_train_steps=800 \
  --sample_batch_size=2 \
  --mixed_precision=fp16

This is from pinkred's comment on the diffusers patch - https://github.com/huggingface/diffusers/pull/735

Note that TTL had to also do explicit casts rather than relying on auto to ensure that everything stayed 16bit.

Thomas-MMJ avatar Oct 09 '22 19:10 Thomas-MMJ

In hindsight, it might be better to just have diffusers as an optional dependency in repositories/, like xformers is, instead of redistributing 2 py files from it in the repo.

mcd1992 avatar Oct 09 '22 22:10 mcd1992

In hindsight, it might be better to just have diffusers as an optional dependency in repositories/, like xformers is, instead of redistributing 2 py files from it in the repo.

I'm only using one file from the HF repo, and it's pretty heavily modified, so not really redistributed...

d8ahazard avatar Oct 09 '22 22:10 d8ahazard

Yeah sorry but this doesn't work for a bunch of people, exactly why is uncertain but it's OOM on my 3080 10GB with 64GB of RAM. (The TTL implementation is supposed to run at 8gb per his account)

devilismyfriend avatar Oct 10 '22 08:10 devilismyfriend

Perhaps you could integrate those changes… they apparently allow it to run on 8GB: https://www.reddit.com/r/StableDiffusion/comments/xzbc2h/guide_for_dreambooth_with_8gb_vram_under_windows/?utm_source=share&utm_medium=ios_app&utm_name=iossmf

bmaltais avatar Oct 10 '22 10:10 bmaltais

Yeah sorry but this doesn't work for a bunch of people, exactly why is uncertain but it's OOM on my 3080 10GB with 64GB of RAM. (The TTL implementation is supposed to run at 8gb per his account)

Weird, it's almost like I mentioned in my initial commit that I currently can't get this version to run due to OOM errors, which is specifically why I'm asking for help with the "accelerate launch" commands needed to make it run under 8GB. :P

d8ahazard avatar Oct 10 '22 12:10 d8ahazard

Perhaps you could integrate those changes… they apparently allow it to run on 8GB: https://www.reddit.com/r/StableDiffusion/comments/xzbc2h/guide_for_dreambooth_with_8gb_vram_under_windows/?utm_source=share&utm_medium=ios_app&utm_name=iossmf

"accelerate config" is literally how I have the stand-alone version running, on windows, on 8GB right now. It's why I chose the base diffusers repo, and it's what I'm asking @AUTOMATIC1111 or anybody else for a bit of help with. ;)

image

d8ahazard avatar Oct 10 '22 12:10 d8ahazard

I will try the manual method today and then poke at things to see if I can figure something out once I get things running manually. I have close to zero Python experience, so not much hope, but who knows.

bmaltais avatar Oct 10 '22 12:10 bmaltais

OK... I see what you are talking about.. the issue is that the activation can't be done using the python script... and this is what is causing the issue. Just for a test... what if activation was done before starting webui? Would that solve this issue?

bmaltais avatar Oct 10 '22 13:10 bmaltais

OK... I see what you are talking about.. the issue is that the activation can't be done using the python script... and this is what is causing the issue. Just for a test... what if activation was done before starting webui? Would that solve this issue?

What do you mean by "activation"? It would be up to the user to run "accelerate config" to set the required params (or maybe we do it with a script, launch.py, etc.). The bit I need to understand is how I can run "accelerate launch" from within the UI, rather than from the command line as it's documented. I think it's possible, but I haven't tested it yet.
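
One way to drive "accelerate launch" from inside a Python process is to spawn it as a child process. This is a minimal sketch under assumptions, not what the PR does: `build_accelerate_command` and `launch_training` are hypothetical names, and the flags in the usage example are copied from the command quoted earlier in this thread.

```python
import subprocess
from typing import List, Sequence


def build_accelerate_command(
    script: str,
    launcher_args: Sequence[str] = (),
    script_args: Sequence[str] = (),
) -> List[str]:
    """Assemble argv as: accelerate launch <launcher args> <script> <script args>."""
    return ["accelerate", "launch", *launcher_args, script, *script_args]


def launch_training(command: Sequence[str]) -> int:
    """Run the command as a child process and return its exit code.

    Assumes the `accelerate` executable is on PATH in the web UI's environment.
    """
    completed = subprocess.run(list(command), check=False)
    return completed.returncode


if __name__ == "__main__":
    cmd = build_accelerate_command(
        "train_dreambooth.py",
        launcher_args=["--use_deepspeed", "--zero_stage=2"],
        script_args=["--resolution=512", "--train_batch_size=1"],
    )
    # Only print the command here; call launch_training(cmd) to actually run it.
    print(" ".join(cmd))
```

Running training in a child process means its memory is released when it exits, at the cost of losing direct access to the web UI's in-process state.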

d8ahazard avatar Oct 10 '22 13:10 d8ahazard

I see. On my side I am stuck trying to make it work manually... until I can do that even the UI won't work. I have done all the installation and config but when I try to run things I get:

[2022-10-10 09:51:23,004] [INFO] [utils.py:827:see_memory_usage] Before initializing optimizer states
[2022-10-10 09:51:23,005] [INFO] [utils.py:828:see_memory_usage] MA 1.66 GB         Max_MA 1.66 GB         CA 3.27 GB         Max_CA 3 GB
[2022-10-10 09:51:23,005] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 7.72 GB, percent = 49.5%
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 9298) of binary: /home/bernard/anaconda3/envs/diffusers/bin/python
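
The `exitcode: -9` in the last line is worth decoding. For child processes on POSIX systems, a negative exit code conventionally means the process was terminated by that signal number, and signal 9 is SIGKILL. On Linux that is often the kernel's out-of-memory killer, though nothing in this log proves it. The "used" and "percent" figures also allow a rough estimate of total system RAM. A small sketch, where `describe_exit_code` and `estimate_total_ram_gb` are hypothetical helper names:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a subprocess-style exit code into a readable description."""
    if code < 0:
        try:
            name = signal.Signals(-code).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {-code}"
        return f"terminated by {name}"
    return f"exited with status {code}"


def estimate_total_ram_gb(used_gb: float, percent_used: float) -> float:
    """Back out total RAM from a 'used = X GB, percent = Y%' log line (rough)."""
    return used_gb / (percent_used / 100.0)


print(describe_exit_code(-9))
print(round(estimate_total_ram_gb(7.72, 49.5), 1))
```

The second figure comes out at about 15.6 GB, so with both parameters and optimizer state offloaded to the CPU, running out of system memory is at least a plausible explanation.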

bmaltais avatar Oct 10 '22 13:10 bmaltais

Give the latest commit I just made a try. Be sure to set --medvram in the COMMANDLINE_ARGS of your launch script, or set it however you prefer. I wired in the "notebook_launcher" function from Accelerate, and then forced it to run only on CPU if medvram or lowvram is set.

Haven't verified that it trains myself, yet...but my indicator of early success has been how long the "caching latents" portion takes. If it goes fast, it's gonna OOM. If it's running slow (as it is now), then training will run after that call.
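
The CPU fallback described above can be sketched roughly as follows. This is an illustration under assumptions, not the code from the commit: `CmdOpts` is a hypothetical stand-in for the web UI's parsed flags, and `pick_training_device` is a made-up helper name.

```python
from dataclasses import dataclass


@dataclass
class CmdOpts:
    """Stand-in for the web UI's parsed command-line flags (hypothetical)."""
    medvram: bool = False
    lowvram: bool = False


def pick_training_device(opts: CmdOpts, cuda_available: bool) -> str:
    """Choose where training runs.

    Either low-memory flag forces CPU training; otherwise use the GPU if present.
    """
    if opts.medvram or opts.lowvram:
        return "cpu"
    return "cuda" if cuda_available else "cpu"
```

In real code the `cuda_available` argument would typically come from `torch.cuda.is_available()`.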

d8ahazard avatar Oct 10 '22 14:10 d8ahazard

The good news:

I can make it work on an 8GB GPU now, from the UI.

image

The bad news: It's abysmally slow, seemingly more so than when I run it manually. I suspect there are other things that can be done to make it faster...but I'll need to futz with it more.

Also, still no progress bar, no way to interrupt/resume training, and no preview in the UI. But, hey, it will run. Progress!

d8ahazard avatar Oct 10 '22 14:10 d8ahazard

The latest version fails as soon as I hit train with:

    return torch.cuda.is_available()
  [Previous line repeated 983 more times]
RecursionError: maximum recursion depth exceeded

bmaltais avatar Oct 10 '22 14:10 bmaltais

Yeah, my bad. Dumb coding error. Fixed already, do another pull.
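
For anyone wondering how a one-line `torch.cuda.is_available()` call can recurse: the thread does not say what the actual bug was, but one common way to produce this kind of traceback is a monkeypatch whose replacement looks the function up by its patched name and so calls itself. The sketch below reproduces that pattern on a stand-in namespace (`fake_cuda` is hypothetical, so the example runs without PyTorch) and shows the usual fix of keeping a reference to the original.

```python
import types

# Stand-in for a module such as torch.cuda (hypothetical; keeps the sketch dependency-free).
fake_cuda = types.SimpleNamespace(is_available=lambda: True)


def broken_patch():
    # BUG: once installed, fake_cuda.is_available *is* this function,
    # so this line calls itself until the recursion limit is hit.
    return fake_cuda.is_available()


def make_safe_patch(original, force_cpu: bool):
    """Build a replacement that defers to the saved original function."""
    def patched():
        if force_cpu:
            return False
        return original()
    return patched


original = fake_cuda.is_available  # keep a handle on the real function first

fake_cuda.is_available = broken_patch
try:
    fake_cuda.is_available()
    recursed = False
except RecursionError:
    recursed = True

fake_cuda.is_available = make_safe_patch(original, force_cpu=True)
print(recursed, fake_cuda.is_available())
```

The key design point is capturing `original` before reassigning the attribute, so the wrapper never resolves the name back to itself.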

d8ahazard avatar Oct 10 '22 14:10 d8ahazard

Hummm... when using --medvram I get:

NVIDIA GeForce RTX 3060 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA GeForce RTX 3060 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

I guess this is not supposed to be like that...

bmaltais avatar Oct 10 '22 14:10 bmaltais

Hummm... to pass the --medvram I need to use python launch --medvram... and this is what gives the cuda error.

I usually just run bash webui.sh to launch webui but that one does not pass parameters...

bmaltais avatar Oct 10 '22 14:10 bmaltais

Yeah sorry but this doesn't work for a bunch of people, exactly why is uncertain but it's OOM on my 3080 10GB with 64GB of RAM. (The TTL implementation is supposed to run at 8gb per his account)

Weird, it's almost like I mentioned in my initial commit that I currently can't get this version to run due to OOM errors, which is specifically why I'm asking for help with the "accelerate launch" commands needed to make it run under 8GB. :P

Sorry, it was supposed to be a reply to Thomas with the launch commands. For me it's OOM during the accelerator.prepare call in the code.

devilismyfriend avatar Oct 10 '22 15:10 devilismyfriend

Hummm... to pass the --medvram I need to use python launch --medvram... and this is what gives the cuda error.

I usually just run bash webui.sh to launch webui but that one does not pass parameters...

The proper way to set env args for our little project is like so:

image

d8ahazard avatar Oct 10 '22 15:10 d8ahazard

OK, I sorted out my --medvram issue. Now, when I try training I get:

Starting Dreambooth training...
Launching training on CPU.
***** Running training *****
  Num examples = 24
  Num batches each epoch = 24
  Num Epochs = 209
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 1
  Gradient Accumulation steps = 1
  Total optimization steps = 5000
Steps:   0%|                                                                                   | 0/5000 [00:00<?, ?it/s]
Killed

bmaltais avatar Oct 10 '22 15:10 bmaltais

How are you running Stable-Diffusion? You shouldn't need any "launch" or "accelerate" commands. Put the --medvram flag in your webui-user.sh file, run that.

d8ahazard avatar Oct 10 '22 15:10 d8ahazard

OK... launching it like that worked for one step then killed itself:

Starting Dreambooth training...
Launching training on CPU.
***** Running training *****
  Num examples = 24
  Num batches each epoch = 24
  Num Epochs = 209
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 1
  Gradient Accumulation steps = 1
  Total optimization steps = 5000
Steps:   0%|                                                   | 1/5000 [00:57<79:56:12, 57.57s/it, loss=0.116, lr=5e-6]webui.sh: line 141: 11609 Killed                  "${python_cmd}" "${LAUNCH_SCRIPT}"

bmaltais avatar Oct 10 '22 15:10 bmaltais

That is...odd. It looks like something via the web UI is killing the process?? IDK if there's some memory management thing implemented? It's working for me on windoze. I wonder if @AUTOMATIC1111 knows anything about this "Killed" message.

d8ahazard avatar Oct 10 '22 15:10 d8ahazard

But even if this worked… at 54 sec per step this would be a no go…

bmaltais avatar Oct 10 '22 15:10 bmaltais

But even if this worked… at 54 sec per step this would be a no go…

Hey, Rome wasn't built in a day, and something is better than nothing. Did I say it was fast? Nope. Did I say it was possible? Yep.

Am I hoping that people who know more about pytorch and all this jazz will be able to take my slow, basic implementation and further optimize it to be useable? Yes.

It's a proof-of-concept, and as it's currently the best solution I've found that doesn't require buying a new GPU or renting rack space somewhere, I'll take it.

Plus, in an earlier test with the manual way, I was able to train ~4000 steps in about 8 hours. Again, not fantastic, but it was faster than this. So, still a WIP.
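
To put the two speeds side by side, here is the arithmetic using only figures quoted in this thread: roughly 4000 steps in 8 hours for the manual run, and 57.57 s/it over 5000 steps for the web UI run. The helper names are made up for illustration.

```python
def seconds_per_step(total_hours: float, steps: int) -> float:
    """Average time per step for a run of known length."""
    return total_hours * 3600 / steps


def hours_for_run(sec_per_step: float, steps: int) -> float:
    """Projected wall-clock hours for a run at a fixed step time."""
    return sec_per_step * steps / 3600


manual = seconds_per_step(8, 4000)        # manual run: ~4000 steps in ~8 hours
webui_hours = hours_for_run(57.57, 5000)  # web UI run: 57.57 s/it, 5000 steps

print(f"manual: {manual:.1f} s/step")
print(f"web UI: {webui_hours:.0f} hours, {57.57 / manual:.1f}x slower per step")
```

That works out to about 7.2 seconds per step for the manual run, against roughly 80 hours for the full 5000 steps through the web UI, or about 8 times slower per step.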

d8ahazard avatar Oct 10 '22 16:10 d8ahazard