Dreambooth-Stable-Diffusion DreamBooth Stable Diffusion training now possible in 10 GB VRAM, and it runs about 2 times faster.

Hey, I managed to run Stable Diffusion DreamBooth training in just 17.7 GB of GPU memory by replacing the attention with the memory-efficient flash attention from xformers. Along with using far less memory, it also runs about 2 times faster, so it is now possible to train SD on 24 GB GPUs. Tested on an Nvidia A10G; training took 15-20 minutes. I hope it's helpful.
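For readers wondering why swapping the attention saves memory, here is a minimal pure-Python sketch of the general idea behind memory-efficient attention: keys and values are visited in chunks while a running softmax is maintained, so the full score matrix is never stored at once. This is not the xformers kernel (that is a fused CUDA implementation); the function names and the chunk size are made up for illustration.

```python
import math


def naive_attention(q, k, v):
    """Reference attention: builds the full list of scores for every query."""
    scale = 1.0 / math.sqrt(len(q[0]))
    dim_v = len(v[0])
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[d] for w, vj in zip(weights, v)) / total
                    for d in range(dim_v)])
    return out


def chunked_attention(q, k, v, chunk_size=2):
    """Same result, but only `chunk_size` scores are alive at any time."""
    scale = 1.0 / math.sqrt(len(q[0]))
    dim_v = len(v[0])
    out = []
    for qi in q:
        running_max = -math.inf   # largest score seen so far
        running_sum = 0.0         # softmax denominator relative to running_max
        acc = [0.0] * dim_v       # weighted sum of values relative to running_max
        for start in range(0, len(k), chunk_size):
            k_chunk = k[start:start + chunk_size]
            v_chunk = v[start:start + chunk_size]
            scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k_chunk]
            new_max = max(running_max, max(scores))
            # Rescale everything accumulated so far to the new maximum.
            correction = math.exp(running_max - new_max)
            running_sum *= correction
            acc = [a * correction for a in acc]
            for s, vj in zip(scores, v_chunk):
                w = math.exp(s - new_max)
                running_sum += w
                acc = [a + w * x for a, x in zip(acc, vj)]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out
```

Both functions return the same values up to floating-point rounding; the chunked one simply trades the full score row for a few running statistics, which is roughly where the memory saving comes from.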

Code in my fork: https://github.com/ShivamShrirao/diffusers/blob/main/examples/dreambooth/

Can even train on batch size of 2.

With some more tweaks it might be possible to train even on 16 GB GPUs.

And it works, Outputs: Me in Fortnite

https://github.com/huggingface/diffusers/pull/554#issuecomment-1258751183

Sep 26 '22 23:09 ShivamShrirao

Very cool. Doing what I can for 16gb too.

Sep 27 '22 07:09 TemporalLabsLLC-SOL

I'm running into issues with it finding the gpus I think. 4xA10G. I'll post code tomorrow.

Sep 27 '22 10:09 TemporalLabsLLC-SOL

Wow, using the 8-bit Adam optimizer from bitsandbytes along with xformers reduces the memory usage to 12.5 GB.

Colab: https://colab.research.google.com/github/ShivamShrirao/diffusers/blob/main/examples/dreambooth/DreamBooth_Stable_Diffusion.ipynb

Code: https://github.com/ShivamShrirao/diffusers/blob/main/examples/dreambooth/
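A rough back-of-the-envelope check of why the 8-bit optimizer helps so much. Adam-style optimizers keep two state values per trainable parameter; storing them in 8 bits instead of 32 bits shrinks that state to about a quarter. The parameter count below is an assumption (roughly the size usually quoted for the SD v1 UNet), and the small overhead of the quantization constants is ignored.

```python
# Assumption: approximate number of trainable UNet parameters in SD v1.
UNET_PARAMS = 860_000_000


def adam_state_gb(num_params, bytes_per_value):
    """Memory for the two Adam state tensors (first and second moment), in GB."""
    return 2 * num_params * bytes_per_value / 1e9


fp32_states = adam_state_gb(UNET_PARAMS, 4)   # regular 32-bit optimizer state
int8_states = adam_state_gb(UNET_PARAMS, 1)   # 8-bit optimizer state
saved = fp32_states - int8_states

print(f"32-bit state: {fp32_states:.2f} GB")
print(f"8-bit state:  {int8_states:.2f} GB")
print(f"saved:        {saved:.2f} GB")
```

Under these assumptions the saving comes out at about 5 GB, which is in the same ballpark as the drop from 17.7 GB to 12.5 GB reported above.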

Sep 27 '22 13:09 ShivamShrirao

There is no such file. 404 Client Error: Entry Not Found for url: https://huggingface.co/CompVis/stable-diffusion-v1-4/resolve/main/config.json

Edit: Issue resolved.

Sep 27 '22 13:09 Daniel-Kelvich

Do you have a donation link? I don't have much, but you are doing great work.

Sep 27 '22 15:09 Mistborn-First-Era

> Do you have a donation link? I don't have much, but you are doing great work.

Hey, Thanks. No donation link haha. Good to hear you liked it. It has been quite fun to do for me.

Sep 27 '22 18:09 ShivamShrirao

@ShivamShrirao I've been trying to run your notebook on Runpod with PyTorch and an A5000, but I'm getting an error during pip install: "Building wheel for xformers (setup.py) ... error". Training starts with a bitsandbytes bug report but runs, and eventually crashes after 20 minutes of training.

I'd also love to donate if I can get this working.

Sep 27 '22 18:09 pdjohntony

> There is no such file. 404 Client Error: Entry Not Found for url: https://huggingface.co/CompVis/stable-diffusion-v1-4/resolve/main/config.json
>
> Edit: Issue resolved.

@Daniel-Kelvich How did you fix this?

Sep 27 '22 20:09 pdjohntony

@pdjohntony What error are you facing? If it is the 404, it may be because you are not authenticated with the Hugging Face CLI.

Sep 27 '22 20:09 ShivamShrirao

@ShivamShrirao I managed to get your DreamBooth example working, but it's been running for 2 hours now on an A5000.

Since that's taking so long, I spun up another instance on vast with 2 A5000s, but now I'm getting the 404. It shouldn't be an auth issue with Hugging Face, as I logged in on the CLI and it appeared to download the model for a while before hitting this 404 error.

The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_cpu_threads_per_process` was set to `24` to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Traceback (most recent call last):
  File "/opt/conda/envs/ldm/lib/python3.8/site-packages/transformers/configuration_utils.py", line 596, in _get_config_dict
    resolved_config_file = cached_path(
  File "/opt/conda/envs/ldm/lib/python3.8/site-packages/transformers/utils/hub.py", line 282, in cached_path
    output_path = get_from_cache(
  File "/opt/conda/envs/ldm/lib/python3.8/site-packages/transformers/utils/hub.py", line 486, in get_from_cache
    _raise_for_status(r)
  File "/opt/conda/envs/ldm/lib/python3.8/site-packages/transformers/utils/hub.py", line 409, in _raise_for_status
    raise EntryNotFoundError(f"404 Client Error: Entry Not Found for url: {request.url}")
transformers.utils.hub.EntryNotFoundError: 404 Client Error: Entry Not Found for url: https://huggingface.co/CompVis/stable-diffusion-v1-4/resolve/main/config.json

Sep 27 '22 20:09 pdjohntony

Great work! I managed to run it in a google colab. I was just wondering, how do I get checkpoint files that I can use later on from the model files that are stored?

I could only find the feature_extractor, logs, model_index.json, safety_checker, scheduler, text_encoder, tokenizer, unet and vae folders/files that were stored in --output_dir=$OUTPUT_DIR after training finished.

Sep 27 '22 21:09 roar-emaus

@roar-emaus These are the weights in diffusers format. I have added an inference example to the Colab showing how to use them with diffusers. For other repos you will need to convert them.
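A minimal sketch (not from the notebook) for checking that an output folder looks like a complete diffusers-format model before trying to load it. The list of expected entries is taken from the folder listing quoted above; `logs` is left out on the assumption that it only holds training logs. The function name is made up for illustration.

```python
from pathlib import Path

# Entries from the output listing quoted in the thread.
# Assumption: "logs" is not needed for inference, so it is not checked.
EXPECTED_ENTRIES = [
    "model_index.json",
    "feature_extractor",
    "safety_checker",
    "scheduler",
    "text_encoder",
    "tokenizer",
    "unet",
    "vae",
]


def missing_components(model_dir):
    """Return the expected entries that are absent from `model_dir`."""
    root = Path(model_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


# With diffusers installed, a complete folder is typically loaded with
# StableDiffusionPipeline.from_pretrained(model_dir); that step is not run here.
```

If `missing_components("/content/models/sks")` returns an empty list, the folder as a whole is the model and can be copied somewhere safe.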

Sep 27 '22 21:09 ShivamShrirao

> @roar-emaus These are the diffuser version of weights. I have added an inference example in colab on how to use them in diffusers. For others you will need to convert them.

Thank you! will test it tomorrow :)

Sep 27 '22 21:09 roar-emaus

Finally got it to work! How can we reuse the model in a Stable Diffusion Colab, @ShivamShrirao? I have used the inference cell, but how do I save my model? I haven't even been able to find what folder it's in, lol. Any info on how to convert it into a ckpt? Great work!!

Sep 27 '22 22:09 Ai-Artsca

> finally got it to work, how can we use the model to reuse in a stable colab @ShivamShrirao ? I have used the inference but how do i save my model, i havent even been able to find what folder its in lol, any info on how to convert it into a ckpt?? great work !!

I haven't yet figured out how to convert to a single ckpt for use in other repos. Currently the whole folder is your model; you can save the whole folder until someone figures it out. This script needs to be reversed: https://github.com/huggingface/diffusers/blob/main/scripts/convert_original_stable_diffusion_to_diffusers.py

Sep 27 '22 22:09 ShivamShrirao

@ShivamShrirao If I'm reading things right, 8-bit AdamW should be a drop-in replacement, and the modified CrossAttention class looks like it should be able to replace the one in ldm/modules/attention.py in this repository. Sadly I can't test it myself because bitsandbytes has a C extension that uses CUDA and I'm on AMD.

Sep 27 '22 22:09 hopibel

Successfully trained one model, but on my second training run I'm getting an error, @ShivamShrirao:

Steps: 2% 18/1000 [00:56<45:45, 2.80s/it, loss=0.536, lr=5e-6]
Traceback (most recent call last):
  File "train_dreambooth.py", line 606, in <module>
    main()
  File "train_dreambooth.py", line 527, in main
    for step, batch in enumerate(train_dataloader):
  File "/usr/local/lib/python3.7/dist-packages/accelerate/data_loader.py", line 357, in __iter__
    next_batch = next(dataloader_iter)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 681, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 721, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "train_dreambooth.py", line 268, in __getitem__
    instance_image = Image.open(self.instance_images_path[index % self.num_instance_images])
  File "/usr/local/lib/python3.7/dist-packages/PIL/Image.py", line 2843, in open
    fp = builtins.open(filename, "rb")
IsADirectoryError: [Errno 21] Is a directory: '/content/data/sks/.ipynb_checkpoints'
Steps: 2% 18/1000 [00:56<51:30, 3.15s/it, loss=0.536, lr=5e-6]
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/launch.py", line 837, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/launch.py", line 354, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'train_dreambooth.py', '--pretrained_model_name_or_path=CompVis/stable-diffusion-v1-4', '--use_auth_token', '--instance_data_dir=/content/data/sks', '--class_data_dir=/content/data/gfx', '--output_dir=/content/models/sks', '--with_prior_preservation', '--instance_prompt=photo of sks gfx', '--class_prompt=photo of a gfx', '--resolution=512', '--use_8bit_adam', '--train_batch_size=1', '--gradient_accumulation_steps=1', '--learning_rate=5e-6', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--num_class_images=200', '--max_train_steps=1000']' returned non-zero exit status 1.
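The IsADirectoryError above points at a `.ipynb_checkpoints` folder inside the instance data directory, which Jupyter/Colab can create when the folder is browsed. Deleting that folder is one way out; another is to only pick up regular image files when listing the directory. A minimal sketch, assuming the dataset simply opens every entry in the directory (the function name and the suffix list are made up for illustration, not taken from train_dreambooth.py):

```python
from pathlib import Path

# Assumption: the image types worth keeping; adjust as needed.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def list_instance_images(instance_data_dir):
    """Return image files only, skipping folders such as .ipynb_checkpoints."""
    return sorted(
        path
        for path in Path(instance_data_dir).iterdir()
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    )
```

Passing a filtered list like this to the dataset instead of the raw directory listing would avoid handing a directory to `Image.open`.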

Sep 28 '22 02:09 Ai-Artsca

Very nice progress! Digging in more now

Sep 28 '22 02:09 TemporalLabsLLC-SOL

@pdjohntony Try updating the transformers library: pip install -U transformers

Sep 28 '22 06:09 Daniel-Kelvich

@ShivamShrirao I'm assuming you mean only the items in the imv folder make up the ckpt file, I deleted my colab and only saved those items to the google drive

Sep 28 '22 06:09 ClashSAN

@ShivamShrirao

In the Colab,

  --instance_prompt="photo of imv{CLASS_NAME}" \
  --class_prompt="photo of a {CLASS_NAME}" \

are not f-strings. Shouldn't they be?

cheers

Sep 28 '22 08:09 binarymind

@binarymind Not required here, because it executes as a shell command.

Sep 28 '22 08:09 ShivamShrirao

OK, thanks!

While running this cell I got the following output:

The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_processes` was set to a value of `1`
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--num_cpu_threads_per_process` was set to `32` to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
/opt/conda/lib/python3.7/site-packages/accelerate/accelerator.py:179: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
  warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
Fetching 16 files: 100%|█████████████████████| 16/16 [00:00<00:00, 13678.94it/s]
Generating class images:   0%|                           | 0/25 [00:00<?, ?it/s]FATAL: this function is for sm80, but was built for sm750
FATAL: this function is for sm80, but was built for sm750

My nvidia-smi output is the following:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.94       Driver Version: 470.94       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A6000    On   | 00000000:0F:00.0 Off |                  Off |
| 30%   27C    P8    26W / 300W |      1MiB / 48685MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I also tried running

%pip install git+https://github.com/facebookresearch/xformers@1d31a3a#egg=xformers

a few cells above, since it was not working. Currently stuck there.

Sep 28 '22 08:09 binarymind

Lol I fixed my problem by removing the f strings I added.... sorry

Edit: ah, nope, it was not that. I launched the notebook again on a new repo and the problem appeared again. Looking into it.

Sep 28 '22 08:09 binarymind

I'm hoping for a (fingers crossed not too distant) future version of this that can run on requirements of a 3080. Will put it into reach of many more people including myself. Keep up the great work!!

Sep 28 '22 11:09 TheChapster

I'm not having any success. Trying to use V100 on colab.

Generating class images:   0% 0/50 [00:06<?, ?it/s]
Traceback (most recent call last):
  File "train_dreambooth.py", line 606, in <module>
    main()
  File "train_dreambooth.py", line 362, in main
    images = pipeline(example["prompt"]).images
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 259, in __call__
    noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/diffusers/models/unet_2d_condition.py", line 254, in forward
    encoder_hidden_states=encoder_hidden_states,
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/diffusers/models/unet_blocks.py", line 565, in forward
    hidden_states = attn(hidden_states, context=encoder_hidden_states)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/diffusers/models/attention.py", line 155, in forward
    hidden_states = block(hidden_states, context=context)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/diffusers/models/attention.py", line 204, in forward
    hidden_states = self.attn1(self.norm1(hidden_states)) + hidden_states
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/diffusers/models/attention.py", line 288, in forward
    hidden_states = xformers.ops.memory_efficient_attention(query, key, value)
  File "/usr/local/lib/python3.7/dist-packages/xformers/ops.py", line 575, in memory_efficient_attention
    query=query, key=key, value=value, attn_bias=attn_bias, p=p
  File "/usr/local/lib/python3.7/dist-packages/xformers/ops.py", line 196, in forward_no_grad
    causal=isinstance(attn_bias, LowerTriangularMask),
  File "/usr/local/lib/python3.7/dist-packages/torch/_ops.py", line 143, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/launch.py", line 837, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/launch.py", line 354, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'train_dreambooth.py', '--pretrained_model_name_or_path=CompVis/stable-diffusion-v1-4', '--use_auth_token', '--instance_data_dir=/content/data/sks', '--class_data_dir=/content/data/dog', '--output_dir=/content/models/sks', '--with_prior_preservation', '--instance_prompt=photo of sks dog', '--class_prompt=photo of a dog', '--resolution=512', '--use_8bit_adam', '--train_batch_size=1', '--gradient_accumulation_steps=1', '--learning_rate=5e-6', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--num_class_images=200', '--max_train_steps=600']' returned non-zero exit status 1

Sep 28 '22 11:09 JoeMcGuire

@JoeMcGuire You will need to compile xformers yourself; the current wheels only support the T4 GPU.
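To make the GPU mismatch a bit more concrete, here is a small sketch that checks whether a wheel compiled for a given set of CUDA architectures can run on some of the GPUs mentioned in this thread. The compute capabilities in the table are the published values for those cards; the helper names are made up for illustration. The check uses the usual rule for precompiled kernels (same major version, built minor version not higher than the device's) and ignores PTX fallback, so treat it as a simplification.

```python
# Published CUDA compute capabilities for GPUs mentioned in the thread.
COMPUTE_CAPABILITY = {
    "P100": (6, 0),
    "V100": (7, 0),
    "T4": (7, 5),
    "A100": (8, 0),
    "A10G": (8, 6),
    "RTX A5000": (8, 6),
    "RTX A6000": (8, 6),
    "RTX 3060": (8, 6),
    "RTX 3080": (8, 6),
}


def binary_compatible(gpu_name, built_archs):
    """True if any precompiled arch in `built_archs` can run on `gpu_name`.

    Simplification: same major version and built minor <= device minor;
    PTX just-in-time compilation is not considered.
    """
    major, minor = COMPUTE_CAPABILITY[gpu_name]
    return any(b_major == major and b_minor <= minor
               for b_major, b_minor in built_archs)


def arch_list_for(gpu_name):
    """Arch string in the 'major.minor' form used by TORCH_CUDA_ARCH_LIST."""
    major, minor = COMPUTE_CAPABILITY[gpu_name]
    return f"{major}.{minor}"


T4_ONLY_WHEEL = [(7, 5)]
for gpu in ("T4", "V100", "RTX A6000"):
    print(gpu, binary_compatible(gpu, T4_ONLY_WHEEL), arch_list_for(gpu))
```

With PyTorch installed, `torch.cuda.get_device_capability()` reports the same pair for the active GPU. When building from source, setting the `TORCH_CUDA_ARCH_LIST` environment variable to the value from `arch_list_for` is the usual way to target a specific card, assuming the xformers build goes through PyTorch's extension builder.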

Sep 28 '22 11:09 ShivamShrirao

There are precompiled xformers for the P100 in this Colab (under "installing xformers"); how can they be incorporated into DreamBooth? That would cover Colab Pro: https://colab.research.google.com/github/TheLastBen/fast-stable-diffusion/blob/main/fast_stable_diffusion_AUTOMATIC1111.ipynb#scrollTo=a---cT2rwUQj

Also, how about an optional Google Drive cell to upload the trained model, plus a prune cell to get it down to 2 GB? If any of you compile a whl for the P100, please download it and store it in Google Drive to share.

Sep 28 '22 13:09 1blackbar

Yeah, right now it's not really usable in web UIs, and most people are on web UIs; Hugging Face love their bins. Also, the default of 600 steps is pretty bad. Not sure why that is the default; it should be at least 2000.

Sep 28 '22 15:09 1blackbar

Any chance of running this on a 12 GB RTX 3060? I'm getting a "Tried to allocate 4.00 GiB (GPU 0; 12.00 GiB total capacity; 4.81 GiB already allocated; 890.00 MiB free; 8.81 GiB reserved in total by PyTorch)" error even with the --use_8bit_adam flag.
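For anyone comparing numbers across cards, a small sketch that pulls the sizes out of an out-of-memory message like the one above and converts them to MiB, so the gap between what was requested and what was free is easy to see. The helper name is made up for illustration, and the patterns only cover the wording quoted here.

```python
import re

MIB_PER_UNIT = {"MiB": 1.0, "GiB": 1024.0}

# Patterns for the fields that appear in the message quoted above.
PATTERNS = {
    "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
    "total": r"([\d.]+) (MiB|GiB) total capacity",
    "allocated": r"([\d.]+) (MiB|GiB) already allocated",
    "free": r"([\d.]+) (MiB|GiB) free",
    "reserved": r"([\d.]+) (MiB|GiB) reserved",
}


def parse_oom(message):
    """Return the sizes named in the out-of-memory message, in MiB."""
    sizes = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, message)
        if match:
            sizes[key] = float(match.group(1)) * MIB_PER_UNIT[match.group(2)]
    return sizes


message = (
    "Tried to allocate 4.00 GiB (GPU 0; 12.00 GiB total capacity; "
    "4.81 GiB already allocated; 890.00 MiB free; "
    "8.81 GiB reserved in total by PyTorch)"
)
sizes = parse_oom(message)
shortfall = sizes["requested"] - sizes["free"]
print(f"requested {sizes['requested']:.0f} MiB, "
      f"free {sizes['free']:.0f} MiB, short by {shortfall:.0f} MiB")
```

For the message above, the single 4 GiB request exceeds the free memory by about 3.1 GiB, so a small tweak is unlikely to be enough on its own.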

Sep 28 '22 20:09 Blucknote
