Custom Diffusion implementation
A very rudimentary Custom Diffusion implementation
What is Custom Diffusion
Custom Diffusion is, in short, finetuning-lite plus TI. Instead of tuning the whole model, only the K and V projection matrices of the cross-attention blocks are tuned, simultaneously with the token embedding(s). It has speed and memory requirements similar to TI and reportedly gives better results in fewer steps.
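For intuition, here is a minimal sketch (not the actual training code) of what that parameter selection implies, assuming diffusers/ldm-style module names where `attn2` is the cross-attention block and `to_k`/`to_v` are its key/value projections:

```python
# A minimal sketch, not the actual implementation: freeze everything except the
# cross-attention key/value projections. The token embedding is trained alongside
# these, exactly like a normal TI embedding.
def select_kv_params(unet):
    trainable = []
    for name, param in unet.named_parameters():
        is_kv = "attn2" in name and (".to_k." in name or ".to_v." in name)
        param.requires_grad_(is_kv)
        if is_kv:
            trainable.append(param)
    return trainable
```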
How to use this
(WIP, will sort out the UI/UX later)
Training
Currently the Textual Inversion training UI is hijacked for this. Just train as you would a normal TI embedding. Under the training log directory, alongside name-steps.pt you should also see name-steps.kv.safetensors, which contains the KV weights (~50 MB at half precision, uncompressed).
The original study used learning_rate=8e-5 at batch_size=8, but in my tests I found that lr=1e-4 works better for my dataset (batch_size=1).
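For reference, a hedged sketch of what the separate learning rates and the .kv.safetensors dump look like in code. The variable `unet` is assumed to be an already-loaded Stable Diffusion UNet, the 5e-3 embedding lr and the embedding shape are placeholders, and the module names follow diffusers/ldm conventions:

```python
import torch
from safetensors.torch import save_file

# Sketch only: collect the cross-attention K/V projection weights.
kv_weights = {
    name: param
    for name, param in unet.named_parameters()
    if "attn2" in name and (".to_k." in name or ".to_v." in name)
}
embedding = torch.nn.Parameter(torch.zeros(1, 768))  # placeholder TI embedding

optimizer = torch.optim.AdamW([
    {"params": [embedding], "lr": 5e-3},                # embedding lr (placeholder)
    {"params": list(kv_weights.values()), "lr": 1e-4},  # K/V weight lr
])

# ... training loop ...

# Dump the tuned K/V weights at half precision; this is roughly the ~50 MB
# name-steps.kv.safetensors file mentioned above.
save_file({k: v.detach().half().contiguous().cpu() for k, v in kv_weights.items()},
          "name-steps.kv.safetensors")
```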
Using trained weights
Place the .safetensors files under models/pluggables (--pluggable-dir) and select them in the Stable Diffusion/Pluggable weights options in the settings tab. The trained weights are then swapped in, and you can use them to generate images. Use the token embedding like a normal TI embedding.
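Under the hood, "swapped in" just means loading the tensors from the file and overwriting the matching UNet parameters. A rough sketch, assuming the saved names match the model's parameter names (the path below is an example):

```python
from safetensors.torch import load_file

def plug_kv(unet, path):
    # The file maps parameter names (e.g. "...attn2.to_k.weight") to tensors,
    # so a non-strict load_state_dict overwrites just those weights.
    state = load_file(path)
    missing, unexpected = unet.load_state_dict(state, strict=False)
    return missing, unexpected

# plug_kv(unet, "models/pluggables/name-steps.kv.safetensors")
```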
Although the study claims that using the tuned token gives better results, in my limited tests dropping the token or replacing it with the untrained version doesn't change the generated image much.
Anyway, this area is under-explored, and I encourage you to try different settings and datasets.
Todo (roughly ordered by priority)
- [ ] UI/UX
- [ ] More testing
- [x] Separate lr for embedding and model weights
- [ ] Let users choose what weights to finetune
- [ ] Regularization
- [ ] Multi-concept training
Is this going to be an extension like the other Dreambooth implementations?
I could make it one. I didn't do it originally because the code change is quite light for now, but I guess that could change as features get added.
Could you explain how it differs from LoRA from a technical point of view? I'm genuinely interested. It looks like the only difference is the simultaneous training of TI? Also, LoRA finetunes an additional delta instead of the whole matrix, which allows switching and merging them? I wanted to mention decomposition, but the paper you provided states that CD is also able to use the same technique.
LoRA in general is a reparameterization of a matrix. cloneofsimo/lora applies it on top of Pivotal Tuning, which first finds the TI embedding and then finetunes the UNet and the text encoder.
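For concreteness, the usual LoRA parameterization looks like this:

```latex
% LoRA keeps the pretrained matrix W_0 frozen and learns a low-rank delta:
W = W_0 + \Delta W = W_0 + B A,
\qquad B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k)
```

Only A and B are trained, which is why the delta can be stored separately and swapped or merged cheaply, whereas CD trains (and ships) the full K/V matrices.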
Low-rank decomposition can be optionally applied after training.
I suppose LoRA could be applied to CD as well, but with so few parameters it might not be worth the hassle. It would also look suspiciously like hypernetworks if you did that.
CD switching is simple. For merging, they have a nice formula in section 3.2 of the paper.
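Roughly (paraphrasing the paper's notation from memory, so check section 3.2 for the exact statement), the merge is posed as a constrained least-squares problem over the combined K/V weights:

```latex
% Stay close to the pretrained W_0 on regularization captions C_reg while
% exactly matching each concept's finetuned outputs on its own captions.
\hat{W} = \arg\min_{W} \,
\lVert W C_{\mathrm{reg}}^{\top} - W_0 C_{\mathrm{reg}}^{\top} \rVert_F
\quad \text{s.t.} \quad C W^{\top} = V
```

where C stacks the target-concept text features from all the concepts being merged and V the corresponding outputs of each finetuned model; the paper also gives a closed-form solution.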
I'm continuing the development of this as an extension.