Stable Diffusers for studies

This is yet another Stable Diffusion toolkit, aiming to be functional, clean & compact enough for various experiments. There's no GUI here, as the target audience is creative coders rather than post-Photoshop users. The latter may check InvokeAI or Fooocus as convenient production suites, or ComfyUI for flexible node-based workflows.

The toolkit is built on top of the diffusers library, with occasional additions from the others mentioned below. The following codebases are partially included here (to ensure compatibility and ease of setup): CLIPseg, LPIPS.
There was also a similar repo (abandoned now), based on the CompVis and Stability AI libraries.

Current functions:

Fine-tuning with your images:

Other features:

  • Memory-efficient with xformers (high resolution on a 6 GB VRAM GPU)
  • Multi guidance technique for better interpolations
  • Self-attention guidance for better coherence and details
  • Use of special models: inpainting, SD v2, SDXL, Kandinsky
  • Masking with text via CLIPseg
  • Weighted multi-prompts (with brackets or numerical weights)
  • to be continued..

Setup

Install CUDA 11.8 if you're on Windows (it seems unnecessary on Linux with Conda).
Set up the Conda environment:

conda create -n SD python=3.10 numpy pillow 
activate SD
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
pip install xformers

NB: It's preferable to install the xformers library, to increase performance and to run SD at any resolution on lower-grade hardware (e.g. video cards with 6 GB VRAM). However, it's not guaranteed to work with all the (quickly changing) versions of PyTorch, hence it's separated from the rest of the requirements. If you're on Windows, first ensure that you have Visual Studio 2019 installed.

Run the command below to download the following models: Stable Diffusion 1.5, 1.5 Dreamlike Photoreal, 2-inpaint, 2.1, 2.1-v, custom VAE, LCM, ZeroScope, AnimateDiff, ControlNet, instruct-pix2pix, IP adapter with CLIPVision, CLIPseg (converted to float16 for faster loading). Licensing info is available on their webpages.

python download.py

Operations

Examples of usage:

  • Generate an image from the text prompt:
python src/gen.py --in_txt "hello world" --size 1024-576
  • Redraw directory of images:
python src/gen.py --in_img _in/pix -t "neon light glow" --strength 0.7
  • Inpaint directory of images with inpainting model, turning humans into robots:
python src/gen.py -im _in/pix --mask "human, person" -t "steampunk robot" --model 2i
  • Make a video (frame sequence), interpolating between the lines of the text file:
python src/latwalk.py -t yourfile.txt --size 1024-576
  • Same, with drawing over a masked image:
python src/latwalk.py -t yourfile.txt -im _in/pix/alex-iby-G_Pk4D9rMLs.jpg --mask "human boy" --invert_mask -m 2i
  • Same as above, with recursive pan/zoom motion (beware of possible imagery degradation on longer runs):
python src/recur.py -t yourfile.txt --fstep 5 --scale 0.01 -m 15drm
  • Hallucinate a video, including your real images:
python src/latwalk.py -im _in/pix --cfg_scale 0 -f 1
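
To give a feel for what "interpolating" between key points means in the latwalk.py examples above, here is a minimal sketch of spherical interpolation (slerp) between plain Python vectors. It is an illustration only: whether the toolkit interpolates exactly this way is an assumption, and the functions slerp and walk are hypothetical names, not part of the repo. The fstep argument mimics the --fstep option, read here as "frames per transition" (also an assumption).

```python
import math

def slerp(a, b, t):
    """Spherically interpolate between two non-zero, equal-length vectors; t in [0, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cos_omega = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    omega = math.acos(cos_omega)
    if omega < 1e-6 or math.pi - omega < 1e-6:
        # (Anti)parallel vectors: the slerp weights are ill-defined, use plain lerp.
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    sin_omega = math.sin(omega)
    wa = math.sin((1 - t) * omega) / sin_omega
    wb = math.sin(t * omega) / sin_omega
    return [wa * x + wb * y for x, y in zip(a, b)]

def walk(keys, fstep):
    """Build a frame list: fstep frames per pair of neighbouring keys, plus the last key."""
    frames = []
    for a, b in zip(keys, keys[1:]):
        for i in range(fstep):
            frames.append(slerp(a, b, i / fstep))
    frames.append(list(keys[-1]))
    return frames
```

For two keys and fstep 5, walk returns 6 vectors: the first key, four in-between states, and the last key.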

Interpolations can be made smoother (and faster) by adding --latblend X option (latent blending technique, X in range 0~1). If needed, smooth the result further with FILM.
Models can be selected with the --model option by either a shortcut (15, 15drm, 21, 21v, ..), a path on the Hugging Face website (e.g. SG161222/Realistic_Vision_V2.0, which is auto-downloaded for further use) or a local path to the downloaded file set (or safetensors file).
Coherence in details may be enhanced by Self-Attention Guidance with argument --sag_scale X (~1.5x slower, best with ddpm sampler). It works with per-frame generation and AnimateDiff, but not for latent blending (yet).
Check other options and their shortcuts by running these scripts with --help option.
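
The three ways of naming a model for --model (shortcut, Hugging Face path, local path) can be told apart with a few checks. The sketch below is a guess at such logic, not the repo's code: classify_model is a hypothetical helper, the order of the checks is an assumption, and the descriptions in SHORTCUTS are illustrative labels for the shortcut names mentioned above.

```python
from pathlib import Path

# Shortcut names are taken from the text above; the descriptions are illustrative guesses.
SHORTCUTS = {
    "15": "Stable Diffusion 1.5",
    "15drm": "1.5 Dreamlike Photoreal",
    "21": "Stable Diffusion 2.1",
    "21v": "Stable Diffusion 2.1-v",
}

def classify_model(spec, shortcuts=SHORTCUTS):
    """Return 'shortcut', 'local' or 'hub' for a --model value (assumed check order)."""
    if spec in shortcuts:
        return "shortcut"
    if Path(spec).exists():
        # An existing directory (file set) or a single safetensors file.
        return "local"
    # Anything else is treated as a Hugging Face repo id, e.g. "user/model".
    return "hub"
```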

One of the most impressive recent advances is the ultrafast consistency generation approach (used in the LCM, TCD Scheduler and SDXL-Lightning techniques). It replaces the regular diffusion part with a more direct latent prediction by a distilled model, and requires very few (4 or more) steps to run. To use TCD Scheduler with any SD 1.5 base model, add the --sampler tcd --load_lora h1t/TCD-SD15-LoRA --cfg_scale 1 -s X options, where X is low (starting from 4). The quality seems to be sensitive to how elaborate the prompt is.

There are also a few Windows bat-files that slightly simplify and automate the commands.

Prompts

Text prompts may include brackets for weighting (like (good) [bad] ((even better)) [[even worse]]).
More radical blending can be achieved with the multiguidance technique, introduced here (interpolating predicted noise within the diffusion denoising loop, instead of conditioning vectors). It can be used to draw images from complex prompts like good prompt ~1 | also good prompt ~1 | bad prompt ~-0.5 with the --cguide option, or for animations with the --lguide option (further enhancing the smoothness of latent blending). Note that it slows down the generation process.
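
The multi-prompt syntax and the idea of mixing predicted noise can be sketched in a few lines. This is a toy illustration under stated assumptions, not the toolkit's implementation: parse_multiprompt and blend_noise are hypothetical names, noise predictions are plain lists of floats instead of tensors, and the combination rule (unconditional prediction plus a weighted sum of per-prompt differences) is one common formulation that may differ from what --cguide actually does.

```python
def parse_multiprompt(text):
    """Split 'a ~1 | b ~-0.5' into [(prompt, weight), ...]; a missing weight means 1.0."""
    parts = []
    for chunk in text.split("|"):
        chunk = chunk.strip()
        if not chunk:
            continue
        prompt, sep, weight = chunk.rpartition("~")
        if sep:
            parts.append((prompt.strip(), float(weight)))
        else:
            parts.append((chunk, 1.0))
    return parts

def blend_noise(uncond, conds, weights, cfg_scale):
    """Mix per-prompt noise predictions inside one denoising step (assumed rule).

    result = uncond + cfg_scale * sum_i weights[i] * (conds[i] - uncond)
    """
    out = []
    for j, u in enumerate(uncond):
        delta = sum(w * (c[j] - u) for c, w in zip(conds, weights))
        out.append(u + cfg_scale * delta)
    return out
```

Each prompt needs its own noise prediction per step in such a scheme, which is consistent with the slowdown noted above.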

It's also possible to use reference images as visual prompts by providing the path with the --img_ref .. option. For a single reference, you can use either a single image, or any file set with the --allref option. For an ordered scenario, you should provide a directory with image files or subdirectories (with images) to pick them one by one. The latter is preferable, as the referencing quality is better when using 3-5 images than a single one. For instance, this would make a smooth interpolation over a directory of images as visual prompts:

python src/latwalk.py --img_ref _in/pix --latblend 0.8 --size 1024-576
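
The "ordered scenario" layout described above (a directory whose entries are either single images or subdirectories of images) can be walked as follows. This is a sketch of the idea only: reference_groups and list_images are hypothetical helpers, and the extension list and alphabetical ordering are assumptions about how such a folder might be read.

```python
from pathlib import Path

# Assumed set of accepted image extensions.
IMAGE_EXTS = {".jpg", ".jpeg", ".png"}

def list_images(folder):
    """Return the image files of one folder, sorted by name."""
    return sorted(p for p in Path(folder).iterdir()
                  if p.is_file() and p.suffix.lower() in IMAGE_EXTS)

def reference_groups(root):
    """Return an ordered list of reference groups, each a list of image paths.

    A subdirectory becomes one group (several images for one visual prompt),
    a loose image file becomes a group of its own.
    """
    root = Path(root)
    if root.is_file():
        return [[root]]
    groups = []
    for entry in sorted(root.iterdir()):
        if entry.is_dir():
            images = list_images(entry)
            if images:
                groups.append(images)
        elif entry.suffix.lower() in IMAGE_EXTS:
            groups.append([entry])
    return groups
```

With subdirectories of 3-5 images each, every group would serve as one visual prompt in the sequence.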

Guide synthesis with ControlNet or Instruct pix2pix

  • Generate an image from existing one, using its depth map as conditioning (extra guiding source):
python src/preproc.py -i _in/something.jpg --type depth -o _in/depth
python src/gen.py --control_mod depth --control_img _in/depth/something.jpg -im _in/something.jpg -t "neon glow steampunk" -f 1

One can replace depth in the commands above with canny (edges) or pose (if there are humans in the source).
Option -im ... may be omitted to employ "pure" txt2img method, pushing the result closer to the text prompt:

python src/preproc.py -i _in/something.jpg --type canny -o _in/canny
python src/gen.py --control_mod canny --control_img _in/canny/something.jpg -t "neon glow steampunk" --size 1024-512 --model 15drm

ControlNet options can be used for interpolations as well (fancy making videomapping over a building photo?):

python src/latwalk.py --control_mod canny --control_img _in/canny/something.jpg --control_scale 0.5 -t yourfile.txt --size 1024-512 --fstep 5

Also with pan/zoom recursion:

python src/recur.py -cmod canny -cnimg _in/canny/something.jpg -cts 0.5 -t yourfile.txt --size 1024-640 -fs 5 -is 12 --scale 0.02 -m 15drm

More ways to edit images

Instruct pix2pix:

python src/gen.py -im _in/pix --img_scale 2 -C 9 -t "turn human to puppet" --model 1p2p

TokenFlow (temporally stable!):

python src/tokenflow.py -im _in/yoursequence -t "rusty metallic sculpture" --batch_size 4 --batch_pivot --cpu

TokenFlow employs either pnp or sde method and can be used with various models & ControlNet options.
NB: this method handles all frames at once (that's why it's so stable). As such, it cannot consume long sequences by design. Pivots batching & CPU offloading (introduced in this repo) pushed the limits, yet didn't remove them. As an example, I managed to process only 300+ frames of 960x540 on a 3090 GPU in batches of 5 without OOM (or without falling back to the 10x slower shared RAM with new Nvidia drivers).
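
To illustrate the batching mentioned above, here is a toy sketch that splits a frame sequence into fixed-size batches and picks one pivot frame per batch. It only shows the bookkeeping: pivot_batches is a hypothetical function, and choosing the pivot at random within each batch is an assumption, since the text above does not say how --batch_pivot selects pivots.

```python
import random

def pivot_batches(num_frames, batch_size, seed=None):
    """Split frame indices into batches and pick one pivot index per batch.

    Returns a list of (pivot, frames) tuples; the last batch may be shorter.
    """
    rng = random.Random(seed)
    batches = []
    for start in range(0, num_frames, batch_size):
        frames = list(range(start, min(start + batch_size, num_frames)))
        batches.append((rng.choice(frames), frames))
    return batches
```

Only one batch of frames has to be processed at a time this way, though the pivots of all batches still have to be handled together.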

Text to Video

Generate a video from a text prompt with AnimateDiff motion adapter (may combine it with any base SD model):

python src/anima.py -t "fiery dragon in a China shop" -m 15drm --frames 100 --loop

Process existing video:

python src/anima.py -t "rusty metallic sculpture" -iv yourvideo.mp4 -f 0.7 -m 15drm

Generate a video interpolation over a text file (as text prompts) and a directory of images (as visual prompts):

python src/anima.py -t yourfile.txt -imr _in/pix -m 15drm --frames 200 

Generate a video from a text prompt with ZeroScope model (kinda obsolete):

python src/vid.py -t "fiery dragon in a China shop" --model vzs --frames 100 --loop

Process existing video:

python src/vid.py -t "combat in the dancehall" --in_vid yourvideo.mp4 --model vzs

NB: this model is limited to rather mundane stuff, don't expect any notable level of abstraction or fantasy here.

Fine-tuning

  • Train new token embedding for a specific subject (e.g. cat) with textual inversion:
python src/train.py --token mycat1 --term cat --data data/mycat1 -lr 0.001 --type text
  • Finetune the model (namely, part of the Unet attention layers) with LoRA:
python src/train.py --data data/mycat1 -lr 0.0001 --type lora
  • Train with custom diffusion (using both target and category images):
python src/train.py --token mycat1 --term cat --data data/mycat1 --term_data data/cat --type custom

Add --style if you're training for a style rather than an object. Speed up custom diffusion with --xformers (LoRA takes care of it on its own); add --low_mem if you get OOM.
Results of the trainings will be saved under train directory.

Custom diffusion trains faster and can achieve impressive reproduction quality (including faces) with simple similar prompts, but it can lose the point on generation if the prompt is too complex or strays from the original category. To train it, you'll need both target reference images (data/mycat1) and more random images of similar subjects (data/cat). Apparently, you can generate the latter with SD itself.
LoRA finetuning seems less precise but may affect a wider spectrum of topics, and is now a de facto industry standard.
Textual inversion is more generic but stable. Also, its embeddings can be easily combined together on load.

  • Generate an image with trained weights from LoRA:
python src/gen.py -t "cosmic beast cat" --load_lora mycat1-lora.pt
  • Same with custom diffusion:
python src/gen.py -t "cosmic <mycat1> cat beast" --load_custom mycat1-custom.pt
  • Same with textual inversion (you may provide a folder path to load few files at once):
python src/gen.py -t "cosmic <mycat1> cat beast" --load_token mycat1-text.pt

Note that you should add (mind the brackets) <token> term .. keywords to the prompt to activate the learned subject with Textual Inversion or Custom Diffusion. Put it in the beginning for learned objects, or at the end for styles. LoRA is not bound to such syntax.
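
The placement rule above can be written down as a tiny helper. It is a sketch for building prompt strings, not part of the toolkit: activate_token is a hypothetical name, and joining the parts with a single space is an assumption.

```python
def activate_token(prompt, token, term, style=False):
    """Add the '<token> term' keywords to a prompt.

    Learned objects go to the beginning of the prompt, learned styles to the end.
    """
    keywords = f"<{token}> {term}"
    if style:
        return f"{prompt} {keywords}"
    return f"{keywords} {prompt}"
```

For example, activate_token("cosmic beast", "mycat1", "cat") gives "<mycat1> cat cosmic beast", which can then be passed to gen.py with -t.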

You can also run python src/latwalk.py ... with finetuned weights to make animations.

Special model: LCM

Ultrafast Latent Consistency Model (LCM) with only 2~4 steps to run; supported only for image generation (not for video!).
Examples of usage:

python src/gen.py -m lcm -t "hello world"
python src/gen.py -m lcm -im _in/pix -t "neon light glow" -f 0.5
python src/gen.py -m lcm -cmod depth -cnimg _in/depth/something.jpg -im _in/something.jpg -t "neon glow steampunk" -f 1
python src/latwalk.py -m lcm -t yourfile.txt
python src/latwalk.py -m lcm -t yourfile.txt -lb 0.75 -s 8

Special model: SDXL

SDXL is a high-quality HD model, widely used these days.
Supported features: txt2img, img2img, image references, depth/canny controlnet, text interpolations with latent blending, dual prompts (native).
Unsupported (yet): video generation, multi guidance, fine-tuning, weighted prompts.
NB: The models (~8 GB total) are auto-downloaded on the first use; you may download them yourself and set the path with the --models_dir ... option.
As an example, interpolate with ControlNet and Latent Blending:

python src/sdxl.py -v -t yourfile.txt -cnimg _in/something.jpg -cmod depth -cts 0.6 --size 1280-768 -fs 5 -lb 0.75

Methods for ultrafast generation with only a few steps:

  • Distilled model SDXL-Lightning. Use it with the --lightning -s X options, where X = 2, 4 or 8. Pro: best quality; con: requires a special model.
  • TCD Scheduler. Use it with the --sampler TCD --load_lora h1t/TCD-SDXL-LoRA --cfg_scale 1 -s X options, where X is low (starting from 4). Pro: applicable to any SDXL model; con: quality may be worse (sensitive to the prompts).
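
The two option sets above are easy to mix up, so here is a small sketch that assembles them and checks the step count. The flags are copied from the list above; fast_sdxl_args itself is a hypothetical helper, not something shipped with the toolkit, and the lower bound of 4 steps for TCD is read from "starting from 4".

```python
def fast_sdxl_args(method, steps):
    """Return the extra command-line arguments for few-step SDXL generation."""
    if method == "lightning":
        if steps not in (2, 4, 8):
            raise ValueError("SDXL-Lightning expects 2, 4 or 8 steps")
        return ["--lightning", "-s", str(steps)]
    if method == "tcd":
        if steps < 4:
            raise ValueError("TCD Scheduler expects 4 or more steps")
        return ["--sampler", "TCD", "--load_lora", "h1t/TCD-SDXL-LoRA",
                "--cfg_scale", "1", "-s", str(steps)]
    raise ValueError(f"unknown method: {method}")
```

The returned list can be appended to a base command such as ["python", "src/sdxl.py", "-t", "hello world"] and run with subprocess.run.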

Generate a video with SDXL model and AnimateDiff motion adapter (beware: sensitive to complex prompts):

python src/sdxl.py -v -t "fiery dragon in a China shop" -ad guoyww/animatediff-motion-adapter-sdxl-beta -sm euler -s 23 --size 1024-576

Technically, AnimateDiff-XL supports fast SDXL-Lightning models and TCD Scheduler, but the results are very poor.

Special model: Kandinsky 2.2

Another interesting model is Kandinsky 2.2, featuring txt2img, img2img, inpaint, depth-based controlnet and simple interpolations. Its architecture and pipelines differ from Stable Diffusion, so there's also a separate script for it, wrapping those pipelines. The options are similar to the above; run python src/kand.py -h to see unused ones. It also consumes only unweighted prompts (no brackets, etc).
NB: The models (heavy!) are auto-downloaded on the first use; you may download them yourself and set the path with --models_dir ... option.
As an example, interpolate with ControlNet:

python src/kand.py -v -t yourfile.txt -cnimg _in/something.jpg -cts 0.6 --size 1280-720 -fs 5

Credits

It's quite hard to mention all those who made the current revolution in visual creativity possible. Check the inline links above for some of the sources. Huge respect to the people behind Stable Diffusion, Hugging Face, and the whole open-source movement.