
Faster training with token downsampling

Open feffy380 opened this issue 11 months ago • 4 comments

This PR adds support for training with token downsampling and replaces my token merge PR (#1146).

Token downsampling is a lossy optimization that significantly speeds up inference and training. It tries to avoid the quality loss of token merging by only downsampling K and V in the attention operation (Q is preserved), and it replaces the expensive token-similarity calculation with simple downsampling. Applying the optimization during training seems to cause less quality loss than applying it at inference, so I was able to use much more aggressive downsampling without negative effects.
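For intuition, here is a minimal sketch of the idea in PyTorch. It is illustrative only (single attention head, no masking, diffusers-style `to_q`/`to_k`/`to_v` projections assumed) and is not the code in this PR:

```python
import torch
import torch.nn.functional as F

def downsample_tokens(x: torch.Tensor, h: int, w: int, factor: float) -> torch.Tensor:
    # x: (batch, tokens, channels) with tokens == h * w
    b, n, c = x.shape
    x = x.permute(0, 2, 1).reshape(b, c, h, w)            # restore the spatial layout
    x = F.interpolate(x, scale_factor=1.0 / factor,
                      mode="nearest-exact")                # cheap resize instead of token similarity
    return x.flatten(2).permute(0, 2, 1)                   # (batch, fewer tokens, channels)

def todo_self_attention(attn, hidden_states, h, w, factor=2.0):
    q = attn.to_q(hidden_states)                           # Q keeps every token
    kv = downsample_tokens(hidden_states, h, w, factor)    # K and V see a reduced token set
    k = attn.to_k(kv)
    v = attn.to_v(kv)
    # one output token per query, so the layer's output shape is unchanged
    return F.scaled_dot_product_attention(q, k, v)
```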

(Note: I'm using an AMD GPU. Based on some user reports, the speedup is much less dramatic on Nvidia with xformers. ToDo may simply be closing that gap in my case, so you might not achieve the speedups reported below. I'd welcome benchmarks from Nvidia users.) With a downsampling factor of 2 at 768px resolution I get a 2x speedup for SD1.x LoRA training. Downsampling factor 4 with max_downsample=2 gave me an even bigger 3.2x speedup with basically no quality loss. SDXL benefits less because its architecture is already more efficient, but I still saw about a 1.3x speedup at 1024px with downsample factor 2 and 1.7x with factor 4. The potential speedup is larger at higher resolutions.

This PR adds two new flags:

  • --todo_factor: (float) token downsampling factor > 1. The inputs of the unet's self-attention layers are scaled down by this factor. Recommend 2-4. Multiple values can be specified to override the factor for different depths.
  • --todo_max_depth: ([1, 2, 3, 4]) maximum depth to apply ToDo. Max for SDXL is 2. Recommend 1 or 2. Default is autodetected based on the number of values passed to todo_factor.

Sample usage: --todo_factor 4 --todo_max_depth 2 is the same as --todo_factor 4 4

The unet is patched when the model is loaded, so the optimization should automatically work with all training scripts, but I only tested train_network.py and sdxl_train_network.py. The downsampling operation is implemented with PyTorch hooks, so model saving should be unaffected.
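For reference, one way such a hook-based patch can be sketched is shown below. This is illustrative only (it assumes diffusers-style module names like `attn1`/`to_k`/`to_v`, assumes square latents, and omits the depth filtering); it is not the PR's actual implementation:

```python
import math
import torch.nn.functional as F

def make_downsample_hook(factor: float):
    # forward pre-hook for the to_k / to_v projections: shrink their input
    # tokens while to_q keeps seeing the full-resolution sequence
    def hook(module, args):
        (hidden_states,) = args                            # (batch, tokens, channels)
        b, n, c = hidden_states.shape
        h = w = int(math.sqrt(n))                          # assumes square latents for brevity
        x = hidden_states.permute(0, 2, 1).reshape(b, c, h, w)
        x = F.interpolate(x, scale_factor=1.0 / factor, mode="nearest-exact")
        return (x.flatten(2).permute(0, 2, 1),)            # replaces the projection's input
    return hook

def patch_unet(unet, factor: float):
    handles = []
    for name, module in unet.named_modules():
        # "attn1" is the self-attention block in diffusers naming; the actual
        # patch additionally restricts this by depth (--todo_max_depth)
        if name.endswith("attn1"):
            for proj in (module.to_k, module.to_v):
                handles.append(proj.register_forward_pre_hook(make_downsample_hook(factor)))
    return handles  # hooks only wrap forward(), so the saved weights are untouched
```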

Example:

| Name | Downsample factor | s/it | Speedup |
|------|-------------------|------|---------|
| feffy-v3.50 | None | 2.0 | 1x |
| feffy-todo2 | 2 | 1.0 | 2x |
| feffy-todo4_2 | 4 | 0.63 | 3.2x |


Training details:

  • 7900 XTX with --mem_eff_attn
  • 768px resolution with bucketing
  • Batch size 4

feffy380 · Mar 03 '24 21:03

Very impressive improvement in speed as well as maintaining training quality.

Additionally, when I use --todo_args "downsample_method=nearest-exact", a parameter parsing error occurs. It doesn't affect anything, though, since nearest-exact is the default value anyway.

gesen2egee · Mar 04 '24 10:03

@gesen2egee I see the problem and pushed a fix.

To be honest, --todo_args doesn't seem necessary. It's a leftover from the research code which had options for both ToDo and ToMe. Here it's only needed for downsample_method, and the default method already works so well that there isn't much reason to use a more expensive interpolation mode.

~~If nobody objects, I'm going to remove --todo_args~~

Actually, I think the arguments need an overhaul. I'm not sure the average user needs this much granularity when setting the downsampling factor. The way ToMe does it is to apply the same merge ratio everywhere and use max_downsample to control which layers are affected. I'll train another LoRA with this approach, and if it works, we can simplify the arguments to a single --todo_factor and --todo_max_downsample=[1,2,4,8].

feffy380 · Mar 04 '24 13:03

OK, I think that settles it. I'll go for simplicity.

[image: feffy-todo4_4 comparison]

feffy380 · Mar 04 '24 16:03

--todo_factor now accepts multiple values if you want to override the downsampling factor for individual depths.

Example: --todo_factor 4 2 will use factor=4 for depth 1 and factor=2 for depth 2.
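As a sketch, the mapping described above could be read like this (illustrative only, not the PR's parsing code):

```python
def factor_for_depth(todo_factor: list[float], depth: int) -> float:
    # Illustrative mapping of --todo_factor values to unet depths (1-based).
    # A single value is broadcast to every depth up to --todo_max_depth
    # (so "--todo_factor 4 --todo_max_depth 2" == "--todo_factor 4 4");
    # multiple values are assigned to depths in order.
    if len(todo_factor) == 1:
        return todo_factor[0]
    return todo_factor[depth - 1]

factor_for_depth([4.0, 2.0], depth=1)  # -> 4.0
factor_for_depth([4.0, 2.0], depth=2)  # -> 2.0
```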

feffy380 · Mar 11 '24 18:03