Comfy-SVDTools
A collection of techniques that extend the functionality of Stable Video Diffusion in ComfyUI. Most of these were investigated for the purpose of extending context length, though they may be useful for other purposes as well. In certain cases you can generate videos up to four times the model's original trained context length, though this will require some experimentation.
I've divided the functionality into two nodes: SVDToolsPatcher and SVDToolsPatcherExperimental. Techniques in SVDToolsPatcherExperimental are marked as 'Experimental' below and may change or be removed in the future. Techniques in SVDToolsPatcher tend to give good results, and probably won't change.
Examples
Baseline (48 frames, with SVD - originally trained for 12 frames):
https://github.com/brianfitzgerald/svd_extender/assets/2797445/18ca3513-cf12-4598-84d3-00a3f5eda682
48 frames, with timestep scaled to 12 frames:
https://github.com/brianfitzgerald/svd_extender/assets/2797445/ebccc0a3-f071-40f7-9a10-62bf0118487c
48 frames, with timestep scaled to 12 frames, and attn_k_scale of 0.7:
https://github.com/brianfitzgerald/svd_extender/assets/2797445/284e5ef7-ea30-4e47-9094-b51082f31867
Techniques
Position Embedding Scaling
Similar to YaRN for language models, this technique scales the position embeddings in the SpatialVideoTransformer layers to match a set embedding length. For example, if position_embedding_frames is set to 12, but the batch size is 42, the model will generate video with 42 frames, but the position embeddings will be scaled to 12 frames. This allows the model to generate video with a longer context length than the position embeddings would normally allow.
Settings
- `scale_timestep_embedding`: Enable / disable position embedding scaling.
- `position_embedding_frames`: The number of frames to scale the position embeddings to. The model will be conditioned as if it were generating video with this many frames, but will actually generate video with the number of frames in the batch.
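The idea can be sketched in a few lines: the frame indices of the longer batch are compressed onto the trained embedding range before the position embeddings are computed. This is a minimal illustration, not the node's actual implementation, and the helper names here are hypothetical:

```python
import math

def scaled_frame_positions(num_frames, position_embedding_frames):
    # Map each of num_frames onto the [0, position_embedding_frames - 1]
    # range, so the model only ever sees positions inside its trained
    # context (e.g. 42 frames squeezed into a 12-frame embedding range).
    if num_frames <= 1:
        return [0.0]
    scale = (position_embedding_frames - 1) / (num_frames - 1)
    return [i * scale for i in range(num_frames)]

def sinusoidal_embedding(pos, dim):
    # Standard transformer sinusoidal embedding, evaluated at a
    # (possibly fractional) scaled position.
    emb = []
    for j in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * j / dim))
        emb.append(math.sin(pos * freq))
        emb.append(math.cos(pos * freq))
    return emb

positions = scaled_frame_positions(42, 12)
# positions[0] == 0.0, positions[-1] == 11.0
```

With `position_embedding_frames=12` and a 42-frame batch, every frame gets a fractional position between 0 and 11, which is what lets the model run past its trained context.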
Key Scaling
Scales the keys only for temporal attention. Consistently leads to less jittering at higher motion bucket IDs, especially with long context windows.
Settings
- `temporal_attn_k_scale`: Higher values lead to more movement, lower values to less. A value of 1.0 is the same as the default attention scaling.
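In effect, the keys are multiplied by the scale factor before the usual scaled dot-product attention. A minimal single-head sketch (assumed shapes and function name, not the actual patch):

```python
import numpy as np

def temporal_attention(q, k, v, attn_k_scale=1.0):
    # Sketch of scaled dot-product attention where the keys are scaled
    # by attn_k_scale before the usual 1/sqrt(d) factor. Values below
    # 1.0 flatten the attention distribution across frames, which is
    # what reduces jitter at high motion bucket IDs.
    k = k * attn_k_scale
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Scaling the keys is equivalent to scaling the attention logits, so `attn_k_scale` acts as a temperature on how sharply each frame attends to the others.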
Attention Windowing (Experimental)
Following the FreeNoise paper, this technique uses a windowed attention mechanism so that the attention in each temporal layer is computed only over a subset of the total latents.
Settings
- `attn_window_size`: The size of the window to use for attention. This is the number of latents to attend to in each layer.
- `attn_window_stride`: The stride of the window. This is the number of latents to skip between each window, i.e. a stride of 6 with a window size of 12 will attend to latents 0-11, 6-17, 12-23, etc.
- `shuffle_windowed_noise`: Shuffles the initial batch of latents. This is a technique mentioned in the FreeNoise paper, and can sometimes help with inter-batch stability.
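The window layout described above can be sketched as follows; this is an illustrative helper (hypothetical name), not the node's actual code:

```python
def attention_windows(num_frames, window_size, window_stride):
    # Enumerate the overlapping frame windows that each temporal layer
    # attends over, e.g. size 12, stride 6 over 24 frames gives
    # [0..11], [6..17], [12..23].
    windows = []
    start = 0
    while start + window_size <= num_frames:
        windows.append(list(range(start, start + window_size)))
        start += window_stride
    # Add a final window if the tail frames aren't covered yet.
    if not windows or windows[-1][-1] != num_frames - 1:
        windows.append(list(range(num_frames - window_size, num_frames)))
    return windows
```

Overlapping windows (stride smaller than size) are what keep neighboring segments consistent with each other, since the shared frames appear in more than one window.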
Temporal Attention Scale (Experimental)
An implementation of Jonathan Fischoff's technique for scaling the attention in each temporal layer. This scales the self-attention values by `sqrt(scale / dim_head)`.
Settings
- `temporal_attn_scale`: Higher values lead to more movement, lower values to less. A value of 1.0 is the same as the default attention scaling.
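The formula reduces to the default attention scaling when the parameter is 1.0, since standard attention multiplies the logits by `1/sqrt(dim_head)`. A quick sketch (hypothetical helper name):

```python
import math

def temporal_attn_scale_factor(dim_head, temporal_attn_scale=1.0):
    # Default scaled dot-product attention multiplies the logits by
    # 1/sqrt(dim_head). This variant uses sqrt(scale / dim_head), so a
    # scale of 1.0 reproduces the default factor exactly, while larger
    # values sharpen the attention distribution.
    return math.sqrt(temporal_attn_scale / dim_head)
```

For example, with `dim_head=64`, a scale of 1.0 gives the usual factor of 0.125, and a scale of 4.0 doubles it to 0.25.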
How to Use
Simply download or `git clone` this repository into `ComfyUI/custom_nodes`. An example pipeline is provided in the `resources` folder of this repo.
Limitations
- `xformers` must be installed; this is temporary, until the `scale` parameter is added to the self-attention nodes in ComfyUI.
- The SVDToolsPatcher nodes override the Comfy `comfy.sample.sample` function, in order to unpatch the `forward` method of `SpatialVideoTransformer`. This may cause issues with other custom sample nodes. This is done as there's no way to patch the `forward` method of `SpatialVideoTransformer` using `ModelPatcher`; if this is added to Comfy in the future, this override will be removed.
Up Next
Techniques I'm either currently working on implementing or plan to implement in the future:
- [ ] FreeInit
- [ ] Motion transfer, following the FreeNoise implementation
- [ ] Looping mode (overlap the first and last windows)
- [ ] Text conditioning interpolation / blending