Additional conditioning layers

ArEnSc opened this issue 10 months ago • 5 comments

I am trying to find a way to augment LTX to support additional conditioning by using a pose video latent—encoding that information through the VAE—to enable posable controllability.

I have looked at how you conditioned the model to support image-to-video. I was considering fine-tuning the model with image-to-video conditioning and adding an extra cross-attention layer alongside the original one, along the lines of:

Simplified idea:

```python
x_sa = self_attn(x)                    # self-attention
x_ca_A = cross_attn_A(x_sa, cond_A)    # cross-attention for condition A (text)
x_ca_B = cross_attn_B(x_ca_A, cond_B)  # cross-attention for condition B (pose)
output = feed_forward(x_ca_B)          # final feed-forward network
```

I would freeze the base model and fine-tune only the new cross-attention layers. Does this seem like a sensible thing to do? Do you have any tips or thoughts?

How would you recommend doing this, @yoavhacohen?

ArEnSc avatar Feb 12 '25 16:02 ArEnSc

Adding new cross-attention layers is a valid option, but you’ll likely need to handle positional embeddings. The current cross-attention layers don’t use positional embeddings, so you might want to reference how we handle them in self-attention.

Keep in mind that this is different from how we condition the model for image-to-video, which is implemented as a temporal inpainting task using a different timestep for the conditioning tokens—see the paper for details.

Your approach of freezing the model and fine-tuning only the new cross-attention layers makes sense, especially if your goal is to minimize catastrophic forgetting while maintaining the base model’s capabilities. You might also consider training a LoRA adapter for the rest of the model alongside the new cross-attention layers to allow for more flexible adaptation.
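For concreteness, here is a rough PyTorch sketch of that block structure (the module names, the zero-initialized output projection, and the omission of norms and positional-embedding handling are all illustrative simplifications, not LTX-Video's actual implementation):

```python
import torch
import torch.nn as nn

class DualCondBlock(nn.Module):
    """Illustrative transformer block: frozen base layers plus a new,
    trainable cross-attention branch for pose tokens."""

    def __init__(self, dim: int, heads: int):
        super().__init__()
        # Existing (frozen) layers
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # New (trainable) cross-attention for pose conditioning; zero-init the
        # output projection so the block initially reproduces the base model.
        self.cross_attn_pose = nn.MultiheadAttention(dim, heads, batch_first=True)
        nn.init.zeros_(self.cross_attn_pose.out_proj.weight)
        nn.init.zeros_(self.cross_attn_pose.out_proj.bias)

    def forward(self, x, text_tokens, pose_tokens):
        x = x + self.self_attn(x, x, x)[0]
        x = x + self.cross_attn_text(x, text_tokens, text_tokens)[0]
        x = x + self.cross_attn_pose(x, pose_tokens, pose_tokens)[0]  # new branch
        return x + self.ff(x)

# Freeze everything except the new cross-attention layers:
block = DualCondBlock(dim=2048, heads=16)
for name, p in block.named_parameters():
    p.requires_grad = "cross_attn_pose" in name
```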

Would be happy to discuss more details if you have specific constraints or goals in mind!

yoavhacohen avatar Feb 12 '25 20:02 yoavhacohen

@yoavhacohen Just to clarify: I am trying to extend and build upon the image-to-video conditioning with pose conditioning. Are you suggesting that the current image-to-video conditioning mechanism can be extended to also incorporate explicit pose information to better guide the temporal generation? How do you envision integrating this additional pose conditioning, or would using the cross-attention layers be effective? The end result I am trying to achieve is to use pose conditioning to steer a character in an image.

ArEnSc avatar Feb 12 '25 23:02 ArEnSc

Our image-to-video conditioning is implemented as a temporal inpainting task, using a different timestep for the conditioning tokens; it doesn't rely on cross-attention.

If you want to apply a similar approach for pose conditioning, simply add more tokens with the same positional embeddings as the generated ones, but assign a different timestep embedding to the conditioning tokens.
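In schematic form, the token layout could look like this (a minimal sketch; the sizes, the additive positional embeddings, and the `denoiser` call are placeholders, since LTX-Video uses rotary embeddings and its own per-token timestep embedding):

```python
import torch

B, N, D = 1, 1024, 64                 # illustrative batch, token count, channel dim
noisy_tokens = torch.randn(B, N, D)   # tokens to be denoised
pose_tokens = torch.randn(B, N, D)    # VAE-encoded pose video, patchified like the latents
pos = torch.randn(1, N, D)            # placeholder positional embeddings

# Conditioning tokens reuse the positional embeddings of the generated tokens
# they correspond to, so the model can align them spatio-temporally.
seq = torch.cat([pose_tokens + pos, noisy_tokens + pos], dim=1)

# Per-token timesteps: conditioning tokens are marked (near-)clean, while the
# generated tokens carry the current diffusion timestep.
t_current = 0.7
timesteps = torch.cat([
    torch.zeros(B, N),                # conditioning tokens: treated as clean
    torch.full((B, N), t_current),    # generated tokens: current noise level
], dim=1)

# pred = denoiser(seq, timesteps)     # hypothetical call with per-token timesteps
```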

yoavhacohen avatar Feb 13 '25 13:02 yoavhacohen

Ok, so I believe I understand...

You concatenate the pose tokens alongside the seed image tokens, plus noise for the remaining sequence you want to predict, keeping the noise low-to-clean for the pose tokens and the seed image tokens.

I suspect this will increase memory usage a bit during inference due to the longer sequence length.

The target to predict is the flow (velocity) toward the whole conditioned sequence.

During inference, you add a tiny bit of noise to the initial conditioning tokens. Then, after sampling, you just peel off the "denoised" conditioning tokens.
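A rough end-to-end inference sketch of that recipe (all names and sizes are illustrative, and `denoiser` is a stand-in for the transformer):

```python
import torch

B, n_cond, n_gen, dim, steps = 1, 256, 1024, 64, 40   # illustrative sizes

def denoiser(seq, t_tok):
    """Stand-in for the transformer: predicts a velocity for every token."""
    return torch.zeros_like(seq)

sigma_cond = 0.05                                  # tiny noise kept on conditioning tokens
cond_tokens = torch.randn(B, n_cond, dim)          # clean pose + seed-image tokens
cond = cond_tokens + sigma_cond * torch.randn_like(cond_tokens)
x = torch.randn(B, n_gen, dim)                     # pure noise for the tokens to generate

ts = torch.linspace(1.0, sigma_cond, steps + 1)    # schematic sigma schedule
for t, t_next in zip(ts[:-1], ts[1:]):
    seq = torch.cat([cond, x], dim=1)
    t_tok = torch.cat([
        torch.full((B, n_cond), sigma_cond),       # conditioning: fixed, near-clean
        torch.full((B, n_gen), float(t)),          # generated: current noise level
    ], dim=1)
    v = denoiser(seq, t_tok)
    x = x + (t_next - t) * v[:, n_cond:]           # Euler step on generated tokens only

video_latents = x                                  # "peel off" the conditioning tokens
```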

ArEnSc avatar Feb 18 '25 17:02 ArEnSc

Hi @ArEnSc, I am also interested in adding conditioning to LTX. Did you try adding more tokens? How did it perform?

AlfaranoAndrea avatar Mar 20 '25 12:03 AlfaranoAndrea