📣 [Feature Update] ✨ REAL DDIM Inversion ✨ is now possible on CogVideoX!
It is well known that applying DDIM inversion to CogVideoX and reconstructing from the inverted latent often produces oversaturated, washed-out results.
https://github.com/user-attachments/assets/81f30713-40f6-4d7a-b618-0bb8695b7ddd
⏳ Background
To solve this inverse problem, a `ddim_inversion.py` script was recently shared in the CogVideoX repository.
However, this implementation takes a non-standard approach. Instead of directly using the inverted latent as the initial noise for reconstruction, it employs the inverted latent as a reference for the KV caching mechanism.
Specifically, at each timestep and in every DiT layer, the model performs two separate attention computations, as sketched below:
- One attention pass over the concatenation of the current noise and the reference latent (key, value concatenated with key_reference, value_reference)
- A second pass using only the reference latent, whose output is stored for attention sharing in the next layer (please refer to the corresponding lines in the script)
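Roughly, the per-layer computation looks like the sketch below. This is a paraphrase of the mechanism, not the script's actual code; `reference_attention`, the `to_q`/`to_k`/`to_v` projections, and the single-head shapes are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def reference_attention(attn, hidden_states, reference):
    """Illustrative dual-pass attention (multi-head reshaping omitted)."""
    # Pass 1: queries come from the current noisy latent; keys/values come
    # from the concatenation of the current latent and the reference latent.
    kv_input = torch.cat([hidden_states, reference], dim=1)
    out = F.scaled_dot_product_attention(
        attn.to_q(hidden_states), attn.to_k(kv_input), attn.to_v(kv_input)
    )

    # Pass 2: attention over the reference latent alone; the result is
    # stored and reused as the reference for the next layer.
    new_reference = F.scaled_dot_product_attention(
        attn.to_q(reference), attn.to_k(reference), attn.to_v(reference)
    )
    return out, new_reference
```

The second pass is what lets each layer attend to a reference that has itself been processed by the preceding layers, rather than to the raw inverted latent.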
✨ Simple and Efficient Solution
In our new paper, *Dynamic View Synthesis as an Inverse Problem*, we first address this inverse problem.
As a result of our work, one can simply invert & reconstruct a real video using the following steps:
Inversion Steps
- Invert the source video using `DDIMInverseScheduler` (a minimal sketch follows this list)
- Save only the inverted latent (let's call it `latents`)
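The inversion loop could look like the following. This assumes a loaded CogVideoX pipeline `pipe`, precomputed `prompt_embeds`, and clean video latents `init_latents` from the VAE; rotary embeddings, classifier-free guidance, and dtype/permute handling are omitted, so treat it as a sketch rather than the exact implementation:

```python
from diffusers import DDIMInverseScheduler

# swap the pipeline's scheduler for its DDIM-inverse counterpart
inverse_scheduler = DDIMInverseScheduler.from_config(pipe.scheduler.config)
inverse_scheduler.set_timesteps(50, device=pipe.device)

latents = init_latents  # clean video latents from the VAE
for t in inverse_scheduler.timesteps:
    # predict the noise at the current (inverse) timestep
    noise_pred = pipe.transformer(
        hidden_states=latents,
        encoder_hidden_states=prompt_embeds,  # assumed precomputed
        timestep=t.expand(latents.shape[0]),
        return_dict=False,
    )[0]
    latents = inverse_scheduler.step(noise_pred, t, latents, return_dict=False)[0]

# `latents` now holds the inverted latent to save for reconstruction
```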
Reconstruction Steps
- Encode the source video (example implementation):

```python
init_latents = [retrieve_latents(self.vae.encode(vid.unsqueeze(0)), generator) for vid in video]
```
- Then apply our proposed K-RNR in `prepare_latents`:

```python
k = 3  # see the paper for why k = 3 is optimal
for i in range(k):
    # scheduler.add_noise requires a timesteps argument; re-noising at
    # timesteps[i] (i.e. timesteps[0..k-1]) follows the discussion below
    latents = self.scheduler.add_noise(init_latents, latents, timesteps[i])
return latents
```
One can use the resulting latents as the input to the transformer to obtain sharp reconstructions in a training-free and very efficient manner. More video examples can be found in our supplementary videos.
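A hypothetical end-to-end call with the K-RNR latents might then look like this; `source_prompt` is a placeholder, the argument names follow diffusers' CogVideoX pipeline, and the step/guidance settings are assumptions:

```python
# feed the K-RNR latents to the pipeline as the initial latents
video = pipe(
    prompt=source_prompt,   # placeholder: the source video's prompt
    latents=latents,        # K-RNR latents from prepare_latents
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]
```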
If you use K-RNR, cite us:
```bibtex
@article{yesiltepe2025dynamic,
  title={Dynamic View Synthesis as an Inverse Problem},
  author={Yesiltepe, Hidir and Yanardag, Pinar},
  journal={arXiv preprint arXiv:2506.08004},
  year={2025}
}
```
> `latents = self.scheduler.add_noise(init_latents, latents)`

What's the timestep for `add_noise`? According to the original paper, should it be `timesteps[0]`, `timesteps[1]`, and `timesteps[2]`?
@yesiltepe-hidir Amazing work! Can we find your implementation somewhere? Also, which timesteps are used for `add_noise`?
Could you release the full code? My results look unsatisfactory.