CogVideo
Question about the VAE upsampling
Hi, I am trying to understand the logic of CogVideoXUpsample3D. It seems that, for tensors with an odd t dimension, the first frame is handled separately and upsampled spatially only. (https://github.com/huggingface/diffusers/blob/2b443a5d621bd65f5cbf854195aef29cedd24058/src/diffusers/models/upsampling.py#L386)
Can you explain the purpose of this? Are you trying to preserve the parity of the t dimension?
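To make the question concrete, here is a minimal sketch of how I read that branch. It models only the frame axis and leaves out the spatial 2x upsampling. The function name upsample_time_sketch is hypothetical, and the nearest-neighbour 2x repetition in time is my assumption about the interpolation, not something I have confirmed from the maintainers.

```python
def upsample_time_sketch(frames):
    """Frame-axis-only model of the branch as I understand it.

    `frames` is a list of frame labels; spatial upsampling is omitted.
    Assumption: temporal upsampling is nearest-neighbour 2x repetition.
    """
    t = len(frames)
    if t > 1 and t % 2 == 1:
        # Odd length: the first frame gets spatial-only upsampling,
        # so it remains a single frame on the time axis.
        first, rest = frames[:1], frames[1:]
        # The remaining t - 1 frames are doubled in time.
        rest_up = [f for f in rest for _ in range(2)]
        return first + rest_up
    if t > 1:
        # Even length: every frame is doubled in time.
        return [f for f in frames for _ in range(2)]
    # Single frame: spatial-only upsampling, frame count unchanged.
    return list(frames)


if __name__ == "__main__":
    for t in (1, 2, 3, 13):
        out = upsample_time_sketch(list(range(t)))
        print(t, "->", len(out))
```

Under this reading, an odd input of t frames yields 2t - 1 frames, which is still odd, while an even input yields 2t frames. That is why I suspect the goal is to preserve parity.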
Thanks!