Open-Sora-Plan
Questions about the CausalVideoVAE
Hi, thanks for introducing CausalVideoVAE. What resolution were the training videos for CausalVideoVAE, and is the compression rate 8 x 8 x 4? What is the feature dimension of the latent? To reconstruct an image with CausalVideoVAE, do I just repeat the frame to make a static video? When reconstructing a video, how do you handle the large number of frames (sample at a specific fps and then reconstruct the frames in groups)? Also, are the pre-trained weights available now? Can't wait to test :)
The latent dim is 4, and we will also train a 16 version. You do not need to repeat the image; just input an image with shape 1 x 1 x 3 x 256 x 256 (b x t x c x h x w). A continuous video can be reconstructed, or an fps can be specified. The checkpoint and training script are on the way.
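For readers who want to check the shapes, here is a rough sketch of the latent shape implied by the numbers in this thread (8 x 8 spatial compression, 4x temporal compression, latent dim 4). The helper name `latent_shape` is hypothetical and not part of the Open-Sora-Plan code. The rule that the first frame is encoded on its own, giving 1 + (t - 1) / 4 latent frames, is an assumption about how a causal video VAE typically handles a single image, and the b x t x c x h x w latent layout is assumed to mirror the input layout quoted above.

```python
def latent_shape(b, t, h, w, latent_dim=4, spatial=8, temporal=4):
    """Estimate the (b, t', c', h', w') latent shape for a (b, t, c, h, w) input.

    Hypothetical helper for illustration only; not an Open-Sora-Plan API.
    """
    if h % spatial or w % spatial:
        raise ValueError("height and width must be divisible by the spatial factor")
    # Assumption: the first frame is encoded alone, and each following group
    # of `temporal` frames is compressed into one latent frame.
    if (t - 1) % temporal:
        raise ValueError("t - 1 must be divisible by the temporal factor")
    t_latent = 1 + (t - 1) // temporal
    return (b, t_latent, latent_dim, h // spatial, w // spatial)


# A single 256 x 256 image passed as a one-frame video.
image_latent = latent_shape(b=1, t=1, h=256, w=256)

# A 17-frame 256 x 256 clip, under the same assumptions.
clip_latent = latent_shape(b=1, t=17, h=256, w=256)
```

Under these assumptions a single image needs no frame repetition, since t = 1 already satisfies the frame-count rule.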
Good work! I have the same question, can't wait to test the CausalVideoVAE!
Has the CausalVideoVAE been released yet? I'd like to test whether it can be used to train a codebook.
Did CausalVideoVAE use 17 frames for training, or was it trained with a variable number of frames? I'm asking because I noticed the same model can perform inference with any number of frames, so I'm interested in understanding some details.
The checkpoint is on HF, see here. Its pretrained weights are here.