Open-Sora-Plan
Question about latent size
This repo supports training with a latent size of 225×90×90 (t×h×w), which means we are able to train 1 minute of 1080P video at 30 FPS (with 2× interpolated frames and 2× super-resolution) under class-conditioning.
How is 225 × 90 × 90 calculated? I can see that 225 = 30 × 60 / 2 / 4, but how should I understand the 90 × 90?
Thanks
I share the same concern. I can't locate the code for the '225 * 90 * 90' setting.
The Sora technical report claims generation at any resolution up to 1920 × 1080 (1080P), and also the ability to generate 1 minute of video [it is unclear whether that is a single generation or multiple generations stitched together; the calculation below assumes a single generation]. One minute of 30 FPS video corresponds to 1800 frames. Assuming the model's capacity limit is 1920 × 1080, the equivalent square resolution is 1440 × 1440 (same pixel count). An output video of 1800 × 1440 × 1440 can be derived from a generated 900 × 720 × 720 video via 2× frame interpolation and 2× super-resolution. The latent size corresponding to 900 × 720 × 720 is 225 × 90 × 90, given Video-VQVAE's 4×8×8 stride.
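To make the arithmetic concrete, here is the same derivation as a short script; the numbers are exactly those from the explanation above:

```python
import math

# 1 minute of 30 FPS video
frames = 30 * 60                # 1800 frames

# 1920 x 1080 has the same pixel count as a 1440 x 1440 square
side = math.isqrt(1920 * 1080)  # 1440

# 2x frame interpolation and 2x super-resolution mean the model
# only needs to generate half the frames at half the resolution
gen_t = frames // 2             # 900
gen_hw = side // 2              # 720

# Video-VQVAE with a 4 x 8 x 8 (t x h x w) stride
latent_t = gen_t // 4           # 225
latent_hw = gen_hw // 8         # 90

print(latent_t, latent_hw, latent_hw)  # 225 90 90
```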
Thanks, is this trained using Latte with a patch size of 8?
Yes, we use Latte, but with a Video-VQVAE (trained by ourselves just for testing; not released yet).
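Since that Video-VQVAE is unreleased, here is only a minimal sketch of what a 4×8×8 (t×h×w) downsampling stride does to tensor shapes; the single `Conv3d` is a stand-in for illustration, not the actual architecture:

```python
import torch
import torch.nn as nn

class StrideSketch(nn.Module):
    """Illustrates a 4 x 8 x 8 (t x h x w) downsampling stride only."""
    def __init__(self, embed_dim: int = 4):
        super().__init__()
        self.down = nn.Conv3d(
            in_channels=3, out_channels=embed_dim,
            kernel_size=(4, 8, 8), stride=(4, 8, 8),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W) -> (B, embed_dim, T/4, H/8, W/8)
        return self.down(x)

# A small clip shows the stride behavior; a real 900 x 720 x 720 input
# would map to the 225 x 90 x 90 latent discussed above.
clip = torch.randn(1, 3, 16, 64, 64)
print(StrideSketch()(clip).shape)  # torch.Size([1, 4, 4, 8, 8])
```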
Thank you so much. Another question: does [Open-Sora-Plan] currently support training on 225 × 90 × 90 in a single forward pass, or does it train in multiple chunks?
Single forward-pass training.
We show some data in https://github.com/PKU-YuanGroup/Open-Sora-Plan/tree/main?tab=readme-ov-file#-improved-training-performance.
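For a sense of scale, here is a rough token-count estimate for a single forward pass over the full 225 × 90 × 90 latent. The `patch_size` below is a hypothetical value for illustration (the actual config may differ); the point is only that Latte-style factorized spatial/temporal attention keeps each attention's sequence length manageable even when the total token count is large:

```python
# Rough token-count estimate; patch_size is HYPOTHETICAL.
t, h, w = 225, 90, 90
patch_size = 2  # assumed spatial patch size for illustration

spatial_tokens = (h // patch_size) * (w // patch_size)  # 45 * 45 = 2025
total_tokens = t * spatial_tokens                       # ~456k tokens

# Factorized attention never attends over all ~456k tokens at once:
# spatial blocks see 2025 tokens per frame, temporal blocks see 225
# tokens per spatial location.
print(f"spatial sequence length:  {spatial_tokens}")
print(f"temporal sequence length: {t}")
print(f"total tokens per sample:  {total_tokens}")
```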
@LinB203 Hi, in my experiments a 4×8×8 stride in VideoGPT is too aggressive (I use a low embedding dim of 4 for the downstream diffusion model). Have you achieved good reconstruction with a 4×8×8 stride and a low embedding dim?
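For context, the compression ratio implied by that setting can be worked out directly; this just restates the concern numerically:

```python
# Each latent vector must summarize a whole 3 x 4 x 8 x 8 block of
# RGB pixel values under a 4 x 8 x 8 stride.
pixels_per_latent = 3 * 4 * 8 * 8     # 768 input values
embed_dim = 4                         # values per latent vector
print(pixels_per_latent / embed_dim)  # 192x compression, before quantization
```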