
Question about latent size

Open linzai1992 opened this issue 11 months ago • 8 comments

This repo supports training with a latent size of 225×90×90 (t×h×w), which means we are able to train on 1 minute of 1080P video at 30 FPS (with 2× frame interpolation and 2× super-resolution) under class conditioning.

How is 225 * 90 * 90 calculated? I can see that 225 comes from 30 * 60 / 2 / 4 = 225, but how should 90 * 90 be understood?

Thanks

linzai1992 avatar Mar 12 '24 10:03 linzai1992

I share the same concern. I can't locate the code for the '225 * 90 * 90' setting.

yhy-2000 avatar Mar 13 '24 02:03 yhy-2000

The Sora technical report claims generation at up to 1920 * 1080 resolution (1080P), and also the ability to generate 1 minute of video [it is unclear whether this is a single generation or multiple stitched generations; the following calculation assumes a single generation]. One minute of 30 FPS video corresponds to 1800 frames. Assuming the model capacity limit is 1920 * 1080, the equivalent square resolution is 1440 * 1440. The output video of 1800 * 1440 * 1440 can be derived from 900 * 720 * 720 via 2× frame interpolation and 2× super-resolution. The latent size corresponding to 900 * 720 * 720 is 225 * 90 * 90, given Video-VQVAE's 4x8x8 stride.

LinB203 avatar Mar 13 '24 04:03 LinB203
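The arithmetic above can be sketched in a few lines. This is illustrative only (the helper below is hypothetical, not code from this repo); it assumes 2× frame interpolation, 2× super-resolution, and the Video-VQVAE stride of 4x8x8 stated above.

```python
def latent_size(seconds=60, fps=30, height=1440, width=1440,
                interp=2, sr=2, stride=(4, 8, 8)):
    """Compute the latent size (t, h, w) for a given output video spec."""
    frames = seconds * fps             # 60 s * 30 FPS = 1800 frames
    t = frames // interp // stride[0]  # 1800 / 2 / 4 = 225
    h = height // sr // stride[1]      # 1440 / 2 / 8 = 90
    w = width // sr // stride[2]       # 1440 / 2 / 8 = 90
    return t, h, w

print(latent_size())  # (225, 90, 90)
```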

Thanks, is this trained using Latte with a patch size of 8?

yhy-2000 avatar Mar 13 '24 05:03 yhy-2000

Yes, Latte is used, but with a Video-VQVAE (one we trained ourselves for testing; it has not been released yet).

LinB203 avatar Mar 13 '24 05:03 LinB203

Thank you so much. Another question: does [Open-Sora-Plan] currently support 225 * 90 * 90 training in a single forward pass, or does it train in multiple chunks?

linzai1992 avatar Mar 15 '24 04:03 linzai1992

Single forward-pass training.

LinB203 avatar Mar 15 '24 05:03 LinB203

We show some data in https://github.com/PKU-YuanGroup/Open-Sora-Plan/tree/main?tab=readme-ov-file#-improved-training-performance.

LinB203 avatar Mar 15 '24 05:03 LinB203

@LinB203 Hi, in my experiment, the 4x8x8 stride setting in video-gpt is too aggressive (I use a low embedding dim of 4 for the latter diffusion stage). Have you achieved promising reconstruction with a 4x8x8 stride and a low embedding dim?

Birdylx avatar Mar 17 '24 14:03 Birdylx
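For context on why that setting is aggressive, the compression implied by a 4x8x8 stride with a latent channel dim of 4 can be worked out directly. This is a back-of-the-envelope sketch, not code from video-gpt or this repo:

```python
# Each latent vector must summarize a 4x8x8 spatio-temporal block of RGB pixels.
stride_t, stride_h, stride_w = 4, 8, 8
in_channels = 3   # RGB
latent_dim = 4    # embedding dim used for the latter diffusion stage

pixels_per_latent = stride_t * stride_h * stride_w        # 256 pixels
ratio = (pixels_per_latent * in_channels) / latent_dim    # input values per latent value

print(ratio)  # 192.0
```

So each latent value stands in for 192 input values, which helps explain why reconstruction quality is hard to maintain at this setting.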