Where does the zero-padding of images happen in the code?
The report_03 says:

> For both stage 1 and stage 2 training, we adopt 20% images and 80% videos. Following Magvit-v2, we train videos using 17 frames, while zero-padding the first 16 frames for images. However, we find that this setting leads to blurring of videos whose length differs from 17 frames. Thus, in stage 3, we use a random number of frames within 34 for mixed video length training (i.e., zero-pad the first 34-n frames if we want to train an n-frame video), to make our VAE more robust to different video lengths. Our training and inference code is available in the Open-Sora 1.2 release.
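For reference, the scheme the report describes (pad an n-frame clip with 34-n leading zero frames) would look roughly like the sketch below. This is my own illustration of the described behavior, not code from the repo; `zero_pad_front` is a hypothetical helper.

```python
import torch
import torch.nn.functional as F

def zero_pad_front(video: torch.Tensor, target_frames: int) -> torch.Tensor:
    """Zero-pad a (C, T, H, W) video at the FRONT of the time dim up to target_frames."""
    pad = target_frames - video.shape[1]
    # F.pad fills dims from the last one backward:
    # (W_left, W_right, H_left, H_right, T_left, T_right)
    return F.pad(video, (0, 0, 0, 0, pad, 0))

clip = torch.randn(3, 9, 64, 64)            # a 9-frame clip
padded = zero_pad_front(clip, 34)
print(padded.shape)                          # torch.Size([3, 34, 64, 64])
print(padded[:, :25].abs().sum().item())     # 0.0 -- the first 34-9=25 frames are zeros
```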
However, I did not find the zero-padding operation in the code. Is there a mistake in the doc? How do you handle image-video mixed training exactly?
I list some of my findings below.
For VAE training, this line REPEATS the image along the T dim:

```python
# repeat
video = image.unsqueeze(0).repeat(self.num_frames, 1, 1, 1)
```
For diffusion training, images directly gain a T dim of size 1 WITHOUT any expansion, through this line:

```python
# repeat
video = image.unsqueeze(0)
```
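To make the contrast concrete, here is a minimal sketch of the shapes the two paths produce (assuming `num_frames = 17` and a CHW image; neither path produces zero-padded frames):

```python
import torch

num_frames = 17
image = torch.randn(3, 32, 32)  # (C, H, W)

# VAE training path: tile the same image along a new time axis
vae_video = image.unsqueeze(0).repeat(num_frames, 1, 1, 1)
print(vae_video.shape)  # torch.Size([17, 3, 32, 32]) -- 17 identical frames

# Diffusion training path: a single-frame "video"
dit_video = image.unsqueeze(0)
print(dit_video.shape)  # torch.Size([1, 3, 32, 32])
```

So the VAE path repeats the image 17 times rather than zero-padding 16 frames, and the diffusion path keeps a single frame; neither matches the report's description.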
In VAE_Temporal, there is a padding operation in these lines:

```python
time_padding = (
    0
    if (x.shape[2] % self.time_downsample_factor == 0)
    else self.time_downsample_factor - x.shape[2] % self.time_downsample_factor
)
x = pad_at_dim(x, (time_padding, 0), dim=2)
```
However, this only pads T up to the next multiple of time_downsample_factor (4); it does not pad up to micro_frame_size (17).
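The padding-length formula above can be checked in isolation (a pure-Python restatement of the snippet, with the factor fixed to 4):

```python
def time_padding(t: int, factor: int = 4) -> int:
    """Leading pad so that t + padding is a multiple of factor."""
    return 0 if t % factor == 0 else factor - t % factor

for t in (16, 17, 33, 34):
    print(t, "->", time_padding(t))
# 16 -> 0, 17 -> 3, 33 -> 3, 34 -> 2
```

For example, a 17-frame input is padded by only 3 frames (to 20, a multiple of 4), which is unrelated to padding images up to micro_frame_size.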
This issue is stale because it has been open for 7 days with no activity.
Any update on this?
This issue was closed because it has been inactive for 7 days since being marked as stale.