Where does the zero-padding of images happen in the code?
The report_03 says:

> For both stage 1 and stage 2 training, we adopt 20% images and 80% videos. Following Magvit-v2, we train videos using 17 frames, while zero-padding the first 16 frames for images. However, we find that this setting leads to blurring of videos whose length differs from 17 frames. Thus, in stage 3, we use a random number of frames within 34 for mixed video length training (i.e., zero-pad the first 34-n frames if we want to train an n-frame video), to make our VAE more robust to different video lengths. Our training and inference code is available in the Open-Sora 1.2 release.
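For reference, the scheme the report describes (pad an n-frame clip with 34-n leading zero frames) would look roughly like the sketch below. This is my own illustration of the described behavior, not code from the repo; `zero_pad_front` is a hypothetical helper.

```python
import torch
import torch.nn.functional as F

def zero_pad_front(video: torch.Tensor, target_frames: int) -> torch.Tensor:
    """Zero-pad a (C, T, H, W) video at the FRONT of the time dim up to target_frames."""
    pad = target_frames - video.shape[1]
    # F.pad fills dims from the last one backward:
    # (W_left, W_right, H_left, H_right, T_left, T_right)
    return F.pad(video, (0, 0, 0, 0, pad, 0))

clip = torch.randn(3, 9, 64, 64)            # a 9-frame clip
padded = zero_pad_front(clip, 34)
print(padded.shape)                          # torch.Size([3, 34, 64, 64])
print(padded[:, :25].abs().sum().item())     # 0.0 -- the first 34-9=25 frames are zeros
```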
However, I did not find the zero-padding operation in the code. Is there a mistake in the doc? How do you handle image-video mixed training exactly?
I list some of my findings below.
For VAE training, this line REPEATS the image along the T dim:

```python
# repeat
video = image.unsqueeze(0).repeat(self.num_frames, 1, 1, 1)
```
For diffusion training, images directly gain a T dim of size 1 WITHOUT any expansion, through this line:

```python
# repeat
video = image.unsqueeze(0)
```
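To make the contrast concrete, here is a minimal sketch of the shapes the two paths produce (assuming `num_frames = 17` and a CHW image; neither path produces zero-padded frames):

```python
import torch

num_frames = 17
image = torch.randn(3, 32, 32)  # (C, H, W)

# VAE training path: tile the same image along a new time axis
vae_video = image.unsqueeze(0).repeat(num_frames, 1, 1, 1)
print(vae_video.shape)  # torch.Size([17, 3, 32, 32]) -- 17 identical frames

# Diffusion training path: a single-frame "video"
dit_video = image.unsqueeze(0)
print(dit_video.shape)  # torch.Size([1, 3, 32, 32])
```

So the VAE path repeats the image 17 times rather than zero-padding 16 frames, and the diffusion path keeps a single frame; neither matches the report's description.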
In VAE_Temporal, there is a padding operation in these lines:

```python
time_padding = (
    0
    if (x.shape[2] % self.time_downsample_factor == 0)
    else self.time_downsample_factor - x.shape[2] % self.time_downsample_factor
)
x = pad_at_dim(x, (time_padding, 0), dim=2)
```
However, this only pads T up to the next multiple of time_downsample_factor (4); it does not pad up to micro_frame_size (17).
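The padding-length formula above can be checked in isolation (a pure-Python restatement of the snippet, with the factor fixed to 4):

```python
def time_padding(t: int, factor: int = 4) -> int:
    """Leading pad so that t + padding is a multiple of factor."""
    return 0 if t % factor == 0 else factor - t % factor

for t in (16, 17, 33, 34):
    print(t, "->", time_padding(t))
# 16 -> 0, 17 -> 3, 33 -> 3, 34 -> 2
```

For example, a 17-frame input is padded by only 3 frames (to 20, a multiple of 4), which is unrelated to padding images up to micro_frame_size.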
This issue is stale because it has been open for 7 days with no activity.
Any update on this?
This issue was closed because it has been inactive for 7 days since being marked as stale.