Strange VAE decoder outputs with different number of frames
System Info
Hi, I found that the 3D VAE decoder (which takes 1 + 4 * N frames) seems to behave differently depending on whether N is even or odd.
In my test, I input a sequence of 1 + 4 * N frames in which every frame is all zeros except the first. I then pass it through the VAE (encode, then decode) and plot the original input frames against the reconstructed output.
When N is even, e.g. 9 frames in total (N = 2), the result looks good.
However, when N is odd, e.g. 5 frames (N = 1) or 13 frames (N = 3), the 0-th frame appears to be repeated (padded) multiple times in the reconstructed output. Why is this the case?
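To make the frame counts above concrete, here is a minimal sketch of the arithmetic. `frame_counts` is a hypothetical helper, not part of any library. The latent length of 1 + N is an assumption that follows from reading "1 + 4 * N" as a 4x temporal compression with the first frame handled on its own; it is not confirmed anywhere in this report.

```python
def frame_counts(n: int, temporal_factor: int = 4) -> tuple[int, int]:
    """Return (num_input_frames, num_latent_frames) for a given N.

    Assumption: the first frame maps to its own latent frame, and each
    following group of `temporal_factor` frames maps to one latent frame.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    num_input_frames = 1 + temporal_factor * n
    num_latent_frames = 1 + n  # assumed, not taken from the VAE source
    return num_input_frames, num_latent_frames


for n in range(1, 4):
    frames, latents = frame_counts(n)
    parity = "even" if n % 2 == 0 else "odd"
    print(f"N={n} ({parity}): {frames} input frames, {latents} latent frames (assumed)")
```

Under this assumption, an odd N gives an even number of latent frames and an even N gives an odd number. Whether that parity difference is related to the repeated first frame is not established here.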
Information
- [ ] The official example scripts
- [ ] My own modified scripts and tasks
Reproduction
lat_dist = vae.encode(video).latent_dist
latents = lat_dist.sample() * vae.config.scaling_factor
img_latents_dist = vae.encode(video_single).latent_dist
img_latents = img_latents_dist.sample() * vae.config.scaling_factor
recon = vae.decode(latents / vae.config.scaling_factor).sample
I then plot the input video frames alongside the reconstructed frames.
Expected behavior
Input frames and reconstructed frames should match.
latent by 4 frames
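Besides plotting, the mismatch could be measured per frame. The following is a rough sketch on toy data, not on real VAE output: `per_frame_mae` is a hypothetical helper, each frame is represented as a flat list of pixel values, and the "bad" reconstruction is made up purely to illustrate the check.

```python
from typing import Sequence


def per_frame_mae(inputs: Sequence[Sequence[float]],
                  recons: Sequence[Sequence[float]]) -> list[float]:
    """Mean absolute error between each input frame and its reconstruction.

    Each frame is a flat sequence of pixel values (hypothetical layout).
    """
    if len(inputs) != len(recons):
        raise ValueError("input and reconstruction must have the same number of frames")
    errors = []
    for src, out in zip(inputs, recons):
        if len(src) != len(out):
            raise ValueError("frames must have the same number of pixels")
        errors.append(sum(abs(a - b) for a, b in zip(src, out)) / len(src))
    return errors


# Toy data: 5 frames of 4 pixels each; only frame 0 is non-zero in the input.
inputs = [[1.0] * 4] + [[0.0] * 4 for _ in range(4)]
# Made-up reconstruction in which frame 0 is repeated into frames 1 and 2.
recons = [[1.0] * 4, [1.0] * 4, [1.0] * 4, [0.0] * 4, [0.0] * 4]

errors = per_frame_mae(inputs, recons)
# Frame indices whose error exceeds an arbitrary threshold.
mismatched = [t for t, e in enumerate(errors) if e > 0.1]
print(mismatched)
```

With real tensors, each frame would first need to be flattened to a list; the exact indexing depends on the tensor layout, which this report does not state.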