Strange VAE decoder outputs with different number of frames

Open xrhan opened this issue 5 months ago • 1 comments

System Info

Hi, I found that the 3D VAE decoder (which takes in 1 + 4 * N frames) seems to behave differently depending on whether N is even or odd.

In my test, I input a sequence of 1 + 4 * N frames in which every frame is all zeros except the first. I then encode it, pass the latents through the decoder, and plot the original input against the reconstructed output.
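For anyone who wants to reproduce the input, here is a minimal sketch of how such a test video could be built. The helper name `make_first_frame_only_video`, the (batch, channels, frames, height, width) layout, the frame size, and the [-1, 1] value range are all assumptions for illustration; they are not taken from my actual script.

```python
import numpy as np

def make_first_frame_only_video(n, height=32, width=32, channels=3, seed=0):
    """Return an array of shape (1, C, T, H, W) with T = 1 + 4 * n frames.

    Only frame 0 has content; every later frame is all zeros.
    The (batch, channels, frames, height, width) layout is an assumption.
    """
    num_frames = 1 + 4 * n
    rng = np.random.default_rng(seed)
    video = np.zeros((1, channels, num_frames, height, width), dtype=np.float32)
    # Fill frame 0 with random values in [-1, 1]; the value range is an assumption.
    video[:, :, 0] = rng.uniform(-1.0, 1.0, size=(1, channels, height, width))
    return video

# Frame counts for the cases discussed in this issue.
for n in (1, 2, 3):
    video = make_first_frame_only_video(n)
    parity = "even" if n % 2 == 0 else "odd"
    print(f"N={n} ({parity}): {video.shape[2]} frames")
```

To feed the array to a PyTorch model, it can be converted with `torch.from_numpy(video)`.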

When N is even, for example 9 frames in total (N = 2), the result looks good:

However, when N is odd, for example 5 frames (N = 1) or 13 frames (N = 3), frame 0 is repeated (padded) multiple times in the reconstructed output. Why is this the case?

Information

[ ] The official example scripts
[ ] My own modified scripts and tasks

Reproduction

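# Encode the full (1 + 4 * N)-frame video and scale the sampled latents.
lat_dist = vae.encode(video).latent_dist
latents = lat_dist.sample() * vae.config.scaling_factor

# Encode video_single the same way (its latents are not used in the decode below).
img_latents_dist = vae.encode(video_single).latent_dist
img_latents = img_latents_dist.sample() * vae.config.scaling_factor

# Undo the scaling and decode the video latents back to pixel space.
recon = vae.decode(latents / vae.config.scaling_factor).sample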

I then plot the video frames and the reconstructed frames.

Expected behavior

Input frames and reconstructed frames should match.
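Besides plotting, the mismatch could also be checked numerically. The sketch below is one possible way to do that, not part of my original script: `per_frame_mse` and `leading_repeats` are hypothetical helpers, the (batch, channels, frames, height, width) layout and the tolerance are assumptions, and the "reconstruction" in the demo is synthetic data that only imitates the repeated-first-frame symptom.

```python
import numpy as np

def per_frame_mse(video, recon):
    """Mean squared error per frame for arrays shaped (B, C, T, H, W)."""
    video = np.asarray(video, dtype=np.float64)
    recon = np.asarray(recon, dtype=np.float64)
    # Average over batch, channel, height, and width; keep the frame axis.
    return ((video - recon) ** 2).mean(axis=(0, 1, 3, 4))

def leading_repeats(recon, tol=1e-3):
    """Count consecutive frames after frame 0 that are near-copies of frame 0."""
    recon = np.asarray(recon, dtype=np.float64)
    first = recon[:, :, 0]
    count = 0
    for t in range(1, recon.shape[2]):
        # Stop at the first frame that clearly differs from frame 0.
        if np.mean((recon[:, :, t] - first) ** 2) > tol:
            break
        count += 1
    return count

# Synthetic demo: 5 frames, only frame 0 is nonzero in the input.
video = np.zeros((1, 3, 5, 4, 4))
video[:, :, 0] = 1.0

# Fake "reconstruction" in which frame 0 leaks into frames 1 and 2.
fake_recon = video.copy()
fake_recon[:, :, 1:3] = 1.0

print(per_frame_mse(video, fake_recon))
print(leading_repeats(fake_recon))
```

If `recon` is a PyTorch tensor, it can be converted first with `recon.detach().cpu().numpy()`. A per-frame error that is large only on the frames right after frame 0 would match what the plots show for odd N.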

Jul 09 '25 00:07 xrhan

latent by 4 frames

Jul 28 '25 09:07 gityihang