CogVideo Potential mistake in CogVideoX Paper's Figure 2(a) Regarding 3D VAE Sampling

Hi, thank you for open-sourcing this, it's great work. I am currently reading the CogVideoX paper downloaded from the resources/ folder of this repo. I wonder if there is a mistake in Figure 2(a): shouldn't the encoder of the 3D VAE compress the video with 2x downsampling instead of 2x upsampling, and, similarly, shouldn't the decoder perform 2x upsampling?

Aug 06 '24 06:08 rayleichenxi

Thank you for carefully pointing out the typo. You are right, and we will correct it.

Aug 06 '24 06:08 tengjiayan20

[Screenshot 2024-08-07 18 01 56] Additionally, the sentence in the second paragraph of Section 2.1 of the paper (highlighted in blue) seems to be inconsistent with the description of Figure 2. Should "the first two rounds of downsampling and upsampling" correspond to the top two stages in the blue box and the bottom two stages in the yellow box?

Aug 07 '24 10:08 kyrie111

You are right, we will update the wording in the official version of the paper to avoid ambiguity.

Aug 07 '24 12:08 tengjiayan20
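
For readers following the thread, the direction of sampling discussed above can be sketched with plain shape arithmetic. This is only an illustration, not the repo's implementation: the function names `encode_shapes` and `decode_shapes` and the stage layout `ENCODER_STAGES` are hypothetical, and the assumption of three 2x stages, the first two of which also halve the temporal dimension, is not stated in this thread. It also ignores any special handling of the first frame and simply requires divisible sizes.

```python
from typing import List, Tuple

Shape = Tuple[int, int, int]  # (frames, height, width)

# Hypothetical stage layout: (temporal_factor, spatial_factor) per stage.
# Assumption for illustration: three 2x stages, the first two also halve time.
ENCODER_STAGES: List[Tuple[int, int]] = [(2, 2), (2, 2), (1, 2)]


def encode_shapes(shape: Shape, stages=ENCODER_STAGES) -> List[Shape]:
    """Encoder side: every stage DOWNsamples, so each size shrinks."""
    t, h, w = shape
    shapes = [(t, h, w)]
    for t_factor, s_factor in stages:
        if t % t_factor or h % s_factor or w % s_factor:
            raise ValueError(f"shape {(t, h, w)} is not divisible by the stage factors")
        t, h, w = t // t_factor, h // s_factor, w // s_factor
        shapes.append((t, h, w))
    return shapes


def decode_shapes(shape: Shape, stages=ENCODER_STAGES) -> List[Shape]:
    """Decoder side: mirrors the encoder, so every stage UPsamples."""
    t, h, w = shape
    shapes = [(t, h, w)]
    # Reversed order: the spatial-only stage comes first in the decoder,
    # and the two spatio-temporal stages come last.
    for t_factor, s_factor in reversed(stages):
        t, h, w = t * t_factor, h * s_factor, w * s_factor
        shapes.append((t, h, w))
    return shapes


if __name__ == "__main__":
    latent = encode_shapes((8, 64, 64))[-1]
    print("encoder:", encode_shapes((8, 64, 64)))
    print("decoder:", decode_shapes(latent))
```

Under these assumptions, the stages that change the temporal dimension are the first two in the encoder and the last two in the decoder, which is the reading proposed in the second question above.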