Potential mistake in CogVideoX Paper's Figure 2(a) Regarding 3D VAE Sampling
Hi, thank you for open-sourcing this, it's great work. I am currently reading the CogVideoX paper downloaded from the resources/ folder of this repo. I wonder whether there is a mistake in Figure 2(a): shouldn't the encoder of the 3D VAE compress the video with 2x downsampling instead of 2x upsampling, and, similarly, shouldn't the decoder perform 2x upsampling?
Thank you for carefully pointing out the typo. You are right, and we will correct it.
Additionally, the sentence in the second paragraph of Section 2.1 of the paper (highlighted in blue) seems to be inconsistent with the description of Figure 2.
Should "the first two rounds of downsampling and upsampling" correspond to the top two in the blue box and the bottom two in the yellow box?
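To make the reading I am proposing concrete, here is a rough shape-arithmetic sketch in plain Python. It is only an illustration under my own assumptions, not the repo's implementation: the function names `encoder_shapes` and `decoder_shapes` are hypothetical, I assume three 2x stages in each direction, and I ignore any special handling of the first frame. In this reading, the first two encoder stages downsample both time and space, and the decoder mirrors that, so its last two stages upsample both time and space.

```python
def encoder_shapes(frames, height, width, temporal_stages=2, total_stages=3):
    """Return the (frames, height, width) shape after each 2x downsampling stage.

    Assumption: the first `temporal_stages` stages halve time and space,
    the remaining stages halve space only.
    """
    shapes = [(frames, height, width)]
    for stage in range(total_stages):
        if stage < temporal_stages:
            frames //= 2  # temporal 2x downsample (early stages only)
        height //= 2      # spatial 2x downsample (every stage)
        width //= 2
        shapes.append((frames, height, width))
    return shapes


def decoder_shapes(frames, height, width, temporal_stages=2, total_stages=3):
    """Return the shape after each 2x upsampling stage, mirroring the encoder.

    Assumption: the last `temporal_stages` stages double time and space,
    the earlier stages double space only.
    """
    shapes = [(frames, height, width)]
    for stage in range(total_stages):
        if stage >= total_stages - temporal_stages:
            frames *= 2   # temporal 2x upsample (late stages only)
        height *= 2       # spatial 2x upsample (every stage)
        width *= 2
        shapes.append((frames, height, width))
    return shapes


if __name__ == "__main__":
    latent = encoder_shapes(16, 64, 64)[-1]
    print("encoder:", encoder_shapes(16, 64, 64))
    print("decoder:", decoder_shapes(*latent))
```

If this reading is right, the decoder's shape sequence is exactly the encoder's sequence reversed, which is what I would expect the two boxes in Figure 2 to show.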
You are right, we will update the wording in the official version of the paper to avoid ambiguity.