Wenyi Hong

8 comments by Wenyi Hong

Hi, the different attention channels are calculated independently and are summed later per token rather than per patch. As mentioned in Sec. 3.2 of our paper, the temporal...

Hello, the frames CogVideo initially generates have a resolution of 160×160, and super-resolution upscales them to 480×480. Because CogVideo uses the same VQ-VAE decoder as CogView2, it uses CogView2's super-resolution method directly. For details, see Ding, M., Zheng, W., Hong, W., & Tang, J. (2022). CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers. arXiv preprint arXiv:2204.14217.

We used 13×8 A100 GPUs (104 in total) to train the model. The two stages were trained for ~100k iterations in total, which took ~20 days.

The video samples used in training have multiple frame rates, including 1, 2, 4, and 8 fps. Due to the limitation of GPU memory and the large scale...
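To make the frame-rate point concrete, here is a minimal sketch of picking frame indices from a clip at a lower target frame rate. The function name `sample_frame_indices`, the 24 fps source rate, and the 5-frame clip length are assumptions for illustration only; this is not taken from the CogVideo data pipeline.

```python
import math

def sample_frame_indices(source_fps, target_fps, num_frames, start=0):
    """Return indices of source frames that approximate target_fps.

    Hypothetical helper: the stride between picked frames is
    source_fps / target_fps, floored to the nearest earlier frame.
    """
    if target_fps <= 0 or target_fps > source_fps:
        raise ValueError("target_fps must be in (0, source_fps]")
    stride = source_fps / target_fps
    return [start + math.floor(i * stride) for i in range(num_frames)]

# Assumed 24 fps source clip, sampled at the frame rates named above.
for fps in (1, 2, 4, 8):
    print(fps, sample_frame_indices(24, fps, num_frames=5))
```

A lower target frame rate spreads the same number of frames over a longer stretch of the source video, so each sample covers more motion.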

> generate paired prompts and then use larger batch size (perhaps by accumulation) What's the appropriate batch size, from your experience?
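For readers unfamiliar with the "accumulation" mentioned in the quote, here is a framework-free sketch of the idea: gradients from several small micro-batches are combined so that the result matches one large batch. The toy model (y ≈ w·x with mean squared error), the data, and the function names are all assumptions for illustration. It does not reflect the CogVideo training code and says nothing about which batch size is appropriate.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error for y ≈ w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, data, micro_batch_size):
    """Combine micro-batch gradients, weighting each by its size."""
    total = 0.0
    for i in range(0, len(data), micro_batch_size):
        micro = data[i:i + micro_batch_size]
        total += grad_mse(w, micro) * len(micro)
    return total / len(data)

# Toy data (assumed): four (x, y) pairs.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
full = grad_mse(0.5, data)                                # one batch of 4
accum = accumulated_grad(0.5, data, micro_batch_size=2)   # two batches of 2
print(abs(full - accum) < 1e-9)
```

Only one micro-batch needs to be in memory at a time, which is why accumulation is a common way to reach a larger effective batch size on limited GPU memory.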

Hi, inference takes around 25 GB of GPU memory with batch size 1 (on our A100).

Inference takes around 25 GB of GPU memory with batch size 1 (on our A100).

> hi, what's the runtime problem?