CogVideo SAT sampling results are worse than Diffusers sampling results

System Info

Hi CogVideo Team,

First of all, thank you so much for open-sourcing such great models for the community to research text-to-video generative models.

I tried out both the Diffusers and SAT codebases, and I found that the sampling results from SAT are much worse than those from Diffusers. Here are some examples:

Prompt: "A white and orange tabby cat is seen happily darting through a dense garden, as if chasing something. Its eyes are wide and happy as it jogs forward, scanning the branches, flowers, and leaves as it walks. The path is narrow as it makes its way between all the plants. the scene is captured from a ground-level angle, following the cat closely, giving a low and intimate perspective. The image is cinematic with warm tones and a grainy texture. The scattered daylight between the leaves and plants above creates a warm contrast, accentuating the cat’s orange fur. The shot is clear and sharp, with a shallow depth of field."

Diffusers:

https://github.com/user-attachments/assets/7f6db0ee-db6a-4f28-9dd0-288f41d61a43

SAT:

https://github.com/user-attachments/assets/4526c32d-9a65-4c50-ae86-9aace94cb4a2

Prompt: "A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about."

Diffusers:

https://github.com/user-attachments/assets/46788cfd-7986-4d26-b7ae-5b06929b98ed

SAT:

https://github.com/user-attachments/assets/561e52e1-b1e5-49e4-8115-7cda950d9a3b

It would be very kind of the authors to look into this issue, as it would help the research community build exciting projects on CogVideoX. I truly appreciate your help and look forward to your reply.

Best, Jiarui

Information

[X] The official example scripts
[ ] My own modified scripts

Reproduction

Run the provided inference files.

Diffusers: https://github.com/THUDM/CogVideo/blob/main/inference/cli_demo.py

SAT: https://github.com/THUDM/CogVideo/blob/main/sat/inference.sh

Expected behavior

The SAT results are expected to be at the same quality level as the Diffusers results.

Aug 29 '24 05:08 xvjiarui

You need to use bf16 for the 5B model.
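The thread does not say why bf16 is needed. One common reason for preferring bf16 over fp16 (an assumption here, not something confirmed for CogVideoX) is that fp16 has a much smaller numeric range, so large intermediate values overflow. A minimal standard-library sketch of that difference, with the hypothetical helpers `to_fp16` and `to_bf16` written only for illustration:

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE half precision; return +/-inf if it overflows."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the fp16 range (max 65504).
        return math.copysign(math.inf, x)


def to_bf16(x: float) -> float:
    """Emulate bfloat16 by keeping the top 16 bits of the float32 encoding.

    This truncates instead of rounding to nearest, which is enough to
    illustrate range and precision; real hardware typically rounds.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


# A value beyond the fp16 range overflows in fp16 but stays finite in bf16.
print(to_fp16(70000.0), to_bf16(70000.0))

# The trade-off: bf16 keeps fewer mantissa bits, so it is less precise near 1.
print(to_fp16(1.001), to_bf16(1.001))
```

In other words, bf16 trades precision for range, which is why weights trained in bf16 can misbehave when run in fp16.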

Aug 29 '24 10:08 tengjiayan20

Thank you so much for your quick reply. I just fixed that, and the quality seems significantly improved, but it is still a little worse than Diffusers. Do you have any idea why? Or is it just due to random sampling?

https://github.com/user-attachments/assets/785ae31d-41bb-47f9-bd2c-e67755105baf

https://github.com/user-attachments/assets/b351a8b8-28a3-4a77-8de8-f79dd0ea88bc

Aug 31 '24 01:08 xvjiarui

Make sure to compare the original videos generated by Diffusers, not the ones enhanced with super-resolution and interpolation.

If everything has been aligned, then the difference is likely due to randomness. The Diffusers model was migrated from this model’s weights without any additional training.
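To illustrate the randomness point: a sampler's output depends on the initial noise it draws, so two runs only match when they start from the same noise. The sketch below uses Python's standard `random` module as a stand-in for the real noise source, and `initial_noise` is a hypothetical helper, not part of SAT or Diffusers. Note that the two codebases may draw noise through different code paths, so using the same seed number in both would not necessarily produce identical videos.

```python
import random


def initial_noise(seed: int, n: int = 8) -> list[float]:
    """Draw n Gaussian samples from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


# Same seed: identical starting noise, so a deterministic sampler
# would produce the same result.
print(initial_noise(42) == initial_noise(42))

# Different seed: different starting noise, so the outputs differ
# even when every other setting is aligned.
print(initial_noise(42) == initial_noise(43))
```

Comparing several seeds per prompt in each codebase, not a single sample, gives a fairer picture of whether a quality gap is systematic.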

Aug 31 '24 02:08 zRzRzRzRzRzRzR

Thank you so much! I have checked that everything is aligned, so it may be due to randomness.

Sep 01 '24 20:09 xvjiarui
