
Unable to reproduce CogVideo-2B / 5B T2V results with Diffusers

Open dnwjddl opened this issue 8 months ago • 5 comments

Hi, thank you for your great work!

I tried to replicate the text-to-video results of CogVideo-2B and CogVideo-5B using the Hugging Face diffusers implementation, but the scores differ noticeably from those reported in the project (including the examples in the linked Google Drive).

- framework: diffusers (latest main)
- seeds tested: 0–4
- benchmark: VBench (default settings)

I did not modify any code or hyperparameters from the Hugging Face Diffusers implementation. Is there an internal config, hidden flag, or updated checkpoint that I should be using to match the published results?

Any pointers would be greatly appreciated, I'd like to reproduce the Google Drive outputs as closely as possible.

Thanks!

dnwjddl avatar May 04 '25 11:05 dnwjddl

Thanks for the question. You should use VBench-Long to evaluate these two models.

Use VBench to evaluate videos shorter than 5.0 seconds, and VBench-Long for videos 5.0 seconds or longer; these two models appear on both the VBench and VBench-Long sub-leaderboards. Each benchmark is optimized for its respective video length to ensure fair and consistent evaluation.
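The length rule above can be sanity-checked in a couple of lines. The 49-frame / 8-fps figures below are the commonly used CogVideoX export settings, not something stated in this comment:

```python
def pick_benchmark(duration_s: float) -> str:
    """Select the evaluation suite from clip length, using the 5.0 s threshold above."""
    return "VBench-Long" if duration_s >= 5.0 else "VBench"

# Assuming the usual CogVideoX export of 49 frames at 8 fps:
duration = 49 / 8  # 6.125 s
print(pick_benchmark(duration))  # -> VBench-Long
```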

ziqihuangg avatar May 04 '25 15:05 ziqihuangg

Thanks for the quick clarification about VBench and VBench-Long!

I’m now trying to reproduce the demo clips on the Google Drive (https://drive.google.com/drive/folders/1bm7zSu2wxT7wY8IpitmJc_7qi8EYMCim), but the videos I generate come out noticeably lower in visual quality. For context, I’m on the latest diffusers (main) and haven’t touched any settings beyond what ships with the checkpoint:

import torch
from diffusers import CogVideoXPipeline

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.bfloat16,
).to("cuda")

# tested with seeds 0–4
generator = torch.Generator(device="cuda").manual_seed(seed)
video = pipe(prompt, generator=generator).frames

No custom scheduler, guidance scale, FPS, or other hyperparameters, just the defaults.

Could you let me know if there’s a config file or updated weights you used for the Drive demos?

Thanks a lot for your time!

dnwjddl avatar May 04 '25 17:05 dnwjddl

We followed the settings provided in the official GitHub repo, using the default parameter values defined in inference/cli_demo.py as a reference, modifying only the following arguments: `height=480`, `width=720`, `num_frames=49`, and `fps=8`.
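For reference, here is a minimal sketch of how those four settings map onto the public diffusers `CogVideoXPipeline` API. The model ID, prompt, and output path are placeholders, not confirmed by this reply; note that `fps` is not a pipeline argument in diffusers and is instead applied at export time:

```python
# The four overridden arguments from the reply; all other parameters stay at
# their cli_demo.py defaults.
DEMO_SETTINGS = {"height": 480, "width": 720, "num_frames": 49}
EXPORT_FPS = 8  # applied when writing the video file, not inside the pipeline


def render_demo(prompt: str, seed: int = 0, out_path: str = "demo.mp4") -> None:
    """Generate one clip with the settings above (requires a CUDA GPU)."""
    # Imported lazily so the settings above remain usable without a GPU.
    import torch
    from diffusers import CogVideoXPipeline
    from diffusers.utils import export_to_video

    pipe = CogVideoXPipeline.from_pretrained(
        "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16
    ).to("cuda")
    frames = pipe(
        prompt=prompt,
        generator=torch.Generator(device="cuda").manual_seed(seed),
        **DEMO_SETTINGS,
    ).frames[0]
    export_to_video(frames, out_path, fps=EXPORT_FPS)
```

At 49 frames and 8 fps the exported clips run 6.125 s, which is consistent with the earlier advice to evaluate these models with VBench-Long.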

Jacky-hate avatar May 06 '25 11:05 Jacky-hate

Unfortunately, I’m still unable to reproduce the results, and the file “cogvideox-5b diffusers” is no longer available at the Google Drive link you shared (https://drive.google.com/drive/folders/1bm7zSu2wxT7wY8IpitmJc_7qi8EYMCim). :-(

> We followed the settings provided in the official GitHub repo, using the default parameter values defined in inference/cli_demo.py as a reference — only modifying the following arguments: height=480, width=720, num_frames=49, and fps=8.

Could you share the output videos generated by your code? I’d like to compare them with my results to pinpoint what might be going wrong. :)

dnwjddl avatar May 11 '25 09:05 dnwjddl

Sorry for the inconvenience! Here's the link to the sampled video output: https://drive.google.com/file/d/1FSAccPXyJR_uw5ldkQJAMzIVphLRuh39/view?usp=drive_link

Jacky-hate avatar May 16 '25 07:05 Jacky-hate