VLMEvalKit Fix Qwen Omni when using audio in video

Set default nframe=None so that Qwen Omni can use its original video understanding utils.
Add message type = audio to support separate video and audio input for Qwen Omni.
Unify control through self.use_audio_in_video for convenience.
Fix a bug in the existing code where audio info was not passed into the processor.
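
As an illustration of the separate video and audio inputs described above, here is a minimal sketch of what such a message could look like. It assumes the list-of-dicts message style with `type` and `value` keys; the helper name `build_omni_message` and the file paths are hypothetical and are not taken from this PR.

```python
# Hypothetical helper (not part of the PR): builds a message list in which
# the video and its audio track are passed as two separate entries.
def build_omni_message(video_path, question, audio_path=None):
    # The video entry always comes first.
    message = [dict(type="video", value=video_path)]
    # Assumption: a separate audio file is expressed as type="audio".
    if audio_path is not None:
        message.append(dict(type="audio", value=audio_path))
    # The text prompt closes the message.
    message.append(dict(type="text", value=question))
    return message


if __name__ == "__main__":
    for item in build_omni_message("clip.mp4", "What is being said?", "clip.wav"):
        print(item["type"], item["value"])
```

When no separate audio file is given, the sketch simply omits the audio entry, leaving it to a flag such as use_audio_in_video to decide whether the audio track inside the video is used.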

Apr 28 '25 04:04 Mercury7353

Hi @Mercury7353. Thank you for your contribution to our codebase. I still have one question about "Set default nframe=None to help Qwen Omni use its original video understanding utils": our codebase takes the nframe setting from the video dataset and uses it to override the nframe setting defined in the Qwen model. Unless nframe is also set to None in the video dataset config, frames will be sampled according to the dataset config. So, to use Qwen-Omni's original video processing, it would be better to pass only the video data_path (without nframe or fps) to the model, but that conflicts with our current setup, so we cannot do that.
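
The precedence described here can be sketched roughly as follows. This is only an illustration under assumptions: `resolve_nframe` is a hypothetical name, not a function in the codebase, and the fallback to the model setting when the dataset gives no value is assumed rather than confirmed.

```python
from typing import Optional


def resolve_nframe(model_nframe: Optional[int],
                   dataset_nframe: Optional[int]) -> Optional[int]:
    """Hypothetical illustration of which nframe setting takes effect."""
    # A dataset-level nframe overrides whatever the model config says.
    if dataset_nframe is not None:
        return dataset_nframe
    # Assumption: with no dataset value, the model's own setting is kept.
    # None here means "let the model's native video pipeline decide".
    return model_nframe


if __name__ == "__main__":
    # Model asks for native processing, but the dataset still forces 32 frames.
    print(resolve_nframe(None, 32))
    # Only when both are None does native processing apply.
    print(resolve_nframe(None, None))
```

Under this reading, setting nframe=None in the model config alone is not enough when the dataset config specifies a frame count.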

Also, what command did you use to reproduce the WorldSense score for Qwen2.5-Omni? I would like to try it myself.

Apr 28 '25 13:04 FangXinyu-0913

Yes. I have reproduced the Qwen-Omni score on WorldSense with this code. It is 45.5.

Apr 30 '25 06:04 Mercury7353

The command is:

    python run.py --data WorldSense_32frame --model Qwen2.5-Omni-7B

But I set nframe to None in the model config:

    "Qwen2.5-Omni-7B": partial(
        Qwen2VLChat,
        model_path="Qwen/Qwen2.5-Omni-7B",
        min_pixels=1280 * 28 * 28,
        max_pixels=16384 * 28 * 28,
        use_custom_prompt=False,
        use_audio_in_video=True,  # use the audio track in the video
        nframe=None,  # disable nframe
    ),

Apr 30 '25 06:04 Mercury7353

Hello, the performance I measured on image-text datasets is much worse than the official results. Why is that?

May 12 '25 09:05 WenmuZhou