
Why the Human_Clothes model can't output normally

Open · 2016110071 opened this issue 5 months ago • 3 comments

When I run inference on my videos for the "Human_Clothes" dimension, it does not produce normal output; it just outputs "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!".
How can I solve this?

2016110071 · Jul 28 '25 03:07

Hi, the code runs correctly on our machine. Could you please provide the specific execution commands you used, so that we can better identify the issue?
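For reference, running this dimension through the Python interface typically looks roughly like the sketch below. This is only an illustrative sketch: the `vbench2.VBench2` class name, the info-JSON path, and the argument names are assumed by analogy with the VBench 1.0 API, so please follow the exact interface documented in the VBench-2.0 README.

```python
import torch
# Sketch only: "vbench2.VBench2" and the argument names below are assumptions
# modeled on the VBench 1.0 API, not confirmed against the VBench-2.0 code.
from vbench2 import VBench2

device = torch.device("cuda")

evaluator = VBench2(
    device,
    "VBench2_full_info.json",   # placeholder: dimension info file shipped with the repo
    "evaluation_results/",      # placeholder: output directory
)
evaluator.evaluate(
    videos_path="sampled_videos/",   # placeholder: folder containing the generated videos
    name="human_clothes_test",
    dimension_list=["Human_Clothes"],
)
```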

zhengdian1 · Jul 28 '25 12:07

> Hi, the code runs correctly on our machine. Could you please provide the specific execution commands you used, so that we can better identify the issue?

I ran the code without changes, but got the following warnings:

    Loaded LLaVA model: LLaVA-Video-7B-Qwen2
    You are using a model of type llava to instantiate a model of type llava_qwen. This is not supported for all configurations of models and can yield errors.
    Loading vision tower: siglip-so400m-patch14-384
    You are using a model of type siglip_vision_model to instantiate a model of type clip_vision_model. This is not supported for all configurations of models and can yield errors.
    Some weights of CLIPVisionModel were not initialized from the model checkpoint at siglip-so400m-patch14-384 and are newly initialized: ['vision_model.embeddings.class_embedding', 'vision_model.pre_layrnorm.bias', 'vision_model.pre_layrnorm.weight']
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
    Some weights of CLIPVisionModel were not initialized from the model checkpoint at siglip-so400m-patch14-384 and are newly initialized because the shapes did not match:
    • vision_model.embeddings.position_embedding.weight: found shape torch.Size([729, 1152]) in the checkpoint and torch.Size([730, 1152]) in the model instantiated
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
    2025-07-30 20:56:28,724 - accelerate.utils.modeling - INFO - We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set max_memory in to a higher value to use more memory (at your own risk).

And then the output is just "!!!!!!!!!!!!!".
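The warnings above indicate that the SigLIP checkpoint is being loaded through `CLIPVisionModel`, so several weights (class embedding, pre-layernorm, position embeddings) are randomly re-initialized; with a broken vision tower, degenerate output such as a string of "!" is not surprising. One quick way to check whether the locally downloaded checkpoint itself is intact is sketched below; `<local_path>` is a placeholder for wherever `siglip-so400m-patch14-384` was downloaded, and this is an illustrative check rather than part of the VBench-2.0 code.

```python
from transformers import AutoConfig, SiglipVisionModel

# Placeholder: point this at the local siglip-so400m-patch14-384 directory.
local_path = "<local_path>/siglip-so400m-patch14-384"

# A healthy download should declare itself as a SigLIP model.
cfg = AutoConfig.from_pretrained(local_path)
print(cfg.model_type)  # expected: "siglip"

# Loading through the SigLIP class should not re-initialize any weights,
# and the position embedding should match the shape reported in the warning.
vision_tower = SiglipVisionModel.from_pretrained(local_path)
print(vision_tower.vision_model.embeddings.position_embedding.weight.shape)
# expected: torch.Size([729, 1152]) -- 27x27 patches, no class token
```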

2016110071 · Jul 30 '25 13:07

It seems that SigLIP is not loading correctly. Please check whether the model was downloaded correctly. The correct loading output looks like the following:

    Loaded LLaVA model: /mnt/petrelfs/xx/.cache/vbench2/lmms-lab/LLaVA-Video-7B-Qwen2
    Loading vision tower: google/siglip-so400m-patch14-384
    Loading checkpoint shards: 100%|██████████| 4/4 [00:45<00:00, 11.43s/it]
    Model Class: LlavaQwenForCausalLM
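Note that in the correct log the vision tower is resolved as `google/siglip-so400m-patch14-384`, whereas the failing log above shows a bare `siglip-so400m-patch14-384` that ends up going through `CLIPVisionModel`. If the local copy turns out to be missing or incomplete, re-downloading the SigLIP checkpoint from the Hub is usually enough. A minimal sketch, assuming `huggingface_hub` is installed; the target directory is a placeholder and should match wherever your VBench-2.0 setup expects the vision tower:

```python
from huggingface_hub import snapshot_download

# Placeholder: point this at the cache directory your setup uses
# (e.g. the vbench2 cache folder shown in the log above).
target_dir = "<your_cache_dir>/google/siglip-so400m-patch14-384"

# Re-fetch the full SigLIP checkpoint (config, processor, and weights).
snapshot_download(
    repo_id="google/siglip-so400m-patch14-384",
    local_dir=target_dir,
)
```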

zhengdian1 · Jul 30 '25 13:07