LLaVA-NeXT
About the LLaVA-OneVision 0.5B Visual tokens
I am re-evaluating LLaVA-OneVision 0.5B on ActivityNet-QA and trying to reproduce the reported 50.5%. I load the model checkpoint with the following code:
import warnings
from llava.model.builder import load_pretrained_model

warnings.filterwarnings("ignore")

pretrained = "lmms-lab/llava-onevision-qwen2-0.5b-ov"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"

# Any other llava_model_args can be passed in here as well
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)
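For completeness, this is roughly how I probe the raw per-frame features afterwards. It is a minimal sketch with a random stand-in tensor for the 32 sampled frames; get_vision_tower() and get_model().mm_projector are the attribute names I see in the LLaVA-NeXT codebase, so treat them as assumptions if your version differs:

import torch

model.eval()
with torch.no_grad():
    # Stand-in for 32 preprocessed video frames (SigLIP so400m expects 384x384 inputs)
    frames = torch.randn(32, 3, 384, 384, device=device, dtype=torch.float16)
    patch_feats = model.get_vision_tower()(frames)                # [32, 729, 1152] SigLIP patch features
    visual_tokens = model.get_model().mm_projector(patch_feats)   # [32, 729, 896] after projection to the LM width
    print(visual_tokens.shape)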
It uses google/siglip-so400m-patch14-384 as the vision backbone. I evaluate the model with the following hyper-parameters:
--for_get_frames_num 32 \
--mm_spatial_pool_stride 2 \
--mm_spatial_pool_mode average \
--mm_newline_position no_token \
--overwrite True \
The output of the visual backbone is torch.Size([32, 729, 896]), but I notice that each video frame is encoded into 169 tokens after self.get_2dPool, instead of the 196 tokens mentioned here. Could you please confirm which hyper-parameters were used to reach 50.5% on ActivityNet-QA? Thanks!
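For context, here is a minimal sketch of the arithmetic that I think explains the 169 vs. 196 gap. This is not the repo's actual get_2dPool code, only an illustration of floor vs. ceil handling of the odd 27x27 grid under a stride-2 pool:

import math
import torch
import torch.nn.functional as F

feats = torch.randn(32, 729, 896)                        # [frames, 27*27 patch tokens, hidden]
grid = feats.view(32, 27, 27, 896).permute(0, 3, 1, 2)   # [32, 896, 27, 27]

# mm_spatial_pool_mode=average: plain avg pooling floors 27/2 -> 13, i.e. 13*13 = 169 tokens per frame
avg = F.avg_pool2d(grid, kernel_size=2, stride=2)
print(avg.shape[-2] * avg.shape[-1])                      # 169

# A bilinear resize to ceil(27/2) = 14 would instead give 14*14 = 196 tokens per frame
target = math.ceil(27 / 2)
bil = F.interpolate(grid, size=(target, target), mode="bilinear")
print(bil.shape[-2] * bil.shape[-1])                      # 196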