LLaVA-NeXT

Llava-video slowfast mode

Open HYUNJS opened this issue 1 year ago • 11 comments

Thank you for sharing this great work. I'd like to ask about some mismatches between the current codebase and the arXiv technical report.

  1. SlowFast mode
  • Is the slowfast representation used only at inference time, as done by Xu et al. (2024b) at Apple? Or is it used for both training and inference?
  • The current code does not run slowfast mode by default. May I ask why?
  2. Time Instruction
  • I cannot find any mention of "time instruction" in the technical report, but the code uses it by default. Is it also used only at inference time?

HYUNJS avatar Oct 14 '24 10:10 HYUNJS

Same question. But I think slow-fast is used only for inference, as done in Xu et al. (2024b). Additionally, in the pre-trained model Video-7B-Qwen2, `add_faster_video` is set to `False`.

countytown avatar Oct 15 '24 02:10 countytown

I'm concerned about that, too. I found the following statement in the paper: "We consider the same video representation configurations for the training and inference stages. On 128 NVIDIA H100 GPUs, the video representations for LLaVA-Video-7B and LLaVA-Video-72B are V = (64, 679, 1, 2) and V = (64, 679, 3, 2), respectively." Maybe the slowfast mode is also used in the 72B model's training stage, not only in its inference stage?

yuanrr avatar Oct 15 '24 02:10 yuanrr

Also, I'm curious why it is not used on the 7B model.

yuanrr avatar Oct 15 '24 02:10 yuanrr

> Maybe the slowfast mode is also used in the 72B model's training stage, not only in its inference stage?

Based on the config.json of the 72B model, it seems that slowfast is not used at training time (`add_faster_video: false`):

https://huggingface.co/lmms-lab/LLaVA-Video-72B-Qwen2/blob/main/config.json#L3
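For anyone who wants to verify this locally, a minimal sketch of reading the flag from the downloaded config.json (the JSON excerpt below is hypothetical and shows only the key under discussion, not the full file):

```python
import json

# Hypothetical excerpt of the 72B checkpoint's config.json; the real
# file contains many more keys.
config_excerpt = '{"add_faster_video": false}'

config = json.loads(config_excerpt)

# False here means the slow-fast (faster-video) branch was disabled
# for this checkpoint.
print(config["add_faster_video"])  # -> False
```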

HYUNJS avatar Oct 15 '24 04:10 HYUNJS

> > Maybe the slowfast mode is also used in the 72B model's training stage, not only in its inference stage?
>
> Based on the config.json of the 72B model, it seems that slowfast is not used at training time (`add_faster_video: false`):
>
> https://huggingface.co/lmms-lab/LLaVA-Video-72B-Qwen2/blob/main/config.json#L3

Yes, I also found that. It is a mismatch.

yuanrr avatar Oct 15 '24 04:10 yuanrr

I have the same question, has anyone found the answer?

zhipeixu avatar Mar 11 '25 04:03 zhipeixu

Same question here! Confused about the paper/code discrepancy.

sam-motamed avatar Apr 18 '25 13:04 sam-motamed

Hi all,

As mentioned in the paper, the slow-fast approach was only applied during the training of the 72B model.

However, we did not adopt the slow-fast configuration during inference for either the 7B or 72B models when reporting academic benchmark performance (in the paper). This is because we found that the slow-fast setup is primarily beneficial in scenarios where the number of frames significantly exceeds the number of tokens per frame, such as VideoMME-Long. On other benchmarks, it sometimes led to performance regressions.

Since our evaluation benchmarks include both short and long videos—with short videos being the majority—we chose not to use the slow-fast configuration in the default inference setup. Otherwise, one would need to manually enable or disable slow-fast for each benchmark, which adds complexity. Our default setting simplifies implementation and ensures consistent evaluation across all datasets, making the work easier to reproduce.

That said, we encourage you to enable the slow-fast approach if your use case involves longer videos and aligns with the conditions described above.
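To illustrate the trade-off described above, here is some rough token-budget arithmetic (the frame and token counts are assumptions for the sake of example, not the exact LLaVA-Video configuration):

```python
# Illustrative token-budget arithmetic: total visual tokens grow
# linearly with frame count at a fixed per-frame resolution.
def total_tokens(num_frames, tokens_per_frame):
    return num_frames * tokens_per_frame

# A short clip: few frames, so per-frame detail dominates the budget.
short = total_tokens(num_frames=16, tokens_per_frame=196)   # 3136 tokens

# A long video at the same per-frame token count quickly strains a
# typical context window; this is where a slow-fast split (full tokens
# for a few "slow" frames, heavily pooled tokens for the rest) pays off.
long = total_tokens(num_frames=256, tokens_per_frame=196)   # 50176 tokens

print(short, long)  # -> 3136 50176
```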

ZhangYuanhan-AI avatar Apr 18 '25 17:04 ZhangYuanhan-AI

Thanks @ZhangYuanhan-AI for the answer on this. So for the 7B model, only 10 frames are given to the model with full tokens (no pooling) per frame?

sam-motamed avatar Apr 18 '25 18:04 sam-motamed

The basic setting is a pooling stride of 2.
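For concreteness, a stride-2 spatial pooling over a ViT-style patch grid cuts the per-frame token count roughly 4x. A sketch, where the 27x27 grid is an assumption for illustration rather than the exact LLaVA-Video vision-tower output:

```python
import math

# Tokens remaining after pooling an H x W patch grid with a given stride.
def tokens_after_pooling(grid_h, grid_w, stride):
    return math.ceil(grid_h / stride) * math.ceil(grid_w / stride)

full = 27 * 27                                   # 729 tokens per frame
pooled = tokens_after_pooling(27, 27, stride=2)  # 14 * 14 = 196 tokens

print(full, pooled)  # -> 729 196
```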

ZhangYuanhan-AI avatar Apr 19 '25 04:04 ZhangYuanhan-AI

One last question @ZhangYuanhan-AI: does the 7B-Qwen model use 10 frames during training, or does it sample a varying number of frames based on fps?

sam-motamed avatar Apr 21 '25 13:04 sam-motamed