ST-LLM: Large Language Models Are Effective Temporal Learners

Official implementation of the paper "ST-LLM: Large Language Models Are Effective Temporal Learners".
News :loudspeaker:
- [2024/3/28] All code and weights are available now! Welcome to watch this repository for the latest updates.
Introduction :bulb:
- ST-LLM is a temporally sensitive video large language model. Our model incorporates three key architectural designs:
- (1) Joint spatial-temporal modeling within large language models for effective video understanding.
- (2) A dynamic masking strategy with masked video modeling for efficiency and robustness (see the sketch after this list).
- (3) A global-local input module for long video understanding.
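To make designs (2) and (3) concrete, below is a minimal PyTorch sketch of dynamic token masking and global-local input construction. The tensor shapes, function names, and the masking-ratio range are illustrative assumptions for intuition only, not the repository's actual implementation; see the paper and code for the real details.

```python
import torch

def dynamic_mask(video_tokens: torch.Tensor, mask_ratio: float) -> torch.Tensor:
    """Randomly keep a (1 - mask_ratio) fraction of tokens per sample.

    video_tokens: (B, L, C) flattened spatial-temporal tokens (shapes assumed).
    Returns the kept tokens of shape (B, L_keep, C).
    """
    B, L, C = video_tokens.shape
    len_keep = max(1, int(L * (1.0 - mask_ratio)))
    noise = torch.rand(B, L, device=video_tokens.device)   # per-token random scores
    ids_keep = torch.argsort(noise, dim=1)[:, :len_keep]   # indices of tokens to keep
    return torch.gather(video_tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, C))

def global_local_inputs(frame_feats: torch.Tensor, num_local: int = 4) -> torch.Tensor:
    """Concatenate a global summary (mean over all frames) with a few uniformly
    sampled local frames. frame_feats: (T, N, C) per-frame patch features."""
    T, N, C = frame_feats.shape
    global_part = frame_feats.mean(dim=0, keepdim=True)       # (1, N, C) global summary
    idx = torch.linspace(0, T - 1, steps=num_local).long()    # uniform frame indices
    return torch.cat([global_part, frame_feats[idx]], dim=0)  # (1 + num_local, N, C)

# Toy usage: 2 videos, 8 frames x 32 patches, 4096-dim tokens.
tokens = torch.randn(2, 8 * 32, 4096)
ratio = float(torch.empty(1).uniform_(0.3, 0.7))              # ratio re-sampled per step (assumed range)
print(dynamic_mask(tokens, ratio).shape)
print(global_local_inputs(torch.randn(8, 32, 4096)).shape)    # torch.Size([5, 32, 4096])
```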
- ST-LLM has established new state-of-the-art results on MVBench, VideoChatGPT Bench and VideoQA Bench:
| Method | MVBench | Vcg Avg | Vcg Correct | Vcg Detail | Vcg Context | Vcg Temporal | Vcg Consist | QA MSVD | QA MSRVTT | QA ANet |
|---|---|---|---|---|---|---|---|---|---|---|
| VideoChatGPT | 32.7 | 2.38 | 2.40 | 2.52 | 2.62 | 1.98 | 2.37 | 64.9 | 49.3 | 35.7 |
| LLaMA-VID | - | 2.89 | 2.96 | 3.00 | 3.53 | 2.46 | 2.51 | 69.7 | 57.7 | 47.4 |
| Chat-UniVi | - | 2.99 | 2.89 | 2.91 | 3.46 | 2.89 | 2.81 | 65.0 | 54.6 | 45.8 |
| VideoChat2 | 51.1 | 2.98 | 3.02 | 2.88 | 3.51 | 2.66 | 2.81 | 70.0 | 54.1 | 49.1 |
| ST-LLM | 54.9 | 3.15 | 3.23 | 3.05 | 3.74 | 2.93 | 2.81 | 74.6 | 63.2 | 50.9 |
Demo 🤗
Please download the conversation weights from here and follow the installation instructions first. Then, run the Gradio demo:
CUDA_VISIBLE_DEVICES=0 python3 demo_gradio.py --ckpt-path /path/to/STLLM_conversation_weight
We have also prepared a local script that is easy to modify: demo.py
Examples 👀
- Video Description: for high-difficulty videos with complex scene changes, ST-LLM can accurately describe all the contents.
- Action Identification: ST-LLM can accurately and comprehensively describe the actions occurring in the video.
- Reasoning: for challenging open-ended reasoning questions, ST-LLM can also provide reasonable answers.
Installation 🛠️
Clone our repository, create a Python environment, and activate it via the following commands:
git clone https://github.com/farewellthree/ST-LLM.git
cd ST-LLM
conda create --name stllm python=3.10
conda activate stllm
pip install -r requirement.txt
Training & Validation :bar_chart:
Instructions for data preparation, training, and evaluation can be found in trainval.md.
Acknowledgement 👍
- Video-ChatGPT and MVBench: great contributions to video LLM benchmarking.
- InstructBLIP and MiniGPT4: the codebase and the base image LLMs we built upon.
Citation ✏️
If you find the code and paper useful for your research, please consider starring this repo and citing our paper:
@article{liu2023one,
title={One for all: Video conversation is feasible without video instruction tuning},
author={Liu, Ruyang and Li, Chen and Ge, Yixiao and Shan, Ying and Li, Thomas H and Li, Ge},
journal={arXiv preprint arXiv:2309.15785},
year={2023}
}
@article{liu2024stllm,
title={ST-LLM: Large Language Models Are Effective Temporal Learners},
author={Liu, Ruyang and Li, Chen and Tang, Haoran and Ge, Yixiao and Shan, Ying and Li, Ge},
journal={arXiv preprint arXiv:2404.00308},
year={2024}
}