ST-LLM: Large Language Models Are Effective Temporal Learners

Official implementation of the paper "ST-LLM: Large Language Models Are Effective Temporal Learners".
News :loudspeaker:
- [2024/3/28] All code and weights are available now! Welcome to watch this repository for the latest updates.
Introduction :bulb:
- ST-LLM is a temporally sensitive video large language model. Our model incorporates three key architectural designs:
- (1) Joint spatial-temporal modeling within large language models for effective video understanding.
- (2) A dynamic masking strategy with masked video modeling for efficiency and robustness (see the sketch after this list).
- (3) A global-local input module for long video understanding.
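To make designs (2) and (3) concrete, below is a minimal PyTorch sketch of dynamic token masking and global-local input construction. The tensor shapes, function names, and the masking-ratio range are illustrative assumptions for intuition only, not the repository's actual implementation; see the paper and code for the real details.

```python
import torch

def dynamic_mask(video_tokens: torch.Tensor, mask_ratio: float) -> torch.Tensor:
    """Randomly keep a (1 - mask_ratio) fraction of tokens per sample.

    video_tokens: (B, L, C) flattened spatial-temporal tokens (shapes assumed).
    Returns the kept tokens of shape (B, L_keep, C).
    """
    B, L, C = video_tokens.shape
    len_keep = max(1, int(L * (1.0 - mask_ratio)))
    noise = torch.rand(B, L, device=video_tokens.device)   # per-token random scores
    ids_keep = torch.argsort(noise, dim=1)[:, :len_keep]   # indices of tokens to keep
    return torch.gather(video_tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, C))

def global_local_inputs(frame_feats: torch.Tensor, num_local: int = 4) -> torch.Tensor:
    """Concatenate a global summary (mean over all frames) with a few uniformly
    sampled local frames. frame_feats: (T, N, C) per-frame patch features."""
    T, N, C = frame_feats.shape
    global_part = frame_feats.mean(dim=0, keepdim=True)       # (1, N, C) global summary
    idx = torch.linspace(0, T - 1, steps=num_local).long()    # uniform frame indices
    return torch.cat([global_part, frame_feats[idx]], dim=0)  # (1 + num_local, N, C)

# Toy usage: 2 videos, 8 frames x 32 patches, 4096-dim tokens.
tokens = torch.randn(2, 8 * 32, 4096)
ratio = float(torch.empty(1).uniform_(0.3, 0.7))              # ratio re-sampled per step (assumed range)
print(dynamic_mask(tokens, ratio).shape)
print(global_local_inputs(torch.randn(8, 32, 4096)).shape)    # torch.Size([5, 32, 4096])
```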
- ST-LLM has established new state-of-the-art results on MVBench, VideoChatGPT Bench and VideoQA Bench:
| Method | MVBench | Vcg Avg | Vcg Correct | Vcg Detail | Vcg Context | Vcg Temporal | Vcg Consist | QA MSVD | QA MSRVTT | QA ANet |
|---|---|---|---|---|---|---|---|---|---|---|
| VideoChatGPT | 32.7 | 2.38 | 2.40 | 2.52 | 2.62 | 1.98 | 2.37 | 64.9 | 49.3 | 35.7 |
| LLaMA-VID | - | 2.89 | 2.96 | 3.00 | 3.53 | 2.46 | 2.51 | 69.7 | 57.7 | 47.4 |
| Chat-UniVi | - | 2.99 | 2.89 | 2.91 | 3.46 | 2.89 | 2.81 | 65.0 | 54.6 | 45.8 |
| VideoChat2 | 51.1 | 2.98 | 3.02 | 2.88 | 3.51 | 2.66 | 2.81 | 70.0 | 54.1 | 49.1 |
| ST-LLM | 54.9 | 3.15 | 3.23 | 3.05 | 3.74 | 2.93 | 2.81 | 74.6 | 63.2 | 50.9 |
Demo 🤗
Please download the conversation weights from here and follow the installation instructions first. Then, run the Gradio demo:
CUDA_VISIBLE_DEVICES=0 python3 demo_gradio.py --ckpt-path /path/to/STLLM_conversation_weight
We have also prepared a local script that is easy to modify: demo.py
Examples 👀
- Video Description: for high-difficulty videos with complex scene changes, ST-LLM can accurately describe all the contents.
- Action Identification: ST-LLM can accurately and comprehensively describe the actions occurring in the video.
- Reasoning: for challenging open-ended reasoning questions, ST-LLM can also provide reasonable answers.
Installation 🛠️
Clone our repository, create a Python environment, and activate it via the following commands:
git clone https://github.com/farewellthree/ST-LLM.git
cd ST-LLM
conda create --name stllm python=3.10
conda activate stllm
pip install -r requirement.txt
Training & Validation :bar_chart:
Instructions for data preparation, training, and evaluation can be found in trainval.md.
Acknowledgement 👍
- Video-ChatGPT and MVBench: great contributions to video LLM benchmarking.
- InstructBLIP and MiniGPT4: the codebase and the base image LLMs we built upon.
Citation ✏️
If you find the code and paper useful for your research, please consider starring this repo and citing our paper:
@article{liu2023one,
title={One for all: Video conversation is feasible without video instruction tuning},
author={Liu, Ruyang and Li, Chen and Ge, Yixiao and Shan, Ying and Li, Thomas H and Li, Ge},
journal={arXiv preprint arXiv:2309.15785},
year={2023}
}
@article{liu2024stllm,
title={ST-LLM: Large Language Models Are Effective Temporal Learners},
author={Liu, Ruyang and Li, Chen and Tang, Haoran and Ge, Yixiao and Shan, Ying and Li, Ge},
journal={arXiv preprint arXiv:2404.00308},
year={2024}
}