TeViT
Temporally Efficient Vision Transformer for Video Instance Segmentation, CVPR 2022, Oral
by Shusheng Yang1,3, Xinggang Wang1 :email:, Yu Li4, Yuxin Fang1, Jiemin Fang1,2, Wenyu Liu1, Xun Zhao3, Ying Shan3.
1 School of EIC, HUST, 2 AIA, HUST, 3 ARC Lab, Tencent PCG, 4 IDEA.
(:email:) corresponding author.
- This repo provides code, models and training/inference recipes for TeViT (Temporally Efficient Vision Transformer for Video Instance Segmentation).
- TeViT is a transformer-based end-to-end video instance segmentation framework. We build our framework upon query-based instance segmentation methods, i.e., QueryInst.
- We propose a messenger shift mechanism in the transformer backbone, as well as a spatiotemporal query interaction head in the instance heads. These two designs fully utilize both frame-level and instance-level temporal context information and obtain strong temporal modeling capacity with negligible extra computational cost.
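As a rough illustration of the messenger shift idea, the sketch below treats each frame as carrying a small set of messenger tokens and exchanges them between neighbouring frames. The grouping (first half taken from the previous frame, second half from the next), the cyclic wrap at clip boundaries, and the function name `messenger_shift` are all assumptions made for illustration; they are not taken from this repository's implementation.

```python
def messenger_shift(messengers):
    """Exchange messenger tokens between neighbouring frames.

    messengers[t][m] is the m-th messenger token of frame t.
    Assumed scheme (illustrative only): the first half of the tokens is
    taken from the previous frame, the second half from the next frame,
    wrapping around at the ends of the clip.
    """
    num_frames = len(messengers)
    num_tokens = len(messengers[0])
    half = num_tokens // 2

    shifted = []
    for t in range(num_frames):
        row = []
        for m in range(num_tokens):
            # Forward group reads from frame t-1, backward group from t+1.
            src = t - 1 if m < half else t + 1
            row.append(messengers[src % num_frames][m])
        shifted.append(row)
    return shifted
```

Because the operation only re-indexes existing tokens, it adds no parameters and essentially no computation, which is consistent with the "negligible extra computational cost" described above.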

Models and Main Results
- We provide both checkpoints and CodaLab server submissions on the YouTube-VIS-2019 dataset.
Name | AP | AP@50 | AP@75 | AR@1 | AR@10 | Params | model | submission |
---|---|---|---|---|---|---|---|---|
TeViT_MsgShifT | 46.3 | 70.6 | 50.9 | 45.2 | 54.3 | 161.83 M | link | link |
TeViT_MsgShifT_MST | 46.9 | 70.1 | 52.9 | 45.0 | 53.4 | 161.83 M | link | link |
- Because training is unstable, we conducted multiple runs; each checkpoint above is the best among its runs. The average performance is reported in our paper.
- Besides the basic models, we also provide TeViT with ResNet-50 and Swin-L backbones; these models are also trained on the YouTube-VIS-2019 dataset.
- MST denotes multi-scale training.
Name | AP | AP@50 | AP@75 | AR@1 | AR@10 | Params | model | submission |
---|---|---|---|---|---|---|---|---|
TeViT_R50 | 42.1 | 67.8 | 44.8 | 41.3 | 49.9 | 172.3 M | link | link |
TeViT_Swin-L_MST | 56.8 | 80.6 | 63.1 | 52.0 | 63.3 | 343.86 M | link | link |
- Due to backbone limitations, TeViT models with ResNet-50 and Swin-L backbones use the STQI Head only (i.e., without our proposed messenger shift mechanism).
- With Swin-L as the backbone network, we use more instance queries (i.e., 300 instead of 100) and stronger data augmentation strategies. Both further boost the final performance.
Installation
Prerequisites
- Linux
- Python 3.7+
- CUDA 10.2+
- GCC 5+
Prepare
- Clone the repository locally:
git clone https://github.com/hustvl/TeViT.git
- Create a conda virtual environment and activate it:
conda create --name tevit python=3.7.7
conda activate tevit
- Install YTVOS Version API from youtubevos/cocoapi:
pip install "git+https://github.com/youtubevos/cocoapi.git#egg=pycocotools&subdirectory=PythonAPI"
- Install Python requirements
torch==1.9.0
torchvision==0.10.0
mmcv==1.4.8
pip install -r requirements.txt
- Please follow Docs to install MMDetection:
python setup.py develop
- Download the YouTube-VIS 2019 dataset from here, and organize the dataset as follows:
TeViT
├── data
│ ├── youtubevis
│ │ ├── train
│ │ │ ├── 003234408d
│ │ │ ├── ...
│ │ ├── val
│ │ │ ├── ...
│ │ ├── annotations
│ │ │ ├── train.json
│ │ │ ├── valid.json
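Before running inference or training, it can help to confirm that the layout above is in place. The following is a minimal sketch using only the standard library; the helper name `missing_entries` and the default root `data/youtubevis` (relative to the repository root) are assumptions based on the tree shown above, not part of this repo's tooling.

```python
from pathlib import Path

# Entries taken from the directory tree above, relative to data/youtubevis.
EXPECTED = [
    "train",
    "val",
    "annotations/train.json",
    "annotations/valid.json",
]


def missing_entries(root="data/youtubevis"):
    """Return the expected dataset entries that do not exist under root."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]
```

Calling `missing_entries()` from the repository root returns an empty list when every expected folder and annotation file is present.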
Inference
python tools/test_vis.py configs/tevit/tevit_msgshift.py $PATH_TO_CHECKPOINT
After inference, the predicted results are stored in results.json. Submit it to the evaluation server to get the final performance.
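A small sketch of a pre-submission step follows: it counts predictions per video and packs the file into a zip archive. It assumes that results.json is a JSON list of prediction records that each carry a `video_id` key, and that the evaluation server accepts a zip archive containing results.json at its top level; check both against the server's instructions. The function name `package_results` is hypothetical.

```python
import json
import zipfile
from collections import Counter
from pathlib import Path


def package_results(results_path, zip_path):
    """Count predictions per video, then zip the results file.

    Assumes results_path holds a JSON list of dicts with a "video_id" key.
    """
    results = json.loads(Path(results_path).read_text())
    per_video = Counter(record["video_id"] for record in results)

    # Store the file under the fixed name "results.json" inside the archive.
    with zipfile.ZipFile(str(zip_path), "w", zipfile.ZIP_DEFLATED) as archive:
        archive.write(str(results_path), arcname="results.json")
    return per_video
```

For example, `package_results("results.json", "submission.zip")` returns a counter mapping each video id to its number of predictions, which gives a quick sanity check that inference covered the validation videos.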
Training
- Download the COCO pretrained QueryInst with PVT-B1 backbone from here.
- Train TeViT with 8 GPUs:
./tools/dist_train.sh configs/tevit/tevit_msgshift.py 8 --no-validate --cfg-options load_from=$PATH_TO_PRETRAINED_WEIGHT
- Train TeViT with multi-scale data augmentation:
./tools/dist_train.sh configs/tevit/tevit_msgshift_mstrain.py 8 --no-validate --cfg-options load_from=$PATH_TO_PRETRAINED_WEIGHT
- The whole training process takes about three hours on 8 Tesla V100 GPUs.
- To train TeViT with a ResNet-50 or Swin-L backbone, please download the COCO pretrained weights from QueryInst.
Acknowledgement :heart:
This code is mainly based on mmdetection and QueryInst. Thanks for their awesome work and great contributions to the computer vision community!
Citation
If you find our paper and code useful in your research, please consider giving a star :star: and citation :pencil: :
@inproceedings{yang2022tevit,
title={Temporally Efficient Vision Transformer for Video Instance Segmentation},
author={Yang, Shusheng and Wang, Xinggang and Li, Yu and Fang, Yuxin and Fang, Jiemin and Liu, Wenyu and Zhao, Xun and Shan, Ying},
booktitle = {Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR)},
year = {2022}
}