SSPVS-PyTorch
PyTorch implementation of "Progressive Video Summarization via Multimodal Self-supervised Learning"
Progressive Video Summarization via Multimodal Self-supervised Learning (SSPVS)
Paper | Supplementary Material
Haopeng Li, Qiuhong Ke, Mingming Gong, Tom Drummond
IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2023
Introduction
We propose a multimodal self-supervised learning framework to obtain semantic representations of videos, which benefits the video summarization task.
Specifically, the self-supervised learning is conducted by exploring the semantic consistency between the videos and text in both coarse-grained and fine-grained fashions, as well as recovering masked frames in the videos.
The multimodal framework is trained on a newly-collected dataset that consists of video-text pairs.
Additionally, we introduce a progressive video summarization method, where the important content in a video is pinpointed progressively to generate better summaries.
Requirements and Dependencies
- python=3.8.13
- pytorch=1.12
- ortools=9.3.10497
- pytorch-lightning=1.6.5
- pytorch-transformers=1.2.0
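A minimal environment setup along these lines should work, assuming conda and pip are available (the exact CUDA build of PyTorch and the package channels are not specified here, so adjust as needed):
$ conda create -n sspvs python=3.8.13      # create the environment with the listed Python version
$ conda activate sspvs
$ pip install torch==1.12.0 pytorch-lightning==1.6.5 pytorch-transformers==1.2.0 ortools==9.3.10497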
Self-supervised Pretraining
Download the pretrained model to the root directory.
OR
Follow the steps below to train the self-supervised model yourself.
Data Preparation
Download the visual features and text information embeddings of the YTVT dataset and uncompress them to ssl/features/ and ssl/info_embed/, respectively.
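For example, assuming the downloaded archives are ZIP files named ytvt_features.zip and ytvt_info_embed.zip (hypothetical names; substitute the actual file names):
$ mkdir -p ssl/features ssl/info_embed
$ unzip ytvt_features.zip -d ssl/features/      # visual features
$ unzip ytvt_info_embed.zip -d ssl/info_embed/  # text information embeddings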
Self-supervised Pretraining
Run the following command in ssl/ to train the self-supervised model:
$ CUDA_VISIBLE_DEVICES=0,1 python main_ssl.py --config ssl.yaml
The trained model is saved in ssl/results/SSL/checkpoints/.
Progressive Video Summarization
Data Preparation
Download the data and uncompress it to data/.
Training and Evaluation of Video Summarization
Run the following command in the root directory to train the video summarization model:
$ sh main.sh CFG_FILE
where CFG_FILE is a configuration file (*.yaml) for different settings. We provide several configuration files in cfgs/. Here is an example for training the model on SumMe in the augmented setting:
$ sh main.sh cfgs/sm_a.yaml
The results of video summarization are recorded in records.csv. If you pretrained the model yourself, set resume in CFG_FILE to the path of the checkpoint saved in ssl/results/SSL/checkpoints/, as sketched below.
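A minimal sketch of this step, assuming the config file stores the path under a top-level resume: key and using a hypothetical checkpoint file name:
$ ls ssl/results/SSL/checkpoints/        # locate the checkpoint produced by pretraining
$ sed -i 's|^resume:.*|resume: ssl/results/SSL/checkpoints/last.ckpt|' cfgs/sm_a.yaml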
Source Data
We provide the original videos and text information of the YTVT dataset here. We also provide the re-collected text information for SumMe and TVSum here.
License and Citation
The use of this code is RESTRICTED to non-commercial research and educational purposes.
If you use this code or reference our paper in your work, please cite this publication as:
@inproceedings{li2023progressive,
title={Progressive Video Summarization via Multimodal Self-supervised Learning},
author={Li, Haopeng and Ke, Qiuhong and Gong, Mingming and Drummond, Tom},
booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
pages={5584--5593},
year={2023}
}
Acknowledgement
The code is developed based on VASNet.