Delving Deeper into the Decoder for Video Captioning
Table of Contents
- Description
- Requirement
- Manual
- Results
- Comparison on Youtube2Text
- Comparison on MSR-VTT
- Data
- Citation
Description
This repository contains the source code for the paper Delving Deeper into the Decoder for Video Captioning.
The paper has been accepted by ECAI 2020. The encoder-decoder framework is the most popular paradigm for the video captioning task, but some non-negligible problems remain in the decoder of a video captioning model. We propose three methods to improve the performance of the model.
- A combination of variational dropout and layer normalization is embedded into a semantic compositional gated recurrent unit to alleviate the problem of overfitting.
- A unified, flexible method is proposed to evaluate the model performance on a validation set so as to select the best checkpoint for testing.
- A new training strategy called professional learning is proposed, which develops the strong points of a captioning model and bypasses its weaknesses.
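To make the first point more concrete, the following is a minimal NumPy sketch of how variational dropout (one dropout mask sampled per sequence and reused at every time step) and layer normalization can be combined inside a recurrent cell. It uses a plain GRU rather than the semantic compositional GRU from the paper, omits the learned gain and bias of layer normalization, and all function names, shapes and the keep probability are assumptions made for illustration. It is not the code in this repository.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance over the feature axis.

    The learned gain and bias are omitted to keep the sketch short.
    """
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sample_dropout_mask(rng, shape, keep_prob):
    """Sample one inverted-dropout mask; variational dropout reuses it at every step."""
    return rng.binomial(1, keep_prob, size=shape) / keep_prob


def gru_forward(xs, h0, params, x_mask, h_mask):
    """Run a plain GRU over xs with shape (time, batch, input_dim).

    x_mask and h_mask are sampled once per sequence and applied unchanged
    at every time step; gate pre-activations are layer-normalized.
    """
    Wz, Uz, Wr, Ur, Wh, Uh = params
    h = h0
    outputs = []
    for x in xs:
        x_d = x * x_mask  # same input mask at every step
        h_d = h * h_mask  # same recurrent mask at every step
        z = sigmoid(layer_norm(x_d @ Wz + h_d @ Uz))
        r = sigmoid(layer_norm(x_d @ Wr + h_d @ Ur))
        h_tilde = np.tanh(layer_norm(x_d @ Wh + (r * h_d) @ Uh))
        h = (1.0 - z) * h + z * h_tilde
        outputs.append(h)
    return np.stack(outputs)


# Toy dimensions, chosen only for illustration.
rng = np.random.default_rng(0)
T, B, D, H = 5, 2, 4, 3
params = [rng.normal(scale=0.1, size=s) for s in [(D, H), (H, H)] * 3]
xs = rng.normal(size=(T, B, D))
x_mask = sample_dropout_mask(rng, (B, D), keep_prob=0.8)
h_mask = sample_dropout_mask(rng, (B, H), keep_prob=0.8)
hs = gru_forward(xs, np.zeros((B, H)), params, x_mask, h_mask)
print(hs.shape)  # (5, 2, 3)
```

At test time the masks would simply be replaced by arrays of ones, since inverted dropout already rescales the kept units during training.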
Experiments on the MSVD and MSR-VTT datasets demonstrate that our model achieves the best results as evaluated by the BLEU, CIDEr, METEOR and ROUGE-L metrics, with significant gains of up to 11.7% on MSVD and 5% on MSR-VTT compared with the previous state-of-the-art models.
If you need more information about how to generate training, validation and test data for the datasets, please refer to Semantics-AssistedVideoCaptioning.
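As a rough illustration of the second contribution listed above (selecting the best checkpoint from validation performance), several metrics can be folded into one score by dividing each metric by its best value across checkpoints and taking a weighted sum. The sketch below assumes this normalization scheme; it may differ from the exact formula in the paper, the function name and weights are hypothetical, and the scores are placeholder values rather than results from this repository.

```python
def select_best_checkpoint(val_scores, weights=None):
    """Pick the checkpoint with the highest normalized, weighted metric sum.

    val_scores maps checkpoint name -> {metric name -> validation score}.
    Each metric is divided by its best value over all checkpoints so that
    metrics on different scales contribute comparably. All best values are
    assumed to be positive.
    """
    metrics = sorted({m for scores in val_scores.values() for m in scores})
    if weights is None:
        weights = {m: 1.0 for m in metrics}  # equal weights by default
    best = {m: max(scores[m] for scores in val_scores.values()) for m in metrics}
    overall = {
        name: sum(weights[m] * scores[m] / best[m] for m in metrics)
        for name, scores in val_scores.items()
    }
    best_name = max(overall, key=overall.get)
    return best_name, overall


# Placeholder numbers for illustration only, not measured results.
val_scores = {
    "epoch_1": {"BLEU-4": 0.40, "CIDEr": 0.80, "METEOR": 0.30, "ROUGE-L": 0.60},
    "epoch_2": {"BLEU-4": 0.50, "CIDEr": 0.90, "METEOR": 0.32, "ROUGE-L": 0.64},
    "epoch_3": {"BLEU-4": 0.45, "CIDEr": 1.00, "METEOR": 0.31, "ROUGE-L": 0.62},
}
best_name, overall = select_best_checkpoint(val_scores)
print(best_name)  # epoch_2
```

Changing the `weights` dictionary shifts the selection toward whichever metric matters most for a given application.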
Requirement
- Python 3.6
- TensorFlow-GPU 1.13
- pycocoevalcap (Python3)
- NumPy
Manual
- Make sure you have installed all the required packages.
- Download files in the Data section.
- `cd path_to_directory_of_model; mkdir saves`
- `run_model.sh` is used for training or testing models. Specify the GPU you want to use by modifying the `CUDA_VISIBLE_DEVICES` value. `name` will be used in the name of the saved model during training. Specify the needed data paths by modifying the `corpus`, `ecores`, `tag` and `ref` values. `test` refers to the path of the saved model to be tested. Do not give a parameter to `test` if you want to train a model.
- After completing the configuration of the bash file, run `bash run_model.sh` for training or testing.
Results
Comparison on Youtube2Text

Comparison on MSR-VTT

Data
MSVD
- MSVD dataset and features:
GoogleDrive
- SHA-256 ca86eb2b90e302a4b7f3197065cad3b9be5285905952b95dbffb61cb0bf79e9c
- Model Checkpoint:
GoogleDrive
- SHA-256 64089a49fe9de895c9805a85d50160404cb36ccb8c22a70a32fc7ef5a2abfff1
MSRVTT
- MSRVTT dataset and features:
GoogleDrive
- SHA-256 611b297c4fbbdd58540373986453a991f285aed6cc18914ad930e1e7646f26fb
- Model Checkpoint:
GoogleDrive
- SHA-256 fb04fd2d29900f7f8a712b6d2352e8227acd30173274b64a38fcea6a608e4a8e
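The SHA-256 values listed above can be used to check that a download is complete and uncorrupted. The following is a small sketch using Python's standard `hashlib` module; the helper names and the file name in the usage comment are hypothetical and should be replaced with whatever you saved the archive as.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read 1 MiB at a time so large archives do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_hex):
    """Return True if the file's SHA-256 digest matches the expected value."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Example usage (the file name below is hypothetical):
# ok = verify_download(
#     "msvd_data.zip",
#     "ca86eb2b90e302a4b7f3197065cad3b9be5285905952b95dbffb61cb0bf79e9c",
# )
# print("checksum OK" if ok else "checksum mismatch, download again")
```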
Citation
@article{chen2020delving,
title={Delving Deeper into the Decoder for Video Captioning},
author={Haoran Chen and Jianmin Li and Xiaolin Hu},
journal={CoRR},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2001.05614},
eprint={2001.05614},
year={2020}
}
