S2-Transformer
[IJCAI 2022] Official Pytorch code for paper “S2 Transformer for Image Captioning”
S2 Transformer for Image Captioning [IJCAI 2022]
Official code implementation for the paper S2 Transformer for Image Captioning
Pengpeng Zeng, Haonan Zhang, Jingkuan Song, and Lianli Gao
Table of Contents
- Environment setup
- Data Preparation
- Training
- Evaluation
- Reference and Citation
- Acknowledgements
Environment setup
Clone this repository and create the m2release conda environment using the environment.yml file:
conda env create -f environment.yaml
conda activate m2release
Then download spacy data by executing the following command:
python -m spacy download en_core_web_md
[!NOTE] Python 3 is required to run our code. If you run into network problems, download the en_core_web_md library from here, unzip it, and place it in /your/anaconda/path/envs/m2release/lib/python*/site-packages/
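To confirm that the spaCy model is visible from the active environment before training, a quick check along the following lines may help. This is a minimal sketch using only the standard library; the helper name `spacy_model_installed` is hypothetical and not part of this repository.

```python
import importlib.util


def spacy_model_installed(package_name: str = "en_core_web_md") -> bool:
    """Return True if the given package can be found on the current Python path.

    spaCy models are installed as regular Python packages, so looking the
    name up with importlib is enough; nothing is imported or loaded here.
    """
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    if spacy_model_installed():
        print("en_core_web_md found")
    else:
        print("en_core_web_md missing: run `python -m spacy download en_core_web_md`")
```

Run it inside the activated m2release environment, since the lookup only sees packages installed for the interpreter that executes it.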
Data Preparation
- Annotation. Download the annotation file m2_annotations [1]. Extract it and put it in the project root directory.
- Feature. Download the processed ResNeXt-101 and ResNeXt-152 image features [2] (code 9vtB) and put them in the project root directory.
Update: Image features on OneDrive
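Before launching training, it can be useful to verify that both downloads ended up in the project root. The sketch below is only an illustration: the function name `missing_data_paths` is hypothetical, the extracted annotation folder is assumed to be called `m2_annotations`, and the feature file name must be supplied by you because it depends on which archive you downloaded.

```python
from pathlib import Path


def missing_data_paths(project_root, features_name, annotation_dir="m2_annotations"):
    """Return the expected data paths that do not exist under project_root.

    features_name  -- name of the downloaded feature file (supplied by the caller)
    annotation_dir -- name of the extracted annotation folder (assumed default)
    """
    root = Path(project_root)
    expected = [root / features_name, root / annotation_dir]
    # Keep only the entries that are absent on disk.
    return [str(path) for path in expected if not path.exists()]
```

An empty list means both the feature file and the annotation folder were found; otherwise the returned paths show what still needs to be downloaded or moved.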
Training
Run python train_transformer.py using the following arguments:
Argument | Possible values |
---|---|
--exp_name | Experiment name |
--batch_size | Batch size (default: 50) |
--workers | Number of workers; speeds up model training in the XE stage. |
--head | Number of heads (default: 8) |
--resume_last | If used, the training will be resumed from the last checkpoint. |
--resume_best | If used, the training will be resumed from the best checkpoint. |
--features_path | Path to visual features file (h5py) |
--annotation_folder | Path to annotations |
--num_clusters | Number of pseudo regions |
For example, to train the model, run the following command:
python train_transformer.py --exp_name S2 --batch_size 50 --m 40 --head 8 --features_path /path/to/features --num_clusters 5
or just run:
bash train.sh
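For readers who want to see how the documented arguments fit together, here is a minimal argparse sketch. It is not the parser from train_transformer.py: only the defaults for `--batch_size` (50) and `--head` (8) come from the table above, all other defaults are assumptions, and the `--m` flag that appears in the example command is left out because it is not described in the table.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser mirroring the documented training arguments."""
    parser = argparse.ArgumentParser(description="S2 Transformer training (sketch)")
    parser.add_argument("--exp_name", type=str, default="S2")    # default assumed
    parser.add_argument("--batch_size", type=int, default=50)    # documented default
    parser.add_argument("--workers", type=int, default=0)        # default assumed
    parser.add_argument("--head", type=int, default=8)           # documented default
    parser.add_argument("--resume_last", action="store_true")    # flag, off unless given
    parser.add_argument("--resume_best", action="store_true")    # flag, off unless given
    parser.add_argument("--features_path", type=str)             # path to the h5py feature file
    parser.add_argument("--annotation_folder", type=str)         # path to the annotations
    parser.add_argument("--num_clusters", type=int, default=5)   # default assumed
    return parser


# Parse the documented flags from the example command (without --m).
args = build_parser().parse_args([
    "--exp_name", "S2",
    "--batch_size", "50",
    "--head", "8",
    "--features_path", "/path/to/features",
    "--num_clusters", "5",
])
print(args)
```

The two resume flags are modelled as boolean switches because the table describes them as "if used", which suggests they take no value.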
[!NOTE] We use torch.distributed to train our model; you can set worldSize in train_transformer.py to determine the number of GPUs used for training.
Evaluation
Offline Evaluation.
Run python test_transformer.py to evaluate the model using the following arguments:
python test_transformer.py --batch_size 10 --features_path /path/to/features --model_path /path/to/saved_transformer_models/ckpt --num_clusters 5
[!TIP] We removed the SPICE evaluation metric from training because it is time-consuming. You can add it back when evaluating the model: download this file and put it in /path/to/evaluation/, then uncomment the corresponding code in __init__.py.
We provide a checkpoint here; evaluating it should give the following results (second row):
Model | B@1 | B@4 | M | R | C | S |
---|---|---|---|---|---|---|
Our Paper (ResNext101) | 81.1 | 39.6 | 29.6 | 59.1 | 133.5 | 23.2 |
Reproduced Model (ResNext101) | 81.2 | 39.9 | 29.6 | 59.1 | 133.7 | 23.3 |
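The gap between the two rows can be computed directly from the table. The snippet below is a small worked example using only the numbers reported above; the function name `metric_deltas` is hypothetical.

```python
METRICS = ["B@1", "B@4", "M", "R", "C", "S"]
paper = [81.1, 39.6, 29.6, 59.1, 133.5, 23.2]       # "Our Paper (ResNext101)" row
reproduced = [81.2, 39.9, 29.6, 59.1, 133.7, 23.3]  # "Reproduced Model (ResNext101)" row


def metric_deltas(reference, candidate, names=METRICS):
    """Return candidate minus reference for each metric, rounded to one decimal."""
    return {name: round(c - r, 1) for name, r, c in zip(names, reference, candidate)}


deltas = metric_deltas(paper, reproduced)
for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}")
```

The reproduced checkpoint matches or slightly exceeds the paper on every metric, with the largest difference being +0.3 on B@4.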
Online Evaluation
We also report the performance of our model on the online COCO test server with an ensemble of four S2 models. The detailed online test code can be obtained in this repo.
Reference and Citation
Reference
[1] Cornia, M., Stefanini, M., Baraldi, L., & Cucchiara, R. (2020). Meshed-memory transformer for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[2] Xuying Zhang, Xiaoshuai Sun, Yunpeng Luo, Jiayi Ji, Yiyi Zhou, Yongjian Wu, Feiyue Huang, and Rongrong Ji. RSTNet: Captioning with adaptive attention on visual and non-visual words. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15465–15474, 2021.
Citation
@inproceedings{S2,
author = {Pengpeng Zeng* and
Haonan Zhang* and
Jingkuan Song and
Lianli Gao},
title = {S2 Transformer for Image Captioning},
booktitle = {IJCAI},
pages = {1608--1614},
year = {2022}
}
Acknowledgements
Thanks to Zhang et al. for releasing the visual features (ResNeXt-101 and ResNeXt-152). Our code implementation is also based on their repo.
Thanks also to M2 Transformer for the original annotations and to grid-feats-vqa for the effective visual representations.