VSUA-Captioning
Code for "Aligning Linguistic Words and Visual Semantic Units for Image Captioning", ACM MM 2019
Aligning Linguistic Words and Visual Semantic Units for Image Captioning
Introduction
The VSUA model represents an image as a structured graph whose nodes are the so-called Visual Semantic Units (VSUs): object, attribute, and relationship units. Our VSUA model exploits the natural alignment between caption words and VSUs.
Citation
If you find this code useful in your research, please cite:
@inproceedings{guo2019vsua,
  title={Aligning Linguistic Words and Visual Semantic Units for Image Captioning},
  author={Longteng Guo and Jing Liu and Jinhui Tang and Jiangwei Li and Wei Luo and Hanqing Lu},
  booktitle={ACM MM},
  year={2019}
}
Requirements
- CUDA-enabled GPU
- Python 2.7 and PyTorch >= 0.4
- Cider (already added as a submodule)
- Optionally:
- coco-caption (already added as a submodule): if you'd like to evaluate BLEU/METEOR/CIDEr scores
- tensorboardX: if you want to visualize the loss histories (requires TensorFlow)
To install all submodules: git clone --recursive https://github.com/ltguo19/VSUA-Captioning.git
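If you have already cloned the repository without --recursive, you can fetch the submodules afterwards with git submodule update --init --recursive.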
Prepare Data
For more details and other datasets, see ruotianluo/self-critical.pytorch.
1. Download COCO captions and preprocess them
Download the preprocessed COCO captions from this link on Karpathy's homepage. Extract dataset_coco.json from the zip file and copy it into data/. This file provides the preprocessed captions and also the standard train-val-test splits.
Then do:
$ python scripts/prepro_labels.py --input_json data/dataset_coco.json --output_json data/cocotalk.json --output_h5 data/cocotalk
prepro_labels.py will map all words that occur <= 5 times to a special UNK token and create a vocabulary for all the remaining words. The image information and vocabulary are dumped into data/cocotalk.json, and the discretized caption data are dumped into data/cocotalk_label.h5.
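If you want to sanity-check the preprocessing output, a minimal sketch like the following can help (it assumes the self-critical.pytorch field names ix_to_word, images, and labels; adjust if your dump differs):

import json
import h5py

# Peek at the preprocessing output. The field names ('ix_to_word', 'images',
# 'labels') follow the self-critical.pytorch preprocessing and are assumptions here.
info = json.load(open('data/cocotalk.json'))
print('vocab size: %d' % len(info['ix_to_word']))
print('num images: %d' % len(info['images']))

with h5py.File('data/cocotalk_label.h5', 'r') as f:
    print('encoded captions: %s' % (f['labels'].shape,))  # (num_captions, max_length)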
2. Download Bottom-Up features
We use pre-extracted bottom-up image features. Download the pre-extracted features from this link (we use the adaptive features in our experiments). For example:
mkdir data/bu_data; cd data/bu_data
wget https://storage.googleapis.com/bottom-up-attention/trainval.zip
unzip trainval.zip
Then:
python scripts/make_bu_data.py --output_dir data/cocobu
This will create data/cocobu_fc, data/cocobu_att, and data/cocobu_box.
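To quickly inspect the dumped features for a single image, a sketch along these lines works (the per-image file layout and the 'feat' key are assumptions based on make_bu_data.py; the image id is hypothetical):

import numpy as np

# One file per COCO image id; region features are stored under the 'feat'
# key of the .npz file. Both points are assumptions based on make_bu_data.py.
image_id = '391895'  # hypothetical COCO image id
fc = np.load('data/cocobu_fc/%s.npy' % image_id)            # average-pooled feature
att = np.load('data/cocobu_att/%s.npz' % image_id)['feat']  # per-region features
box = np.load('data/cocobu_box/%s.npy' % image_id)          # region bounding boxes
print('fc %s, att %s, box %s' % (fc.shape, att.shape, box.shape))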
3. Download image scene graph data
We use the scene graph data from yangxuntu/SGAE. Download the files coco_img_sg.zip and coco_pred_sg_rela.npy from this link, put them into the data folder, and then unzip them. coco_img_sg.zip contains the scene graph data for each image, including the object and attribute labels for each box in the adaptive bottom-up data and the semantic relationship labels between boxes. coco_pred_sg_rela.npy contains the vocabularies for the object, attribute, and relation labels.
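Both files can be peeked at with numpy, e.g. with the rough sketch below; the per-image file naming and the dictionary keys are assumptions, so print them to see what is actually stored:

import numpy as np

# Both files hold Python dictionaries saved with numpy, so .item() is needed
# after np.load. File naming and keys are assumptions based on the description above.
image_id = '391895'  # hypothetical COCO image id
sg = np.load('data/coco_img_sg/%s.npy' % image_id, allow_pickle=True).item()
print('per-image scene graph keys: %s' % list(sg.keys()))

vocab = np.load('data/coco_pred_sg_rela.npy', allow_pickle=True).item()
print('scene-graph vocabulary keys: %s' % list(vocab.keys()))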
4. Extract geometry relationship data
Download the file vsua_box_info.pkl from this link; it contains the size of each box and the width/height of each image.
Then do:
python scripts/cal_geometry_feats.py
python scripts/build_geometry_graph.py
to extract the geometry relation features and build the geometry graph. This will create data/geometry_feats-undirected.pkl and data/geometry-iou0.2-dist0.5-undirected.
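A minimal sketch for inspecting vsua_box_info.pkl (the exact dictionary layout is an assumption; only the box-size and image width/height description above is from the source):

import pickle

# The file was pickled under Python 2.7; when loading under Python 3 you may
# need pickle.load(f, encoding='latin1'). The dictionary layout is an assumption.
with open('data/vsua_box_info.pkl', 'rb') as f:
    box_info = pickle.load(f)

some_id = list(box_info.keys())[0]
print('%s -> %s' % (some_id, box_info[some_id]))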
Overall, the data folder should contain these files/folders:
cocotalk.json # additional information about images and vocab
cocotalk_label.h5 # captions
coco-train-idxs.p # cached token file for cider
cocobu_att # bottom-up feature
cocobu_fc # bottom-up average feature
coco_img_sg # scene graph data
coco_pred_sg_rela.npy # scene graph vocabularies
vsua_box_info.pkl # box sizes and image width/height
geometry-iou0.2-dist0.5-undirected # geometry graph data
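Before training, a quick existence check against this listing can catch missing files (a minimal sketch; names are taken directly from the list above):

import os

# File and folder names are taken directly from the listing above.
expected = [
    'cocotalk.json', 'cocotalk_label.h5', 'coco-train-idxs.p',
    'cocobu_att', 'cocobu_fc', 'coco_img_sg', 'coco_pred_sg_rela.npy',
    'vsua_box_info.pkl', 'geometry-iou0.2-dist0.5-undirected',
]
missing = [name for name in expected if not os.path.exists(os.path.join('data', name))]
print('missing: %s' % (missing if missing else 'none'))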
Training
1. Cross-entropy loss
python train.py --gpus 0 --id experiment-xe --geometry_relation True
The training script will dump checkpoints into the folder specified by --checkpoint_root and --id.
2. Reinforcement learning with CIDEr reward
python train.py --gpus 0 --id experiment-rl --geometry_relation True --learning_rate 5e-5 --resume_from experiment-xe --resume_from_best True --self_critical_after 0 --max_epochs 50
- --gpus specifies the GPU(s) used to run the model.
- --id is the name of this experiment; all information and checkpoints will be dumped into the checkpoint_root/id folder.
- --geometry_relation specifies the type of relationship to use: True uses the geometry relationship, False uses the semantic relationship.
- To resume training, set the --resume_from option to the experiment id you want to resume from, and use --resume_from_best to choose whether to resume from the best-performing checkpoint or the latest one.
- If you have TensorFlow, the loss histories are automatically dumped into checkpoint_root/id and can be visualized with TensorBoard by running sh script/tensorboard.sh.
- If you'd like to evaluate BLEU/METEOR/CIDEr scores during training in addition to the validation cross-entropy loss, use the --language_eval 1 option, but don't forget to download the coco-caption code into the coco-caption directory.
- For more options, see opts.py, and see self-critical.pytorch for more training guidance.
Acknowledgement
This code is modified from Ruotian Luo's brilliant image captioning repo ruotianluo/self-critical.pytorch. We use the visual features provided by peteanderson80/bottom-up-attention and the scene graph data provided by yangxuntu/SGAE. Thanks for their work! If you find this code helpful, please consider citing their corresponding papers as well as our paper.