VSUA-Captioning
Code for "Aligning Linguistic Words and Visual Semantic Units for Image Captioning", ACM MM 2019
Aligning Linguistic Words and Visual Semantic Units for Image Captioning
Introduction
The VSUA model represents an image as a structured graph whose nodes are the so-called Visual Semantic Units (VSUs): object, attribute, and relationship units. Our VSUA model exploits the natural alignment between caption words and VSUs.
Citation
If you find this code useful in your research, please cite:
@inproceedings{guo2019vsua,
  title={Aligning Linguistic Words and Visual Semantic Units for Image Captioning},
  author={Longteng Guo and Jing Liu and Jinhui Tang and Jiangwei Li and Wei Luo and Hanqing Lu},
  booktitle={ACM MM},
  year={2019}
}
Requirements
- CUDA-enabled GPU
- Python 2.7 and PyTorch >= 0.4
- Cider (already added as a submodule)
- Optionally:
- coco-caption (already added as a submodule): if you'd like to evaluate BLEU/METEOR/CIDEr scores
- tensorboardX: if you want to visualize the loss histories (requires TensorFlow)
To install all submodules: git clone --recursive https://github.com/ltguo19/VSUA-Captioning.git
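If you have already cloned the repository without --recursive, you can fetch the submodules afterwards with git submodule update --init --recursive.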
Prepare Data
For more details and other datasets, see ruotianluo/self-critical.pytorch.
1. Download COCO captions and preprocess them
Download the preprocessed COCO captions from this link on Karpathy's homepage. Extract dataset_coco.json from the zip file and copy it into data/. This file provides the preprocessed captions and also the standard train-val-test splits.
Then do:
$ python scripts/prepro_labels.py --input_json data/dataset_coco.json --output_json data/cocotalk.json --output_h5 data/cocotalk
prepro_labels.py will map all words that occur <= 5 times to a special UNK token and create a vocabulary for all the remaining words. The image information and vocabulary are dumped into data/cocotalk.json, and the discretized caption data are dumped into data/cocotalk_label.h5.
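If you want to sanity-check the preprocessing output, a minimal sketch like the following can help (it assumes the self-critical.pytorch field names ix_to_word, images, and labels; adjust if your dump differs):

import json
import h5py

# Peek at the preprocessing output. The field names ('ix_to_word', 'images',
# 'labels') follow the self-critical.pytorch preprocessing and are assumptions here.
info = json.load(open('data/cocotalk.json'))
print('vocab size: %d' % len(info['ix_to_word']))
print('num images: %d' % len(info['images']))

with h5py.File('data/cocotalk_label.h5', 'r') as f:
    print('encoded captions: %s' % (f['labels'].shape,))  # (num_captions, max_length)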
2. Download Bottom-Up features
We use pre-extracted bottom-up image features. Download the pre-extracted features from this link (we use the adaptive features in our experiments). For example:
mkdir data/bu_data; cd data/bu_data
wget https://storage.googleapis.com/bottom-up-attention/trainval.zip
unzip trainval.zip
Then:
python scripts/make_bu_data.py --output_dir data/cocobu
This will create data/cocobu_fc, data/cocobu_att, and data/cocobu_box.
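To quickly inspect the dumped features for a single image, a sketch along these lines works (the per-image file layout and the 'feat' key are assumptions based on make_bu_data.py; the image id is hypothetical):

import numpy as np

# One file per COCO image id; region features are stored under the 'feat'
# key of the .npz file. Both points are assumptions based on make_bu_data.py.
image_id = '391895'  # hypothetical COCO image id
fc = np.load('data/cocobu_fc/%s.npy' % image_id)            # average-pooled feature
att = np.load('data/cocobu_att/%s.npz' % image_id)['feat']  # per-region features
box = np.load('data/cocobu_box/%s.npy' % image_id)          # region bounding boxes
print('fc %s, att %s, box %s' % (fc.shape, att.shape, box.shape))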
3. Download image scene graph data
We use the scene graph data from yangxuntu/SGAE. Download the files coco_img_sg.zip and coco_pred_sg_rela.npy from this link, put them into the data folder, and then unzip them. coco_img_sg.zip contains the scene graph data for each image, including the object and attribute labels for each box in the adaptive bottom-up data and the semantic relationship labels between boxes. coco_pred_sg_rela.npy contains the vocabularies for the object, attribute, and relation labels.
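Both files can be peeked at with numpy, e.g. with the rough sketch below; the per-image file naming and the dictionary keys are assumptions, so print them to see what is actually stored:

import numpy as np

# Both files hold Python dictionaries saved with numpy, so .item() is needed
# after np.load. File naming and keys are assumptions based on the description above.
image_id = '391895'  # hypothetical COCO image id
sg = np.load('data/coco_img_sg/%s.npy' % image_id, allow_pickle=True).item()
print('per-image scene graph keys: %s' % list(sg.keys()))

vocab = np.load('data/coco_pred_sg_rela.npy', allow_pickle=True).item()
print('scene-graph vocabulary keys: %s' % list(vocab.keys()))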
4. Extract geometry relationship data
Download the file vsua_box_info.pkl from this link; it contains the size of each box and the width/height of each image.
Then do:
python scripts/cal_geometry_feats.py
python scripts/build_geometry_graph.py
to extract the geometry relation features and build the geometry graph. This will create data/geometry_feats-undirected.pkl and data/geometry-iou0.2-dist0.5-undirected.
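A minimal sketch for inspecting vsua_box_info.pkl (the exact dictionary layout is an assumption; only the box-size and image width/height description above is from the source):

import pickle

# The file was pickled under Python 2.7; when loading under Python 3 you may
# need pickle.load(f, encoding='latin1'). The dictionary layout is an assumption.
with open('data/vsua_box_info.pkl', 'rb') as f:
    box_info = pickle.load(f)

some_id = list(box_info.keys())[0]
print('%s -> %s' % (some_id, box_info[some_id]))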
Overall, the data folder should contain these files/folders:
cocotalk.json # additional information about images and vocab
cocotalk_label.h5 # captions
coco-train-idxs.p # cached token file for cider
cocobu_att # bottom-up feature
cocobu_fc # bottom-up average feature
coco_img_sg # scene graph data
coco_pred_sg_rela.npy # scene graph vocabularies
vsua_box_info.pkl # box sizes and image width/height
geometry-iou0.2-dist0.5-undirected # geometry graph data
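Before training, a quick existence check against this listing can catch missing files (a minimal sketch; names are taken directly from the list above):

import os

# File and folder names are taken directly from the listing above.
expected = [
    'cocotalk.json', 'cocotalk_label.h5', 'coco-train-idxs.p',
    'cocobu_att', 'cocobu_fc', 'coco_img_sg', 'coco_pred_sg_rela.npy',
    'vsua_box_info.pkl', 'geometry-iou0.2-dist0.5-undirected',
]
missing = [name for name in expected if not os.path.exists(os.path.join('data', name))]
print('missing: %s' % (missing if missing else 'none'))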
Training
1. Cross-entropy loss
python train.py --gpus 0 --id experiment-xe --geometry_relation True
The training script will dump checkpoints into the folder specified by --checkpoint_root and --id.
2. Reinforcement learning with CIDEr reward
python train.py --gpus 0 --id experiment-rl --geometry_relation True --learning_rate 5e-5 --resume_from experiment-xe --resume_from_best True --self_critical_after 0 --max_epochs 50
- --gpus specifies the GPU(s) used to run the model.
- --id is the name of this experiment; all information and checkpoints will be dumped into the checkpoint_root/id folder.
- --geometry_relation specifies the type of relationship to use: True uses the geometry relationship, False uses the semantic relationship.
- To resume training, set the --resume_from option to the experiment id you want to resume from, and use --resume_from_best to choose whether to resume from the best-performing checkpoint or the latest one.
- If you have TensorFlow, the loss histories are automatically dumped into checkpoint_root/id and can be visualized with TensorBoard by running sh script/tensorboard.sh.
- If you'd like to evaluate BLEU/METEOR/CIDEr scores during training in addition to the validation cross-entropy loss, use the --language_eval 1 option, but don't forget to download the coco-caption code into the coco-caption directory.
- For more options, see opts.py, and see self-critical.pytorch for more training guidance.
Acknowledgement
This code is modified from Ruotian Luo's brilliant image captioning repo ruotianluo/self-critical.pytorch. We use the visual features provided by peteanderson80/bottom-up-attention and the scene graph data provided by yangxuntu/SGAE. Thanks for their work! If you find this code helpful, please consider citing their corresponding papers as well as our paper.