image-captioning
Image captioning models "show and tell" + "show, attend and tell" in PyTorch
Implementations of image captioning models in PyTorch, with several types of attention mechanisms supported. Currently only pretrained ResNet152 and VGG16 with batch normalization are provided as encoders.
Models supported:
FC from "show and tell"
Att2all from "show, attend and tell"
Att2in from "Self-critical Sequence Training for Image Captioning"
Spatial attention from "Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning"
Adaptive attention from "Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning"
Captions are evaluated via capeval/, which is derived from tylin/coco-caption with minor changes for better Python 3 support.
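For orientation, the attention-based models above all share the same soft-attention core: score each spatial image feature against the current decoder state, normalize with a softmax, and take the weighted sum as the context vector. The sketch below illustrates this pattern in PyTorch; it is not the repository's code, and all module names and dimensions are placeholders.

```python
import torch
import torch.nn as nn

class SoftSpatialAttention(nn.Module):
    """Additive (Bahdanau-style) attention over spatial image features.

    Illustrative only: layer names and sizes are placeholders, not this
    repository's actual module definitions.
    """
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)      # project image regions
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)  # project decoder state
        self.score = nn.Linear(attn_dim, 1)                 # scalar score per region

    def forward(self, features, hidden):
        # features: (batch, num_regions, feat_dim), e.g. 7x7 = 49 ResNet regions
        # hidden:   (batch, hidden_dim), current LSTM hidden state
        scores = self.score(torch.tanh(
            self.feat_proj(features) + self.hidden_proj(hidden).unsqueeze(1)
        )).squeeze(-1)                                       # (batch, num_regions)
        alpha = torch.softmax(scores, dim=1)                 # attention weights
        context = (alpha.unsqueeze(-1) * features).sum(dim=1)  # weighted context vector
        return context, alpha
```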
Requirements
- MSCOCO original dataset: please put the images in one directory, e.g. COCO2014/, and modify COCO_ROOT in configs.py accordingly; you can get them here:
- Instead of a random split, Karpathy's split is required; please put it in the COCO_PATH
- PyTorch v0.3.1 or newer with GPU support.
- TensorBoardX
Usage
1. Preprocessing
First, preprocess the images and store them locally. You can specify individual phases if parallel processing is required. All preprocessed images are stored in HDF5 databases under COCO_ROOT.
python preprocess.py
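The exact database layout is whatever preprocess.py writes; as a rough sketch of the idea (the file name, dataset name, and image size below are assumptions, not the script's actual choices), images are resized and packed into a single HDF5 file with h5py:

```python
import h5py
import numpy as np
from PIL import Image

# Hypothetical paths and names; the real layout is defined by preprocess.py.
image_paths = ["COCO2014/train2014/COCO_train2014_000000000009.jpg"]  # example image

with h5py.File("COCO2014/train_images.h5", "w") as h5:
    dset = h5.create_dataset("images", (len(image_paths), 3, 224, 224), dtype="uint8")
    for i, path in enumerate(image_paths):
        img = Image.open(path).convert("RGB").resize((224, 224))
        dset[i] = np.asarray(img).transpose(2, 0, 1)  # HWC -> CHW
```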
2. Extract image features
Extract the image features offline with the encoder and store them locally. Currently only ResNet152 and VGG16 with batch normalization are supported.
python extract.py --pretrained=resnet --batch_size=10 --gpu=0
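Conceptually, this step runs each image through the pretrained CNN with its classification head removed and saves the resulting feature maps. Below is a minimal torchvision sketch of that idea; extract.py may differ in the layers kept, the input size, and the on-disk format.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

# ResNet152 without its final avgpool/fc: 2048 x 7 x 7 feature maps for a 224x224 input.
resnet = models.resnet152(pretrained=True)
encoder = nn.Sequential(*list(resnet.children())[:-2]).eval().cuda()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0).cuda()
with torch.no_grad():
    features = encoder(img)                          # (1, 2048, 7, 7)
features = features.flatten(2).transpose(1, 2)       # (1, 49, 2048) regions for attention
```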
3. Training the model
Training can be performed only after the image features have been extracted.
To train on the full dataset, set train_size to -1.
Immediate evaluation with beam search after training is also available; set the evaluation flag to true.
The scores are stored in scores/
python train.py --train_size=100 --val_size=10 --test=10 --epoch=30 --verbose=10 --learning_rate=1e-3 --batch_size=10 --gpu=0 --pretrained=resnet --attention=none --evaluation=true
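For reference, the core of this step is standard teacher-forced training of a caption decoder over the pre-extracted features with a cross-entropy loss. The self-contained toy sketch below uses a deliberately tiny decoder and random data in place of the repository's actual classes and dataloader:

```python
import torch
import torch.nn as nn

PAD = 0  # hypothetical padding-token id

class TinyDecoder(nn.Module):
    """Tiny LSTM caption decoder over pre-extracted features (illustration only)."""
    def __init__(self, vocab_size, feat_dim=2048, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden, padding_idx=PAD)
        self.init_h = nn.Linear(feat_dim, hidden)    # image features -> initial LSTM state
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, features, tokens):
        # features: (B, regions, feat_dim); tokens: (B, T) ground-truth prefix (teacher forcing)
        h0 = torch.tanh(self.init_h(features.mean(dim=1))).unsqueeze(0)
        c0 = torch.zeros_like(h0)
        out_seq, _ = self.lstm(self.embed(tokens), (h0, c0))
        return self.out(out_seq)                     # (B, T, vocab) next-word logits

# Dummy batch standing in for the real dataloader: 4 images, 49 regions each, captions of length 12.
vocab_size = 1000
decoder = TinyDecoder(vocab_size)
features = torch.randn(4, 49, 2048)
captions = torch.randint(2, vocab_size, (4, 12))

criterion = nn.CrossEntropyLoss(ignore_index=PAD)
optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-3)

logits = decoder(features, captions[:, :-1])         # predict token t+1 from tokens <= t
loss = criterion(logits.reshape(-1, vocab_size), captions[:, 1:].reshape(-1))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```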
4. Offline evaluation
After training finishes, an offline evaluation can be performed.
All generated captions are stored in results/
python evaluation.py --train_size=100 --test_size=10 --num=3 --batch_size=10 --gpu=10 --pretrained=resnet --attention=none --encoder=<path_to_encoder> --decoder=<path_to_decoder>
Note that train_size must match the number of images used for training.
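Scoring itself is handled by capeval/ as described above. If you just want a quick sanity check of a few generated captions outside that pipeline, a corpus BLEU score can be computed with NLTK (its smoothing and tokenization differ from coco-caption, so the numbers may not match exactly):

```python
from nltk.translate.bleu_score import corpus_bleu

# Toy example: each generated caption is scored against its set of reference captions.
references = [[["a", "man", "riding", "a", "horse"],
               ["a", "person", "rides", "a", "brown", "horse"]]]
hypotheses = [["a", "man", "rides", "a", "horse"]]

bleu4 = corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25))
print(f"BLEU-4: {bleu4:.3f}")
```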
5. Visualize attention weights
For models with attention.
python show_attention.py --phase=test --pretrained=resnet --train_size=-1 --val_size=-1 --test_size=-1 --num=10 --encoder=<path_to_encoder> --decoder=<path_to_decoder> --gpu=0
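The script overlays the per-word attention weights on the input image. The sketch below shows the general recipe with matplotlib, using dummy data in place of the model's real outputs (the 7x7 attention map and 224x224 image size are assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

# Placeholder data: a real run would use the decoder's attention weights from show_attention.py.
image = np.asarray(Image.open("example.jpg").convert("RGB").resize((224, 224)))
words = ["a", "dog", "on", "grass"]
alphas = np.random.rand(len(words), 7, 7)            # dummy per-word attention maps

fig, axes = plt.subplots(1, len(words), figsize=(3 * len(words), 3))
for ax, word, alpha in zip(axes, words, alphas):
    ax.imshow(image)
    # Upsample the coarse map to the image size via bilinear interpolation and blend it in.
    ax.imshow(alpha, cmap="jet", alpha=0.5, extent=(0, 224, 224, 0),
              interpolation="bilinear")
    ax.set_title(word)
    ax.axis("off")
plt.tight_layout()
plt.show()
```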
Results
Good captions
Okay captions
Bad captions
Attention
Good results
Bad results
Performance
Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | CIDEr |
---|---|---|---|---|---|
Baseline (Nearest neighbor) | 0.48 | 0.281 | 0.166 | 0.1 | 0.383 |
FC | 0.720 | 0.536 | 0.388 | 0.286 | 0.805 |
Att2in | 0.732 | 0.553 | 0.402 | 0.296 | 0.837 |
Att2all | 0.732 | 0.554 | 0.403 | 0.296 | 0.838 |
Spatial attention | 0.725 | 0.537 | 0.389 | 0.287 | 0.812 |
Adaptive attention | 0.716 | 0.524 | 0.379 | 0.278 | 0.808 |
NeuralTalk2 | 0.625 | 0.45 | 0.321 | 0.23 | 0.66 |
Show and Tell | 0.666 | 0.461 | 0.329 | 0.27 | - |
Show, Attend and Tell | 0.707 | 0.492 | 0.344 | 0.243 | - |
Adaptive Attention | 0.742 | 0.580 | 0.439 | 0.266 | 1.085 |
Neural Baby Talk | 0.755 | - | - | 0.347 | 1.072 |
Best models:
Model | train_size | test_size | learning_rate | weight_decay | batch_size | beam_size | dropout |
---|---|---|---|---|---|---|---|
FC | -1 | -1 | 2e-4 | 0 | 512 | 7 | 0 |
Att2in | -1 | -1 | 5e-4 | 1e-4 | 256 | 7 | 0 |
Att2all | -1 | -1 | 5e-4 | 1e-4 | 256 | 7 | 0 |
Spatial attention | -1 | -1 | 2e-4 | 1e-4 | 256 | 7 | 0 |
Adaptive attention | -1 | -1 | 2e-4 | 1e-4 | 256 | 7 | 0 |