
X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers (EMNLP 2020)

PyTorch code for the EMNLP 2020 paper "X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers".

Summary

Recent multi-modal transformers have achieved state-of-the-art performance on a variety of multi-modal discriminative tasks like visual question answering, as well as generative tasks like image captioning. This raises an interesting question: can these models go the other way and generate images from pieces of text? Our analysis of a popular representative from this model family, LXMERT, finds that it is unable to generate rich and semantically meaningful imagery with its current training setup. We introduce X-LXMERT, an extension of LXMERT with training refinements. X-LXMERT's image generation capabilities rival state-of-the-art generative models, while its question answering and captioning abilities remain comparable to LXMERT's.

Demo

Try out the AI2 Computer Vision Explorer Demo!

Install

  • Python packages
conda create -n xlxmert python=3.7
conda activate xlxmert
cd x-lxmert
pip install -r requirements.txt
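
After installing, a quick sanity check confirms that PyTorch sees the GPU (a minimal sketch, assuming a CUDA-capable machine is intended for feature extraction and training):

# Verify the environment before extracting features or training.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))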

Code structure

# Store images, features, and annotations
./datasets
    COCO/
        images/
        features/
    VG/
        images/
        features/
    GQA/
        images/
        features/
    nlvr2/
        images/
        features/
    data/               <= Store text annotations (*.json) for each split
        lxmert/
        vqa/
        gqa/
        nlvr2/

# Run feature extraction and k-means clustering
./feature_extraction

# Train image generator
./image_generator
    snap/       <= Store image generator checkpoints
    scripts/    <= Bash scripts for training image generator

# Train X-LXMERT
./x-lxmert
    src/
        lxrt/           <= X-LXMERT model class implementation (inherits huggingface transformers' LXMERT class; see the sketch below)
        pretrain/       <= X-LXMERT Pretraining
        tasks/          <= Fine-tuning on downstream tasks (VQA, GQA, NLVR2, Image generation)
    snap/       <= Store X-LXMERT checkpoints
    scripts/    <= Bash scripts for pretraining, fine-tuning, and image generation
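
As noted in the tree above, the X-LXMERT model class in ./x-lxmert/src/lxrt extends huggingface transformers' LXMERT implementation. The base interface it builds on looks like this minimal sketch, which uses the stock LxmertModel with random tensors standing in for real grid features (this is vanilla LXMERT, not the X-LXMERT checkpoint):

import torch
from transformers import LxmertModel, LxmertTokenizer

tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
model = LxmertModel.from_pretrained("unc-nlp/lxmert-base-uncased")

inputs = tokenizer("A cat sitting on a couch", return_tensors="pt")
# LXMERT expects visual features and box positions alongside the text;
# random tensors stand in for real image features here.
visual_feats = torch.randn(1, 36, 2048)  # (batch, regions, feature_dim)
visual_pos = torch.rand(1, 36, 4)        # normalized box coordinates

outputs = model(**inputs, visual_feats=visual_feats, visual_pos=visual_pos)
print(outputs.language_output.shape, outputs.vision_output.shape)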

Feature extraction

Please check out ./feature_extraction to download pre-extracted features and for more details.

cd ./feature_extraction

# For Pretraining / VQA
python coco_extract_grid_feature.py --split train
python coco_extract_grid_feature.py --split valid
python coco_extract_grid_feature.py --split test

# For Pretraining
python VG_extract_grid_feature.py

# For GQA
python GQA_extract_grid_feature.py

# For NLVR2
python nlvr2_extract_grid_feature.py --split train
python nlvr2_extract_grid_feature.py --split valid
python nlvr2_extract_grid_feature.py --split test

# K-Means clustering
python run_kmeans.py --src mscoco_train --tgt mscoco_train mscoco_valid vg
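
Conceptually, run_kmeans.py fits a visual vocabulary over the extracted grid features and quantizes each grid cell to its nearest centroid. A minimal sketch of that idea (assuming features arrive as a (num_grid_cells, 2048) array; the script's actual I/O format, splits, and centroid count may differ):

import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Stand-in for extracted grid features: (num_grid_cells, feature_dim).
# The real run clusters features from the full training split.
features = np.random.randn(20_000, 2048).astype(np.float32)

# Fit the visual vocabulary; 1000 centroids keeps this sketch fast
# (the assumed full-size vocabulary is much larger).
kmeans = MiniBatchKMeans(n_clusters=1000, batch_size=4096, n_init=3)
kmeans.fit(features)

# Quantize target-split features to nearest-centroid ids ("visual words").
cluster_ids = kmeans.predict(features)
np.save("centroids.npy", kmeans.cluster_centers_)
np.save("cluster_ids.npy", cluster_ids)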

Pretraining

Pretrain on LXMERT Pretraining data

cd ./x-lxmert/
bash scripts/pretrain.bash

or download the pretrained checkpoint:

wget -O x-lxmert/snap/pretrained/x_lxmert/Epoch20_LXRT.pth https://ai2-vision-x-lxmert.s3-us-west-2.amazonaws.com/x-lxmert/Epoch20_LXRT.pth
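
Once downloaded, the checkpoint can be inspected before fine-tuning. A minimal sketch, assuming the .pth file holds a plain PyTorch state dict (if it wraps the weights in an outer dict, index into that first):

import torch

# Load on CPU so inspection works without a GPU.
state_dict = torch.load(
    "x-lxmert/snap/pretrained/x_lxmert/Epoch20_LXRT.pth", map_location="cpu"
)
print(len(state_dict), "entries")
for name, value in list(state_dict.items())[:5]:
    shape = tuple(value.shape) if hasattr(value, "shape") else value
    print(name, shape)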

Fine-tuning

VQA

cd ./x-lxmert/
bash scripts/finetune_vqa.bash
bash scripts/test_vqa.bash
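
For reference, the plain LXMERT VQA head that this fine-tuning parallels is exposed through huggingface transformers. A minimal sketch of QA inference with the stock unc-nlp/lxmert-vqa-uncased weights and random visual features (this is vanilla LXMERT, not the X-LXMERT checkpoint produced by the scripts above):

import torch
from transformers import LxmertForQuestionAnswering, LxmertTokenizer

tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
model = LxmertForQuestionAnswering.from_pretrained("unc-nlp/lxmert-vqa-uncased")

inputs = tokenizer("What color is the cat?", return_tensors="pt")
visual_feats = torch.randn(1, 36, 2048)  # stand-in for real image features
visual_pos = torch.rand(1, 36, 4)        # normalized box coordinates

with torch.no_grad():
    out = model(**inputs, visual_feats=visual_feats, visual_pos=visual_pos)
answer_id = out.question_answering_score.argmax(-1).item()
print("Predicted answer index:", answer_id)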

GQA

cd ./x-lxmert/
bash scripts/finetune_gqa.bash
bash scripts/test_gqa.bash

NLVR2

cd ./x-lxmert/
bash scripts/finetune_nlvr2.bash
bash scripts/test_nlvr2.bash

Image generation

Train image generator on MS COCO

cd ./image_generator/
bash scripts/train_generator.bash

or download the pretrained checkpoint:

wget -O image_generator/snap/pretrained/G_60.pth https://ai2-vision-x-lxmert.s3-us-west-2.amazonaws.com/image_generator/G_60.pth

Sample images

cd ./x-lxmert/
bash scripts/sample_image.bash
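
Under the hood, sampling is a two-stage pipeline: X-LXMERT predicts a cluster code for each spatial grid position conditioned on the text, the codes are mapped back to their k-means centroid features, and the separately trained image generator decodes that feature map into pixels. The following schematic sketch shows the data flow only; every function and constant here is a hypothetical stand-in, not the repo's actual API (see ./x-lxmert/src and ./image_generator for the real implementations):

import numpy as np
import torch

GRID, VOCAB, DIM = 8, 10_000, 2048  # assumed grid size, vocabulary, feature dim

# Hypothetical stand-in for the centroids produced by run_kmeans.py.
centroids = torch.randn(VOCAB, DIM)

def xlxmert_predict_codes(caption: str) -> torch.Tensor:
    # Stand-in: predict one cluster id per grid cell from the caption.
    return torch.randint(0, VOCAB, (GRID * GRID,))

def generator_decode(feats: torch.Tensor) -> np.ndarray:
    # Stand-in: decode a (GRID, GRID, DIM) feature map into an RGB image.
    return np.zeros((256, 256, 3), dtype=np.uint8)

codes = xlxmert_predict_codes("a red train on a bridge")
feats = centroids[codes].reshape(GRID, GRID, DIM)  # codes -> centroid features
image = generator_decode(feats)
print(image.shape)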

Reference

@inproceedings{Cho2020XLXMERT,
  title={X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers},
  author={Cho, Jaemin and Lu, Jiasen and Schwenk, Dustin and Hajishirzi, Hannaneh and Kembhavi, Aniruddha},
  booktitle={EMNLP},
  year={2020}
}