CLOSE: Cross-modaL transfer On Semantic Embeddings with CLIP
This repository contains the official code for the ICCV 2023 paper:
I Can't Believe There's No Images! Learning Visual Tasks Using Only Language Supervision
This project trains models on pure-text data and then shows they can be applied to the same tasks with visual inputs instead of text, thus demonstrating zero-shot cross-modal transfer. This is done by using the shared semantic embedding space of contrastive vision-and-language models.
More details can be found on our webpage.
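To make the transfer mechanism concrete, here is a minimal sketch of the idea (not the repository's code) using the Hugging Face transformers CLIP wrappers; the image file name and the linear task head are illustrative placeholders:
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Training time: embed text-only data with the CLIP text encoder.
text = processor(text=["a dog playing in the snow"], return_tensors="pt", padding=True)
text_emb = model.get_text_features(**text)        # shape (1, 512)

# Test time: embed an image with the CLIP image encoder instead.
image = processor(images=Image.open("example.jpg"), return_tensors="pt")  # illustrative path
image_emb = model.get_image_features(**image)     # shape (1, 512)

# Because both encoders map into the same semantic space, a task head trained
# only on text embeddings can consume image embeddings at test time.
head = torch.nn.Linear(512, 10)                   # hypothetical downstream head
prediction = head(image_emb / image_emb.norm(dim=-1, keepdim=True))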
Installation
Install PyTorch; we have tested this code with torch 1.10.1 and 1.11.0. Then install the other requirements with:
pip install -r requirements.txt
Finally, download the needed datasets. Data should be saved in the paths stored in close/file_paths.py. The data can be downloaded automatically using this script:
python close/download.py
This will download about 45G of data. The data can also be downloaded manually from these sources (a layout sanity check follows the list):
COCO data:
The 2014 train and validation images for COCO should be put in ~/data/coco/images. The annotations should be saved to ~/data/coco/annotations.
COCO Captions:
We use the Karpathy split found here. The coco_dataset.json file should be put into ~/data/coco.
Visual Entailment:
The data is built from here. By default the annotations should be in ~/data/SNLI_VE and the images in ~/data/SNLI_VE/Flickr30K.
VQA:
VQA annotations from here, which should be saved to ~/data/vqa2.
VQA-E:
Download the files from here; by default the files should be put into ~/data/vqa-e.
Visual News:
We use the Visual News dataset, a large corpus consisting of both news images and articles from several news sources. More details can be found here.
Email sophiag[at]allenai[dot]org for trained models on Visual News.
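As a quick sanity check after downloading, the default layout described above can be verified with a few lines of Python. The paths used by the code itself live in close/file_paths.py; this standalone snippet only checks the default ~/data locations named in this section:
from pathlib import Path

# Default locations described above; close/file_paths.py is the source of truth.
expected = [
    "~/data/coco/images",
    "~/data/coco/annotations",
    "~/data/coco/coco_dataset.json",   # Karpathy-split captions
    "~/data/SNLI_VE",
    "~/data/SNLI_VE/Flickr30K",
    "~/data/vqa2",
    "~/data/vqa-e",
]

for p in expected:
    path = Path(p).expanduser()
    print(("ok     " if path.exists() else "MISSING"), path)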
Adapters
The linear and covariance adapters we used in our ablations can be downloaded directly from AWS; see download_adapters in close/download.py.
Training
Training is done with close/experiments/train.py:
python close/experiments/train.py --data {vqa|vqa-e|ve|s-cap|m-cap} --output_dir path/to/output/dir
The --data flag controls which dataset to train on; s-cap is captioning in the single-caption setting and m-cap is captioning in the multiple-caption setting.
The script will use our default values; see the command-line args for how to change the parameters.
Evaluation
The evaluations can be done with eval.py, for example:
python close/experiments/eval.py path/to/model evqa --output_name default
Trained Models
Each model includes a model.json and a state-ep8.pth file. The E-VQA model can be downloaded like this:
mkdir model
mkdir model/r0
wget https://ai2-prior-close.s3.us-west-2.amazonaws.com/models/evqa/model.json -O model/model.json
wget https://ai2-prior-close.s3.us-west-2.amazonaws.com/models/evqa/r0/state-ep8.pth -O model/r0/state-ep8.pth
Other models can be downloaded by replacing evqa with one of the names below; a Python download sketch follows the list:
- s-cap: Captioning (Single)
- m-cap: Captioning (Multiple)
- vqa: VQA (note: trained on train+val)
- evqa: E-VQA
- ve: Visual Entailment
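For example, replacing evqa with ve fetches the Visual Entailment model. The sketch below does the same download in Python with urllib, assuming every model follows the model.json plus r0/state-ep8.pth layout shown above:
import urllib.request
from pathlib import Path

BASE = "https://ai2-prior-close.s3.us-west-2.amazonaws.com/models"
name = "ve"  # any of: s-cap, m-cap, vqa, evqa, ve

out_dir = Path("model")
(out_dir / "r0").mkdir(parents=True, exist_ok=True)

# Mirrors the wget commands above: model.json plus the epoch-8 checkpoint.
urllib.request.urlretrieve(f"{BASE}/{name}/model.json", out_dir / "model.json")
urllib.request.urlretrieve(f"{BASE}/{name}/r0/state-ep8.pth", out_dir / "r0" / "state-ep8.pth")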