DALLE_clip_score
Simple script to compute CLIP scores for a trained DALL-E model, using OpenAI's CLIP https://github.com/openai/CLIP. The CLIP score measures the compatibility between an image and a caption. The raw value is a cosine similarity, so it lies between -1 and 1. In CLIP, the value is scaled by 100 by default, giving a number between -100 and 100, where 100 means maximum compatibility between an image and a text. As mentioned in https://arxiv.org/abs/2104.14806, negative scores are rare, but we clamp the value to the range 0 to 100 anyway. Typical values are around 20-30.
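To make the scale concrete, here is a minimal sketch (not this package's code) of how such a clamped CLIP score can be computed for a single image/caption pair with OpenAI's CLIP; the image path and caption below are placeholders.

```python
# Minimal sketch: clamped CLIP score for one image/caption pair with OpenAI's CLIP.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_score(image_path: str, caption: str) -> float:
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize([caption]).to(device)
    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
    # Normalize so the dot product is the cosine similarity in [-1, 1]
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    score = 100.0 * (image_features * text_features).sum()
    # Negative scores are rare; clamp to [0, 100] as described above
    return float(score.clamp(min=0.0))

# Placeholder inputs for illustration
print(clip_score("bird.jpg", "a small bird with a red head"))
```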
How to install?
- Install CLIP from https://github.com/openai/CLIP
- Install the lucidrains DALL-E implementation from https://github.com/lucidrains/DALLE-pytorch
- Install this package: python setup.py install
How to use?
Here is an example:
clip_score --dalle_path dalle.pt --image_text_folder CUB_200_2011 --taming --num_generate 1 --dump
where:

- dalle_path: path of the model trained with DALL-E using https://github.com/lucidrains/DALLE-pytorch
- image_text_folder: folder of the dataset, following the format of https://github.com/lucidrains/DALLE-pytorch/loader.py
- taming: specify that taming transformers are used as the image encoder
- num_generate: number of images to generate per caption
- dump: save all the generated images in the folder outputs (by default), together with their respective metrics
Example output:
CLIP_score_real 30.1826171875
CLIP_score 26.7392578125
CLIP_score_top1 26.7392578125
CLIP_score_relative 0.8892822265625
CLIP_score_relative_top1 0.8892822265625
CLIP_atleast 0.7466491460800171
Note that all the metrics are also saved in clip_score.json by default.
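For convenience, here is a small example of reading those saved metrics back; it only assumes the default file name and the metric names shown above.

```python
# Read back the metrics saved in clip_score.json (the default output file)
import json

with open("clip_score.json") as f:
    metrics = json.load(f)

print(metrics["CLIP_score"], metrics["CLIP_score_relative"])
```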
- CLIP_score_real: average CLIP score for real images
- CLIP_score: average CLIP score over all generated images
- CLIP_score_top1: for each caption, retain the generated image with the best CLIP score, then compute the average CLIP score as in CLIP_score
- CLIP_score_relative: similar to https://arxiv.org/abs/2104.14806, the CLIP score of the generated image divided by the CLIP score of the real image, averaged over captions. It is generally between 0 and 1, although it can exceed 1; a value above 1 means the generated image has a higher CLIP score than the real one (see the sketch after this list)
- CLIP_score_relative_top1: same as CLIP_score_relative but using the top CLIP score as in CLIP_score_top1
- CLIP_atleast: for each caption, 1 if the CLIP score reaches at least --clip_thresh (25 by default), 0 otherwise, averaged over all captions. This gives a number between 0 and 1
For all scores, the higher, the better.
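For illustration, the following is a minimal sketch (assumed, not this package's implementation) of how these metrics could be aggregated from per-caption CLIP scores. The inputs real, generated and clip_thresh are hypothetical, and CLIP_atleast is assumed here to use the best generated image of each caption.

```python
# Assumed aggregation sketch: real[i] is the CLIP score of the real image for
# caption i, generated[i] is the list of CLIP scores of the images generated
# for caption i (hypothetical inputs, not this package's API).
import numpy as np

def aggregate(real, generated, clip_thresh=25.0):
    real = np.asarray(real, dtype=float)
    per_caption_mean = np.array([np.mean(s) for s in generated])
    per_caption_best = np.array([np.max(s) for s in generated])
    return {
        "CLIP_score_real": real.mean(),
        "CLIP_score": per_caption_mean.mean(),
        "CLIP_score_top1": per_caption_best.mean(),
        "CLIP_score_relative": (per_caption_mean / real).mean(),
        "CLIP_score_relative_top1": (per_caption_best / real).mean(),
        # Assumed: a caption counts if its best generated image reaches the threshold
        "CLIP_atleast": (per_caption_best >= clip_thresh).mean(),
    }

# Toy example with two captions and two generated images each
print(aggregate(real=[30.2, 29.1], generated=[[26.7, 24.3], [27.5, 25.1]]))
```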