DALLE2-pytorch
automated benchmark over big list of prompts
our latest decoder runs are currently a bit hard to evaluate; the during-training metrics and generations all seem pretty good
the generations also look pretty good overall when used with the prior and an upsampler

however, is it as good as the original dalle2? not quite, it's still missing a bit in generality
a nice way to solve this is to automatically evaluate on a list of difficult prompts and display the results nicely in wandb
this has been done by dalle mini, imagen, dalle2 and parti with pretty good success
see https://gist.github.com/rom1504/b5db3c98c5485c0ec5a1d22d79ca083a and https://gist.github.com/rom1504/a16b2259510632637e0978aad8933d8b and https://wandb.ai/dalle-mini/dalle-mini/reports/DALL-E-Mega--VmlldzoxODMxMDI2
let's build such an automated benchmark, so we know whether our models are getting better
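as a rough sketch of the wandb display part: something like the snippet below could run a fixed prompt list through a model and log the results as a table, so different runs can be compared side by side. the project name, the prompt entries and the `generate_image` callable are placeholders, not anything that exists yet.

```python
import wandb

# the shared, open-sourced prompt list would live in its own repo;
# a couple of placeholders here just to show the shape
PROMPTS = [
    "a red cube on top of a blue cube",
    "an astronaut riding a horse in a photorealistic style",
]

def log_prompt_benchmark(generate_image, run_name):
    # generate_image: prompt str -> PIL image / numpy array (placeholder callable)
    run = wandb.init(project = "dalle2-eval", name = run_name)
    table = wandb.Table(columns = ["prompt", "generation"])
    for prompt in PROMPTS:
        table.add_data(prompt, wandb.Image(generate_image(prompt), caption = prompt))
    run.log({"prompt_benchmark": table})
    run.finish()
```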
@nousr has started work on this
good idea! yeah we should definitely compile a set of prompts sorted by difficulty, open-sourced in some repository
then it would be trivial to pre-encode them for eval across a range of models and roll our human eyes over the results
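for the pre-encoding step, a minimal sketch could look like this, assuming the open_clip package and made-up file names (the actual prompt repo and model choice are still open questions):

```python
import torch
import open_clip

# load a CLIP text encoder once and cache the text embeddings for all models to reuse
model, _, _ = open_clip.create_model_and_transforms("ViT-L-14", pretrained = "openai")
tokenizer = open_clip.get_tokenizer("ViT-L-14")

with open("eval_prompts.txt") as f:  # hypothetical prompt list file
    prompts = [line.strip() for line in f if line.strip()]

with torch.no_grad():
    text_embeds = model.encode_text(tokenizer(prompts))
    text_embeds = text_embeds / text_embeds.norm(dim = -1, keepdim = True)

torch.save({"prompts": prompts, "text_embeds": text_embeds}, "prompt_embeds.pt")
```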
proposed evaluation procedure after discussion. we want to know:
- does the prior work
- does the decoder work
on both:
- real image/caption pairs
- ood generated image/prompt pairs (eg https://twitter.com/jd_pressman/status/1543778382894141441)
Procedure (a code sketch follows below):
1. take an image + prompt pair
2. compute the clip embedding of the image and the clip embedding of the text
3. give those to the decoder -> get image_without_prior
4. give prior(clip_text) + clip_text to the decoder -> get image_with_prior
5. display the real image, image_without_prior and image_with_prior
Then as metrics:
- compare the images from step 5
- compare the embeddings from steps 2 and 4
This should tell us whether the prior and the decoder are working, and whether we can handle ood inputs
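here is a minimal sketch of steps 1-5 plus the two metrics, assuming a clip wrapper exposing embed_image, and prior/decoder objects with sample methods roughly like the DALLE2-pytorch ones (treat the exact signatures as assumptions, not the actual API):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def evaluate_sample(clip, prior, decoder, real_image, text_tokens):
    # step 2: clip embedding of the real image (assumed to come back normalized)
    image_embed = clip.embed_image(real_image)

    # step 3: decoder conditioned on the real image embedding -> image_without_prior
    image_without_prior = decoder.sample(image_embed = image_embed, text = text_tokens)

    # step 4: prior predicts an image embedding from the text, decoder uses that instead
    predicted_image_embed = prior.sample(text_tokens)
    image_with_prior = decoder.sample(image_embed = predicted_image_embed, text = text_tokens)

    # metric on embeddings (steps 2 vs 4): how close the prior gets to the true clip image embedding
    prior_embed_cosine = F.cosine_similarity(image_embed, predicted_image_embed, dim = -1).mean().item()

    # metric on images (step 5): clip similarity of each generation back to the real image embedding
    sim_without_prior = F.cosine_similarity(clip.embed_image(image_without_prior), image_embed, dim = -1).mean().item()
    sim_with_prior = F.cosine_similarity(clip.embed_image(image_with_prior), image_embed, dim = -1).mean().item()

    return {
        # step 5: these three get displayed side by side (eg in the wandb table above)
        "real_image": real_image,
        "image_without_prior": image_without_prior,
        "image_with_prior": image_with_prior,
        "prior_embed_cosine": prior_embed_cosine,
        "clip_sim_without_prior": sim_without_prior,
        "clip_sim_with_prior": sim_with_prior,
    }
```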