DALLE2-pytorch
automated benchmark over big list of prompts
our latest decoder runs are currently a bit hard to evaluate; the during-training metrics and generations all seem pretty good
the generations also look pretty good overall when used with the prior and an upsampler

however, is it as good as the original dalle2? not quite, it's still missing a bit in generality
a nice way to solve this is to automatically evaluate on a list of difficult prompts and display the results nicely in wandb
this has been done by dalle mini, imagen, dalle2 and parti with pretty good success
see https://gist.github.com/rom1504/b5db3c98c5485c0ec5a1d22d79ca083a and https://gist.github.com/rom1504/a16b2259510632637e0978aad8933d8b and https://wandb.ai/dalle-mini/dalle-mini/reports/DALL-E-Mega--VmlldzoxODMxMDI2
let's build such an automated benchmark, so we know whether our models are getting better
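as a rough sketch of the wandb display part: something like the snippet below could run a fixed prompt list through a model and log the results as a table, so different runs can be compared side by side. the project name, the prompt entries and the `generate_image` callable are placeholders, not anything that exists yet.

```python
import wandb

# the shared, open-sourced prompt list would live in its own repo;
# a couple of placeholders here just to show the shape
PROMPTS = [
    "a red cube on top of a blue cube",
    "an astronaut riding a horse in a photorealistic style",
]

def log_prompt_benchmark(generate_image, run_name):
    # generate_image: prompt str -> PIL image / numpy array (placeholder callable)
    run = wandb.init(project = "dalle2-eval", name = run_name)
    table = wandb.Table(columns = ["prompt", "generation"])
    for prompt in PROMPTS:
        table.add_data(prompt, wandb.Image(generate_image(prompt), caption = prompt))
    run.log({"prompt_benchmark": table})
    run.finish()
```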
@nousr has started work on this
good idea! yeah we should definitely compile a set of prompts sorted by difficulty, open-sourced in some repository
then it would be trivial to pre-encode them for eval across a range of models and roll our human eyes over the results
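for the pre-encoding step, a minimal sketch could look like this, assuming the open_clip package and made-up file names (the actual prompt repo and model choice are still open questions):

```python
import torch
import open_clip

# load a CLIP text encoder once and cache the text embeddings for all models to reuse
model, _, _ = open_clip.create_model_and_transforms("ViT-L-14", pretrained = "openai")
tokenizer = open_clip.get_tokenizer("ViT-L-14")

with open("eval_prompts.txt") as f:  # hypothetical prompt list file
    prompts = [line.strip() for line in f if line.strip()]

with torch.no_grad():
    text_embeds = model.encode_text(tokenizer(prompts))
    text_embeds = text_embeds / text_embeds.norm(dim = -1, keepdim = True)

torch.save({"prompts": prompts, "text_embeds": text_embeds}, "prompt_embeds.pt")
```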
proposed evaluation procedure after discussion. we want to know:
- does the prior work
- does the decoder work
on both:
- real image/caption pairs
- ood generated image/prompt pairs (eg https://twitter.com/jd_pressman/status/1543778382894141441)
Procedure (a code sketch follows below):
1. take an image + prompt pair
2. compute the clip embedding of the image and the clip embedding of the text
3. give those to the decoder -> get image_without_prior
4. give prior(clip_text) + clip_text to the decoder -> get image_with_prior
5. display the real image, image_without_prior and image_with_prior
Then as metrics:
- compare the images from step 5
- compare the embeddings from steps 2 and 4
This should tell us whether the prior and the decoder are working, and whether we can handle ood inputs
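here is a minimal sketch of steps 1-5 plus the two metrics, assuming a clip wrapper exposing embed_image, and prior/decoder objects with sample methods roughly like the DALLE2-pytorch ones (treat the exact signatures as assumptions, not the actual API):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def evaluate_sample(clip, prior, decoder, real_image, text_tokens):
    # step 2: clip embedding of the real image (assumed to come back normalized)
    image_embed = clip.embed_image(real_image)

    # step 3: decoder conditioned on the real image embedding -> image_without_prior
    image_without_prior = decoder.sample(image_embed = image_embed, text = text_tokens)

    # step 4: prior predicts an image embedding from the text, decoder uses that instead
    predicted_image_embed = prior.sample(text_tokens)
    image_with_prior = decoder.sample(image_embed = predicted_image_embed, text = text_tokens)

    # metric on embeddings (steps 2 vs 4): how close the prior gets to the true clip image embedding
    prior_embed_cosine = F.cosine_similarity(image_embed, predicted_image_embed, dim = -1).mean().item()

    # metric on images (step 5): clip similarity of each generation back to the real image embedding
    sim_without_prior = F.cosine_similarity(clip.embed_image(image_without_prior), image_embed, dim = -1).mean().item()
    sim_with_prior = F.cosine_similarity(clip.embed_image(image_with_prior), image_embed, dim = -1).mean().item()

    return {
        # step 5: these three get displayed side by side (eg in the wandb table above)
        "real_image": real_image,
        "image_without_prior": image_without_prior,
        "image_with_prior": image_with_prior,
        "prior_embed_cosine": prior_embed_cosine,
        "clip_sim_without_prior": sim_without_prior,
        "clip_sim_with_prior": sim_with_prior,
    }
```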