
Build a fair evaluation of the prior

Open rom1504 opened this issue 3 years ago • 29 comments

We're starting to have our first prior now. (PR of the training script coming soon)

Time to evaluate. Ideas:

  • [x] MSE on the test set
  • [ ] Zero-shot eval on ImageNet: class text -> text emb -> prior -> image emb -> ranking (a sketch follows below) https://github.com/LAION-AI/project-menu/issues/13
  • [x] CLIP-guided generation with the prior
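
For the zero-shot ImageNet idea, here is a minimal sketch of how it could look. The `predict_image_embed` callable is a hypothetical stand-in for sampling the trained prior (not this repo's exact API); the rest uses the standard openai/CLIP interface.

```python
# Hypothetical sketch of the zero-shot ImageNet eval: class text -> text emb ->
# prior -> image emb, then rank real image embeddings against the predicted
# class embeddings. `predict_image_embed` is an assumed stand-in for sampling
# the trained diffusion prior.
import torch
import clip  # openai/CLIP or clip-anytorch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def zero_shot_accuracy(images, labels, class_names, predict_image_embed):
    # 1. embed one prompt per class with the CLIP text encoder
    prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
    text_embeds = model.encode_text(prompts).float()
    text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)

    # 2. run the class text embeddings through the prior to get predicted image embeddings
    class_embeds = predict_image_embed(text_embeds)                     # (num_classes, dim)
    class_embeds = class_embeds / class_embeds.norm(dim=-1, keepdim=True)

    # 3. embed the real images (assumed already preprocessed with `preprocess`)
    #    and classify each by the closest predicted class embedding
    image_embeds = model.encode_image(images.to(device)).float()
    image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
    preds = (image_embeds @ class_embeds.t()).argmax(dim=-1)

    return (preds == labels.to(device)).float().mean().item()
```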

If you have more ideas please share, I may be missing some obvious things.

We have more volunteers that want to help so I'll point some here :)

rom1504 avatar Apr 28 '22 22:04 rom1504

Already?! 🙏💯🎉

lucidrains avatar Apr 28 '22 22:04 lucidrains

I was going to add training scripts for all the components tomorrow morning ... 😂

lucidrains avatar Apr 28 '22 22:04 lucidrains

I think for the CLIP-guided generation you will still need a decoder conditioned on the CLIP image embedding, though you can probably get away with a small-resolution net for starters, just to validate.

lucidrains avatar Apr 28 '22 22:04 lucidrains

Thank you for your work! I've modified your code to work with the original CLIP and got the training scripts. What is a good small dataset to test this on?

xiankgx avatar Apr 28 '22 23:04 xiankgx

@xiankgx 👋 are you working with LAION? You should, because they have humongous datasets (and smaller test ones).

lucidrains avatar Apr 29 '22 00:04 lucidrains

> Thank you for your work! I've modified your code to work with the original CLIP and got the training scripts. What is a good small dataset to test this on?

Do you have a link to the CLIP that you used? I can try to incorporate it tomorrow using https://github.com/lucidrains/DALLE2-pytorch/blob/main/dalle2_pytorch/dalle2_pytorch.py#L95

lucidrains avatar Apr 29 '22 01:04 lucidrains

> Thank you for your work! I've modified your code to work with the original CLIP and got the training scripts. What is a good small dataset to test this on?
>
> Do you have a link to the CLIP that you used? I can try to incorporate it tomorrow using https://github.com/lucidrains/DALLE2-pytorch/blob/main/dalle2_pytorch/dalle2_pytorch.py#L95

I am using the CLIP from this link: https://github.com/openai/CLIP

xiankgx avatar Apr 29 '22 01:04 xiankgx

ohh got it, looks like they finally got it to be pip installable. I'll take a look tomorrow at an adapter!

lucidrains avatar Apr 29 '22 01:04 lucidrains

ok, the plan will be to automatically use the OpenAI CLIP by setting a use_openai_clip flag on both the Decoder and DiffusionPrior, which will allow researchers to skip the first step in the whole process
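
Purely as an illustration of that adapter idea (not the repo's actual interface), something along these lines would let the rest of the code talk to OpenAI CLIP through a couple of embedding methods:

```python
# Illustrative only: a rough wrapper around openai/CLIP so that the prior and
# decoder could consume text/image embeddings through one small interface.
# Method names here are hypothetical, not DALLE2-pytorch's real API.
import torch
import clip

class OpenAIClipWrapper:
    def __init__(self, name="ViT-B/32", device="cpu"):
        self.device = device
        self.model, self.preprocess = clip.load(name, device=device)

    @torch.no_grad()
    def embed_text(self, texts):
        tokens = clip.tokenize(texts).to(self.device)
        return self.model.encode_text(tokens).float()

    @torch.no_grad()
    def embed_image(self, images):
        # images are expected to already be preprocessed with self.preprocess
        return self.model.encode_image(images.to(self.device)).float()
```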

lucidrains avatar Apr 29 '22 03:04 lucidrains

> @xiankgx 👋 are you working with LAION? You should, because they have humongous datasets (and smaller test ones).

How can I help? Would be glad to help.

xiankgx avatar Apr 29 '22 04:04 xiankgx

Here is another benchmark we should definitely use:

https://github.com/cat-state/clip-retrieval/blob/main/clip_retrieval/clip_benchmark.py

The purpose of this benchmark should be to evaluate the ability of a CLIP model to retrieve correct, or at least semantically close, samples from a given dataset.

A first basic version of this script should focus on image-text pairs.

Later, it would be nice to have a general version of this benchmark that could be used for any pair of modalities like audio, video, text and images.

Let's say every sample has a component A (e.g. image) and a component B (e.g. text).

1. Take a sample A-B and query the B-kNN index with A for the closest neighbour A'-B'. Check whether B' = B, or even better, use a similarity encoder for modality B to estimate how similar B and B' are (for text, e.g. with https://huggingface.co/sentence-transformers/all-mpnet-base-v2).
2. Then take B from sample A-B and query the A-kNN index for the closest neighbour A'-B'. Check whether A' = A; alternatively, use a similarity encoder for modality A to estimate how similar A and A' are. If there is no good single-modality encoder to measure the semantic similarity of A and A' (like with image-image pairs at the moment), take the similarity of B and B' as a proxy (e.g. if B is text, check whether the text B' that belongs to the retrieved sample A'-B' is semantically close to the text B we used to query the image index).
3. Calculate the mean and standard deviation of the similarities for the A->B and B->A kNN queries over all samples in the evaluation set / n samples.

A sketch of the A->B direction is included after the repo link below.

https://github.com/rom1504/clip-retrieval
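
A minimal sketch of the A->B (image -> text) direction described above, assuming `image_embeds` / `text_embeds` are L2-normalized CLIP embeddings (float32 numpy arrays) for the same N pairs and `texts` holds the captions:

```python
# Sketch of the image -> text retrieval check: query a kNN index of text
# embeddings with each image embedding, then compare the retrieved caption to
# the true one, both exactly and via a sentence-similarity encoder.
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer, util

def image_to_text_retrieval(image_embeds, text_embeds, texts):
    # build the B-kNN index (inner product == cosine sim on normalized vectors)
    index = faiss.IndexFlatIP(text_embeds.shape[1])
    index.add(text_embeds)

    # for every image, retrieve its single nearest caption
    _, nn = index.search(image_embeds, 1)
    retrieved = [texts[i] for i in nn[:, 0]]

    # exact-match rate: how often the retrieved caption is the paired caption
    exact = float(np.mean([r == t for r, t in zip(retrieved, texts)]))

    # semantic similarity between retrieved and true captions
    encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
    sims = util.cos_sim(encoder.encode(retrieved, convert_to_tensor=True),
                        encoder.encode(texts, convert_to_tensor=True)).diagonal()
    return exact, sims.mean().item(), sims.std().item()
```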

christophschuhmann avatar Apr 29 '22 11:04 christophschuhmann

> Thank you for your work! I've modified your code to work with the original CLIP and got the training scripts. What is a good small dataset to test this on?
>
> Do you have a link to the CLIP that you used? I can try to incorporate it tomorrow using https://github.com/lucidrains/DALLE2-pytorch/blob/main/dalle2_pytorch/dalle2_pytorch.py#L95
>
> I am using the CLIP from this link: https://github.com/openai/CLIP

https://github.com/lucidrains/DALLE2-pytorch/tree/0.0.67#openai-clip ok, should be OpenAI clip compatible now, at some point I'll make it OpenCLIP compatible as well

lucidrains avatar Apr 29 '22 18:04 lucidrains

you can depend on clip-anytorch if you want openai clip from pypi (that's my pypi deployment of it)

rom1504 avatar Apr 30 '22 02:04 rom1504

nice! I'll refactor to use it maybe next week :)

lucidrains avatar Apr 30 '22 03:04 lucidrains

Oh, I mean there is no change to be made except pip install clip-anytorch. Everything else is the same, including the imports.
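
For anyone following along, a quick sanity-check sketch: after pip install clip-anytorch the import path and calls are the same as with the openai/CLIP repo.

```python
# Same usage as openai/CLIP, just installed from PyPI via clip-anytorch.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

tokens = clip.tokenize(["a photo of a dog"]).to(device)
with torch.no_grad():
    text_embed = model.encode_text(tokens)
print(text_embed.shape)  # (1, 512) for ViT-B/32
```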

rom1504 avatar Apr 30 '22 11:04 rom1504

@rom1504 it works! :pray: https://github.com/lucidrains/DALLE2-pytorch/releases/tag/0.0.73

lucidrains avatar Apr 30 '22 13:04 lucidrains

https://huggingface.co/rom1504/dalle2-diffusion-prior/resolve/main/1651432174.5708027_saved_model.pth here's a first checkpoint for the prior

let's start evaluation work!

rom1504 avatar May 01 '22 19:05 rom1504

https://colab.research.google.com/drive/1kUYIvWje6CVO9llqY_9bYYk6zMNh1sSh?usp=sharing first eval from Theo, comparing the predicted image embedding with the real embedding. Not amazing.

let's include that kind of metric in the training to see if things improve over time
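
Something like the following could serve as that cheap in-training metric; `predict_image_embed` is just an assumed placeholder for however the prior is sampled, since the exact call depends on the training script.

```python
# Sketch of the cheap eval: mean cosine similarity between the prior's
# predicted image embeddings and the real CLIP image embeddings on a
# held-out batch. `predict_image_embed` is an assumed placeholder.
import torch
import torch.nn.functional as F

@torch.no_grad()
def prior_cosine_similarity(text_embeds, image_embeds, predict_image_embed):
    pred = predict_image_embed(text_embeds)                 # (batch, dim)
    sims = F.cosine_similarity(pred, image_embeds, dim=-1)  # (batch,)
    return sims.mean().item()
```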

rom1504 avatar May 01 '22 21:05 rom1504

I should note there's a good chance this is some form of normalization issue on my end causing the results to look worse, as this is my first time playing around with the DALLE2-pytorch code. I'm going to play around with it some more and see.

TheoCoombes avatar May 01 '22 21:05 TheoCoombes

@TheoCoombes so one thing to note is that in the paper, they actually sampled a couple of image embeddings (well, just 2 I guess), and then selected the one with the highest similarity to the text embedding. So it seems they must have encountered the same difficulties. The logic is in this function here if you need it! https://github.com/lucidrains/DALLE2-pytorch/blob/main/dalle2_pytorch/dalle2_pytorch.py#L831
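
For reference, a rough sketch of that best-of-n trick (the repo's own version is at the linked line); `sample_image_embed` is an assumed stand-in for one sampling pass of the diffusion prior.

```python
# Sketch of the paper's reranking trick: sample n candidate image embeddings
# for the same text embedding and keep the one most similar to the text
# embedding. `sample_image_embed` is an assumed stand-in for the prior's sampler.
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_best_of_n(text_embed, sample_image_embed, n=2):
    candidates = torch.stack([sample_image_embed(text_embed) for _ in range(n)])  # (n, dim)
    sims = F.cosine_similarity(candidates, text_embed.unsqueeze(0), dim=-1)       # (n,)
    return candidates[sims.argmax()]
```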

lucidrains avatar May 02 '22 01:05 lucidrains

https://colab.research.google.com/drive/10P81dVS7YKCMUHF3FA7WD3Q_mp-cCWIA#scrollTo=VVElbFFcb5T7 new eval with the new checkpoint https://huggingface.co/krish240574/Dalle2-Diffusion-Prior/blob/main/1651473037.600823_saved_model.pth

Now it works, and the similarity improves! With the previous checkpoint it was 0.27 -> 0.09; now it's 0.27 -> 0.28.

we're going to make a PR to evaluate that automatically during training since it's cheap

rom1504 avatar May 02 '22 15:05 rom1504

@rom1504 very nice! :D

lucidrains avatar May 02 '22 18:05 lucidrains

I have a new model/eval set. This run tries out optimization parameters that are more in line with what the paper specifies.

It seems to show better performance, resulting in a similarity of ~0.78 (up from 0.28, on the previously used image). However, more work should be done on benchmarking since early testing in the discord shows that unrelated prompts can also score relatively high similarities.

Here is the model repository (hugging face link), and a W&B report of a 25M datapoint run.

nousr avatar May 02 '22 20:05 nousr

I've got a 300M point run going with the improved norm (re: #60).

I've also attempted to add a way to track the similarity with an unrelated text embedding. In short, I shuffle the text embeddings in an effort to simulate "unrelated" prompts...
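
The shuffle itself could be as simple as the sketch below (just an illustration of the idea, not the exact code going into the PR):

```python
# Sketch of the "unrelated prompt" baseline: roll the batch of text embeddings
# by one position so every predicted image embedding gets compared against some
# other sample's caption embedding.
import torch
import torch.nn.functional as F

@torch.no_grad()
def unrelated_baseline_similarity(pred_image_embeds, text_embeds):
    shuffled = torch.roll(text_embeds, shifts=1, dims=0)
    return F.cosine_similarity(pred_image_embeds, shuffled, dim=-1).mean().item()
```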

I'll PR the code when the run finishes and have you guys take a look at it to make sure the code & run results match what we would expect.

You can keep an eye on the run here (wandb report)

nousr avatar May 06 '22 01:05 nousr

I trained a prior with ViT-B/32 text and image embeddings using the train_diffusion_prior.py script. Additionally, I tracked the score for a fixed text embedding vs. the predicted image embedding. Ideally it should decrease over time, since the text and images are unrelated. It does, but not by much, and my weights toward the end still give a high score for unrelated text and predicted image embeddings.

CosineSim(Unrelated_Text_Embed, Prior(Text)) is upper graph

CosineSim(Related_Image_Embed, Prior(Text)) is lower graph

[Screenshot: the two cosine similarity curves over training steps]

NasirKhalid24 avatar May 06 '22 18:05 NasirKhalid24

Based on experience with CLIP, many texts can land in the same cosine-similarity ballpark even if some texts are better than others. Perhaps we can instead use softmax accuracy between generated image embeddings and the input text embeddings.
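
One way to phrase that suggestion, just as a sketch: score each predicted image embedding against every text embedding in the batch and check whether its own caption wins (the temperature value here is an arbitrary assumption).

```python
# Sketch of batch-wise softmax / top-1 retrieval accuracy between predicted
# image embeddings and the input text embeddings. The temperature is an
# arbitrary assumed value.
import torch
import torch.nn.functional as F

@torch.no_grad()
def batch_retrieval_accuracy(pred_image_embeds, text_embeds, temperature=0.07):
    pred = F.normalize(pred_image_embeds, dim=-1)
    text = F.normalize(text_embeds, dim=-1)
    logits = pred @ text.t() / temperature          # (batch, batch)
    preds = logits.softmax(dim=-1).argmax(dim=-1)   # predicted caption index per image embed
    targets = torch.arange(len(pred), device=pred.device)
    return (preds == targets).float().mean().item()
```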

xiankgx avatar May 07 '22 12:05 xiankgx

I just launched a run to test out a new metric, using deep-image-prior (#59), to generate images from the diffusion prior at 10k step intervals during training.

If we get to ~50k steps and it looks like it works, then we can create a much more diverse prompt-set to evaluate the prior on.

nousr avatar May 12 '22 18:05 nousr

https://github.com/lucidrains/DALLE2-pytorch/issues/23#issuecomment-1127011855 we can share the preprint once it gets released on arxiv

lucidrains avatar May 15 '22 20:05 lucidrains

This is almost done; zero-shot eval might be the last thing here, and some people are on it.

rom1504 avatar Jul 08 '22 21:07 rom1504