CLIP
How does CLIP think without the prompts?
Hi,
Thank you for this amazing contribution!
The central question I want to raise is: without entering any prompts, what words/phrases is CLIP "thinking"? I want to know whether this model can give out all (text) outputs for a given image, not just the ones queried by prompts — that is, for an image, the set of "all" words/phrases that the model associates with it, in decreasing order of relevance.
If it's possible, how do I begin? I need some help here.
It sounds like you're attempting to maximize similarity with the image embedding by searching over text embeddings.
Researchers did that here: https://distill.pub/2021/multimodal-neurons/. However, people ran into issues trying to reproduce it: https://github.com/openai/CLIP-featurevis/issues/2
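A simpler (non-exhaustive) approximation is to score a large candidate vocabulary against the image and keep the top matches. Below is a minimal sketch of that ranking step using plain NumPy; the function name `rank_phrases` and the toy 4-d embeddings are illustrative assumptions — in practice you would obtain the embeddings from CLIP's image and text encoders (e.g. via `model.encode_image` / `model.encode_text` from the `openai/CLIP` package).

```python
import numpy as np

def rank_phrases(image_emb, text_embs, phrases):
    """Rank candidate phrases by cosine similarity to an image embedding.

    image_emb: (d,) array; text_embs: (n, d) array, one row per phrase.
    Returns (phrase, similarity) pairs sorted by decreasing similarity.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img  # cosine similarity of each phrase to the image
    order = np.argsort(-sims)
    return [(phrases[i], float(sims[i])) for i in order]

# Toy demo with made-up 4-d embeddings (real CLIP embeddings are 512-d+).
image_emb = np.array([1.0, 0.0, 0.0, 0.0])
phrases = ["a dog", "a cat", "a car"]
text_embs = np.stack([
    np.array([0.9, 0.1, 0.0, 0.0]),  # close to the image embedding
    np.array([0.1, 0.9, 0.0, 0.0]),
    np.array([0.0, 0.0, 1.0, 0.0]),
])
ranked = rank_phrases(image_emb, text_embs, phrases)
print(ranked[0][0])  # most similar phrase: "a dog"
```

This only surfaces phrases from the vocabulary you supply, so it can't enumerate everything CLIP "knows" — the feature-visualization work linked above tries to go further by optimizing over text directly.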