CLIP: Can the text embedding be recovered to text?

Open Zhangwenyao1 opened this issue 1 year ago • 2 comments

Thanks for your excellent work! I want to know whether a text embedding can be converted back into text.

Feb 27 '24 07:02 Zhangwenyao1

I believe you'd have to train your own decoder to make something like that work.

Mar 31 '24 19:03 hamza13-12

If you mean "getting CLIP's opinion about an image", yes, you can do that using gradient ascent. You optimize for the text whose embedding is most similar to the image features, and you get a stochastic CLIP textual description of the image.

If you feed in an image of a cat, CLIP will almost certainly conclude "cat", among many other things that may puzzle humans. For example, it may conclude "map" from your tabby cat's fur pattern, or construct long conjoined words like "hallucinkaleidodimensional" for something colorful.
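To make the gradient ascent idea a bit more concrete, here is a toy sketch in plain Python. It does not use CLIP at all: the four-word vocabulary, the token vectors (each toy token simply gets its own axis), the `IMAGE_FEATURES` vector, and every function name are made up for illustration. A real setup would presumably run the actual CLIP encoders under an autograd framework; this sketch uses finite differences instead, just to stay dependency-free.

```python
import math

# Toy stand-ins (assumptions for illustration, NOT CLIP's vocabulary or weights).
VOCAB = ["cat", "dog", "map", "car"]
TOKEN_VECS = [
    [1.0, 0.0, 0.0, 0.0],  # cat
    [0.0, 1.0, 0.0, 0.0],  # dog
    [0.0, 0.0, 1.0, 0.0],  # map
    [0.0, 0.0, 0.0, 1.0],  # car
]
# Hypothetical "image features": mostly cat-like, with some map-like component.
IMAGE_FEATURES = [0.9, 0.0, 0.4, 0.0]


def softmax(logits):
    """Turn unconstrained logits into a probability distribution over VOCAB."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def soft_text_embedding(logits):
    """Toy 'text encoder': probability-weighted mix of the token vectors."""
    probs = softmax(logits)
    dim = len(TOKEN_VECS[0])
    return [sum(p * vec[d] for p, vec in zip(probs, TOKEN_VECS)) for d in range(dim)]


def score(logits, image_features):
    """Objective to maximize: similarity of the soft text to the image features."""
    return cosine(soft_text_embedding(logits), image_features)


def gradient_ascent(image_features, steps=500, lr=0.5, eps=1e-5):
    """Climb the similarity score, estimating gradients by central differences."""
    logits = [0.0] * len(VOCAB)
    for _ in range(steps):
        grad = []
        for i in range(len(logits)):
            up = list(logits)
            down = list(logits)
            up[i] += eps
            down[i] -= eps
            diff = score(up, image_features) - score(down, image_features)
            grad.append(diff / (2 * eps))
        # Ascent step: move the logits in the direction that raises the score.
        logits = [z + lr * g for z, g in zip(logits, grad)]
    return logits


def rank_tokens(logits):
    """Return (token, probability) pairs, most probable first."""
    probs = softmax(logits)
    return sorted(zip(VOCAB, probs), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    final_logits = gradient_ascent(IMAGE_FEATURES)
    for token, prob in rank_tokens(final_logits):
        print(f"{token}: {prob:.3f}")
```

In this toy setup the ranking should put "cat" first and "map" second, which mirrors the behaviour described above: the optimization surfaces whatever tokens best explain the image features, not only the obvious one.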

I made an intuitive and interactive GUI with attention visualization (you can see "where CLIP was looking" for a given word): https://github.com/zer0int/CLIP-XAI-GUI

Or, for batch processing via the command-line: https://github.com/zer0int/CLIP-text-image-interpretability

May 24 '24 23:05 zer0int
