CLIP Question: Image to Token(s) encoder?

Open chi0tzp opened this issue 2 years ago • 2 comments

Hi, I'm interested in training an encoder that maps images to the token space.

Maybe something like a ResNet backbone that learns to map input images to token embeddings. These token embeddings, after being passed through the CLIP text encoder, should yield text-side embeddings that are close to the image embeddings produced by the CLIP image encoder for the same images.
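A minimal sketch of this setup, assuming PyTorch. Everything named here is hypothetical: `ImageToTokens` is a tiny CNN standing in for the ResNet backbone, and `StandInTextEncoder` / `StandInImageEncoder` are small frozen placeholders for the CLIP encoders so the snippet runs without downloading weights. The sizes and the cosine-distance loss are also assumptions, not something CLIP prescribes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical sizes: number of pseudo-tokens, token width, joint embedding width.
N_TOKENS, TOKEN_DIM, EMBED_DIM = 4, 32, 16


class ImageToTokens(nn.Module):
    """Trainable encoder: image -> N_TOKENS token embeddings (ResNet stand-in)."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.head = nn.Linear(8, N_TOKENS * TOKEN_DIM)

    def forward(self, images):
        return self.head(self.backbone(images)).view(-1, N_TOKENS, TOKEN_DIM)


class StandInTextEncoder(nn.Module):
    """Placeholder for the CLIP text encoder, fed token embeddings directly."""

    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(TOKEN_DIM, EMBED_DIM)

    def forward(self, token_embeddings):
        return self.proj(token_embeddings.mean(dim=1))


class StandInImageEncoder(nn.Module):
    """Placeholder for the CLIP image encoder."""

    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(3, EMBED_DIM)

    def forward(self, images):
        return self.proj(images.mean(dim=(2, 3)))


def freeze(module):
    """Disable gradients for all parameters and switch to eval mode."""
    for p in module.parameters():
        p.requires_grad_(False)
    return module.eval()


def train_step(mapper, text_enc, image_enc, optimizer, images):
    """One step: pull text-side embeddings toward the frozen image embeddings."""
    with torch.no_grad():
        target = image_enc(images)
    pred = text_enc(mapper(images))  # gradients reach the mapper only
    loss = (1.0 - F.cosine_similarity(pred, target, dim=-1)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


mapper = ImageToTokens()
text_enc = freeze(StandInTextEncoder())
image_enc = freeze(StandInImageEncoder())
optimizer = torch.optim.Adam(mapper.parameters(), lr=1e-2)

images = torch.randn(8, 3, 32, 32)  # random batch in place of real data
losses = [train_step(mapper, text_enc, image_enc, optimizer, images) for _ in range(50)]
```

With the real model, the two stand-ins would be replaced by the frozen CLIP encoders. Note that, as far as I can tell, `encode_text` in the official package takes token ids rather than embeddings, so the predicted embeddings would have to be injected after the token embedding lookup, which likely means re-implementing that forward pass.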

Has anyone tried anything like this?

Thank you!

Dec 23 '21 11:12 chi0tzp

Are you familiar with CoOp? It does something similar:

https://github.com/KaiyangZhou/CoOp

Dec 23 '21 11:12 Rijgersberg

Hi @Rijgersberg, I wasn't aware of CoOp, many thanks for sharing!

Dec 23 '21 12:12 chi0tzp
