CLIP
Question: Image to Token(s) encoder?
Hi, I'm interested in training an encoder that maps images into the token embedding space of the CLIP text encoder.
Maybe something like a ResNet backbone that learns to map input images to token embeddings. These token embeddings, after being passed through the CLIP text encoder, should yield embeddings close to the image embeddings produced by the CLIP image encoder for the same images.
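To make the idea concrete, here is a rough PyTorch sketch of the training objective I have in mind. It is only an illustration under assumptions: `FrozenTextEncoder` and `FrozenImageEncoder` are tiny hypothetical stand-ins for the real CLIP encoders (not the CLIP API), a small CNN replaces the ResNet backbone, and the sizes and the cosine-distance loss are my own choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

EMBED_DIM = 32   # joint image/text embedding size (hypothetical)
TOKEN_DIM = 16   # width of the text encoder's token embeddings (hypothetical)
NUM_TOKENS = 4   # pseudo-tokens predicted per image (hypothetical)


class ImageToTokens(nn.Module):
    """Trainable encoder: image -> sequence of pseudo-token embeddings."""

    def __init__(self):
        super().__init__()
        # A tiny CNN stands in for the ResNet backbone (8x8 input -> 4x4 map).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Linear(8 * 4 * 4, NUM_TOKENS * TOKEN_DIM)

    def forward(self, images):
        return self.head(self.backbone(images)).view(-1, NUM_TOKENS, TOKEN_DIM)


class FrozenTextEncoder(nn.Module):
    """Stand-in for the CLIP text encoder, fed embeddings instead of token ids."""

    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(TOKEN_DIM, EMBED_DIM)

    def forward(self, token_embeddings):
        return self.proj(token_embeddings).mean(dim=1)


class FrozenImageEncoder(nn.Module):
    """Stand-in for the CLIP image encoder that provides the targets."""

    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(3 * 8 * 8, EMBED_DIM)

    def forward(self, images):
        return self.proj(images.flatten(1))


def freeze(module):
    # Frozen weights still let gradients flow back to the module's inputs.
    for p in module.parameters():
        p.requires_grad_(False)
    return module.eval()


text_encoder = freeze(FrozenTextEncoder())
image_encoder = freeze(FrozenImageEncoder())
mapper = ImageToTokens()
optimizer = torch.optim.Adam(mapper.parameters(), lr=1e-2)

images = torch.randn(8, 3, 8, 8)  # random toy batch, not real data
losses = []
for _ in range(50):
    with torch.no_grad():
        target = image_encoder(images)
    predicted = text_encoder(mapper(images))
    # Cosine distance between the "text" embedding and the image embedding.
    loss = (1 - F.cosine_similarity(predicted, target, dim=-1)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

Only `mapper` is updated here; both encoders stay frozen. With the real CLIP model, the predicted embeddings would presumably have to be injected after the token embedding layer of the text encoder, which I have not worked out.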
Has anyone tried anything like this?
Thank you!
Are you familiar with CoOp? It does something similar:
https://github.com/KaiyangZhou/CoOp
Hi @Rijgersberg, I wasn't aware of CoOp, many thanks for sharing!