LanguageBind
Any plans to use Long-CLIP to extend text input token limit?
If I read your paper correctly, you have frozen the CLIP text encoder and only aligned the other modalities to it. Do you think a pretrained Long-CLIP model could be used as a drop-in replacement for the text encoder in LanguageBind to extend the token limit?
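To illustrate the question, here is a minimal sketch (not the actual LanguageBind or Long-CLIP APIs; `StubTextEncoder` and the module internals are stand-ins). The assumption behind the swap is that a drop-in replacement only has to preserve the shape of the shared embedding space the other modality encoders were aligned to, while relaxing the 77-token context limit that CLIP imposes (Long-CLIP extends this to 248 tokens):

```python
# Hypothetical sketch, NOT real LanguageBind/Long-CLIP code: a CLIP-style
# text tower maps token ids to one pooled embedding. If Long-CLIP keeps
# the same output dimensionality, only the context length changes.
import torch
import torch.nn as nn

EMBED_DIM = 768        # shared embedding dim assumed here (CLIP ViT-L/14 text tower)
CLIP_MAX_TOKENS = 77   # CLIP's context length
LONG_MAX_TOKENS = 248  # Long-CLIP's extended context length

class StubTextEncoder(nn.Module):
    """Stand-in for a CLIP-style text encoder: token ids -> pooled vector."""
    def __init__(self, max_tokens: int, embed_dim: int = EMBED_DIM):
        super().__init__()
        self.max_tokens = max_tokens
        self.embed = nn.EmbeddingBag(49408, embed_dim)  # CLIP BPE vocab size

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Enforce the context limit, as CLIP's positional embeddings do.
        if token_ids.shape[1] > self.max_tokens:
            raise ValueError("input exceeds context length")
        return self.embed(token_ids)

clip_text = StubTextEncoder(CLIP_MAX_TOKENS)
long_text = StubTextEncoder(LONG_MAX_TOKENS)

# A 200-token prompt: rejected by the CLIP-length tower,
# accepted by the Long-CLIP-length tower with the same output shape.
long_prompt = torch.randint(0, 49408, (1, 200))
try:
    clip_text(long_prompt)
except ValueError as e:
    print("CLIP tower:", e)
print("Long tower output shape:", tuple(long_text(long_prompt).shape))
```

Since the output shape is unchanged, the frozen video/audio/depth embeddings could in principle still be compared against the new text embeddings; whether alignment quality survives the swap is the open question.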