
[Question] Inference mode on InternVL2


Thank you for your elegant work! I am wondering whether InternVL2 offers the same functionality as InternVL-C in earlier versions, which supported cross-modal feature retrieval, or how I can otherwise obtain aligned embeddings for image-text pairs. I have tried extracting features by calling `model.extract_feature()` for images and `model.language_model.get_input_embeddings()` for texts, but the resulting embeddings show very low similarities. Thanks again for your precious time!
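For reference, a minimal sketch of the attempt described above (not an endorsed retrieval API). It assumes an InternVL2 chat checkpoint name and the 448x448 ImageNet-style preprocessing from the model card, and pools tokens naively with a mean; adjust these to your setup:

```python
import torch
import torchvision.transforms as T
from PIL import Image
from transformers import AutoModel, AutoTokenizer

path = 'OpenGVLab/InternVL2-8B'  # assumption: any InternVL2 chat checkpoint
model = AutoModel.from_pretrained(
    path, torch_dtype=torch.bfloat16, trust_remote_code=True).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

# Assumption: single 448x448 tile with the ImageNet mean/std used in the quick start.
transform = T.Compose([
    T.Resize((448, 448)),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
pixel_values = transform(Image.open('cat.jpg').convert('RGB'))
pixel_values = pixel_values.unsqueeze(0).to(torch.bfloat16).cuda()

with torch.no_grad():
    # Projected vision tokens, shape (1, num_tokens, llm_hidden_size).
    img_tokens = model.extract_feature(pixel_values)
    img_emb = img_tokens.mean(dim=1)  # naive mean pooling

    ids = tokenizer('a photo of a cat', return_tensors='pt').input_ids.cuda()
    # Raw input embeddings of the language model; never trained contrastively.
    txt_tokens = model.language_model.get_input_embeddings()(ids)
    txt_emb = txt_tokens.mean(dim=1)

sim = torch.nn.functional.cosine_similarity(img_emb.float(), txt_emb.float())
print(sim)  # typically low: the two spaces are not aligned for retrieval
```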

XYxiyang avatar Sep 25 '24 09:09 XYxiyang

In other words, can I obtain embeddings for images and texts separately such that matching pairs have high similarity?

XYxiyang avatar Sep 25 '24 09:09 XYxiyang

same question here

xiexh20 avatar Oct 14 '24 18:10 xiexh20

Same question here. Or, alternatively, how can combined features be extracted?

wangpichao avatar Oct 15 '24 02:10 wangpichao

According to the authors, versions later than InternVL1 are trained solely with next-token prediction; therefore, they no longer support embedding retrieval. Thanks a lot.
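For anyone who still needs aligned embeddings, the contrastive InternVL-C checkpoints from the 1.0 release remain the route for retrieval. A hedged sketch follows; the checkpoint name, the `summarize:` text prefix, and the `mode` argument are taken from the InternVL 1.0 README as I recall it, so please verify against the repo before relying on this:

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, CLIPImageProcessor

path = 'OpenGVLab/InternVL-14B-224px'  # contrastive InternVL 1.0 checkpoint
model = AutoModel.from_pretrained(
    path, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True,
    trust_remote_code=True).cuda().eval()
image_processor = CLIPImageProcessor.from_pretrained(path)
tokenizer = AutoTokenizer.from_pretrained(path, use_fast=False, add_eos_token=True)
tokenizer.pad_token_id = 0

images = [Image.open('cat.jpg').convert('RGB')]
texts = ['summarize:a photo of a cat', 'summarize:a photo of a dog']

pixel_values = image_processor(images=images, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()
input_ids = tokenizer(texts, return_tensors='pt', max_length=80,
                      truncation=True, padding='max_length').input_ids.cuda()

with torch.no_grad():
    # CLIP-style similarity logits between every image and every text.
    logits_per_image, logits_per_text = model(
        image=pixel_values, text=input_ids, mode='InternVL-C')
print(logits_per_image.softmax(dim=-1))  # image-to-text retrieval scores
```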

XYxiyang avatar Oct 15 '24 04:10 XYxiyang