T-Rex Multimodal Prompt Collaboration

Dear authors,

I'm very interested in the text-visual interaction and collaboration described in your paper. In real-world tasks, multiple prompts are often used, and it's challenging to make these different prompts coordinate effectively to achieve the best results. I've read your paper and tried to find the corresponding implementation of text-visual collaboration in the code, but it seems that most parts directly call the T-Rex2 API for processing. This didn't fully answer my questions. I'd like to know the specific implementation details of how to coordinate multiple types of prompts (such as text and visual prompts) to work together. Could you please provide some insights or guidance on this?

Thank you!

May 23 '25 06:05 lmr2706

Hi @lmr2706, thank you for your interest in our work. Currently, the T-Rex2 API only supports visual prompts, not text prompts, so this functionality cannot be achieved through the API. However, I can explain how we actually handle it.

During training, we incorporated a contrastive alignment loss, which aligns text prompts and visual prompts into the same embedding space. For example, suppose we have a text prompt embedding for “dog” with a shape of 1×256, and three visual prompt embeddings for different dogs with a shape of 3×256. We simply concatenate them into a 4×256 embedding, and then take the average to obtain a final 1×256 embedding.
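A minimal sketch of the concatenate-and-average step described above, using NumPy with random arrays as stand-ins for real embeddings. The function name `fuse_prompts` and the random inputs are illustrative assumptions, not taken from the T-Rex2 codebase.

```python
import numpy as np

def fuse_prompts(text_emb: np.ndarray, visual_embs: np.ndarray) -> np.ndarray:
    """Fuse one text embedding (1, D) with N visual embeddings (N, D).

    Hypothetical helper: concatenates along the prompt axis and averages,
    which assumes both inputs already live in the same embedding space.
    """
    stacked = np.concatenate([text_emb, visual_embs], axis=0)  # (1 + N, D)
    return stacked.mean(axis=0, keepdims=True)                 # (1, D)

# Random stand-ins for the "dog" text embedding and three visual embeddings.
rng = np.random.default_rng(0)
text_emb = rng.standard_normal((1, 256))
visual_embs = rng.standard_normal((3, 256))

fused = fuse_prompts(text_emb, visual_embs)
print(fused.shape)
```

Note that a plain average gives every prompt equal weight: with one text prompt and three visual prompts, the text embedding contributes 1/4 of the fused result.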

May 25 '25 03:05 Mountchicken

Hi @Mountchicken, I have a question about the model details. The T-Rex2 paper says it uses CLIP as the text encoder and takes the [CLS] token as the text embedding. If one mini-batch contains 50 categories (including negative categories), how do you obtain a [CLS] token for each of these categories with CLIP? For example, if the input text is "dog, cat, bike ...", how do you design the input text template for the [CLS] token?

Nov 12 '25 13:11 JacheSha