Grounded_3D-LLM Encoding referent tokens

I notice that referent tokens are interleaved in the output. Can multiple referent tokens appear in a single text prompt, such as "Describe the table and the chair."?

Aug 23 '24 04:08 Germany321

Yes, referent tokens can appear multiple times in one prompt. The current language data focuses mainly on single objects, which may limit performance with multiple referent tokens. Please refer to the supplementary file for the instruction templates that the model was trained on.

Aug 23 '24 04:08 chenyilun95

Thanks for your quick reply. Another question: if there are multiple referent tokens in the prompt, how do you distinguish the scene queries that belong to different referents? In the above example, "Describe the table < /ref> and the chair < /ref>.", it seems that decoding the "< /ref>" token alone cannot distinguish the two instances. How do you retrieve the separate object queries for the table and the chair from this referent token?

Aug 23 '24 05:08 Germany321

Prior scene queries can be decomposed into scene masks, which gives us the mapping between instances and queries. During training, a query whose mask IoU with an instance is greater than 0.3 is considered a positive match for that instance (see Sec. B of the supplementary material).
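
To make the matching concrete, here is a minimal sketch of the idea, not the repository's actual code. It assumes each scene query yields a binary mask over the scene points and that each referent token is annotated with the ground-truth instance it refers to. The function names (`mask_iou`, `positive_queries`) and the toy 10-point scene are hypothetical; only the 0.3 threshold comes from the answer above.

```python
import numpy as np

def mask_iou(query_masks, instance_masks):
    """Pairwise IoU between binary masks.

    query_masks:    (Q, N) bool, one mask per scene query
    instance_masks: (K, N) bool, one mask per referred instance
    returns:        (K, Q) float IoU matrix
    """
    q = query_masks.astype(np.float64)
    g = instance_masks.astype(np.float64)
    inter = g @ q.T                                    # (K, Q) overlap counts
    union = g.sum(1, keepdims=True) + q.sum(1)[None, :] - inter
    return np.where(union > 0, inter / np.maximum(union, 1.0), 0.0)

def positive_queries(query_masks, instance_masks, iou_thresh=0.3):
    """For each instance, list the query indices with IoU above the threshold."""
    iou = mask_iou(query_masks, instance_masks)
    return [np.flatnonzero(row > iou_thresh).tolist() for row in iou]

# Toy scene with 10 points: the table covers points 0-4, the chair points 5-9.
N = 10
instances = np.zeros((2, N), dtype=bool)
instances[0, 0:5] = True    # instance for the first referent token (table)
instances[1, 5:10] = True   # instance for the second referent token (chair)

# Three hypothetical scene-query masks.
queries = np.zeros((3, N), dtype=bool)
queries[0, 0:4] = True      # mostly table
queries[1, 4:10] = True     # mostly chair
queries[2, 2:7] = True      # straddles both

matches = positive_queries(queries, instances)
print(matches)  # [[0, 2], [1]]
```

In this toy case the first referent token is matched to queries 0 and 2 and the second to query 1, so the two referents resolve to different query sets even though the token string itself is identical.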

Aug 23 '24 07:08 chenyilun95

Thanks for the reply, I finally understand the mechanism.

Aug 28 '24 07:08 Germany321