Grounded_3D-LLM

Encoding referent tokens

Open Germany321 opened this issue 1 year ago • 4 comments

I notice referent tokens are interleaved in the output. Can multiple referent tokens appear in a single text prompt, such as "Describe the table and the chair ."?

Germany321 avatar Aug 23 '24 04:08 Germany321

Yes, referent tokens can occur multiple times in one prompt. The current language data focuses mainly on single objects, which may limit performance with multiple referent tokens. For the instruction templates the model was trained on, please refer to the supplementary file.

chenyilun95 avatar Aug 23 '24 04:08 chenyilun95

Thanks for your quick reply. Another question: if the prompt contains multiple referent tokens, how do you distinguish the scene queries each one refers to? In the example above, "Describe the table < /ref> and the chair < /ref>.", decoding the "< /ref>" token alone does not seem sufficient to tell the two instances apart. How do you retrieve the separate object queries for the table and the chair from this referent token?

Germany321 avatar Aug 23 '24 05:08 Germany321

Prior scene queries can be decomposed into scene masks, which gives us the mapping between instances and queries. During training, a query whose mask IoU with an instance exceeds 0.3 is treated as a positive match (see Sec. B of the supplementary material).

chenyilun95 avatar Aug 23 '24 07:08 chenyilun95
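
For readers following along, the matching described above can be sketched roughly as follows. This is a minimal illustration and not the repository's actual code: the function names `mask_iou` and `match_queries`, the per-point boolean mask layout, and the toy data are all assumptions made for the example. Only the 0.3 IoU threshold comes from the reply above.

```python
import numpy as np

def mask_iou(query_masks, instance_masks):
    """Pairwise IoU between per-point boolean masks.

    query_masks:    (Q, N) bool, one mask per scene query (hypothetical layout)
    instance_masks: (G, N) bool, one mask per referenced instance
    returns:        (Q, G) float array of IoU values
    """
    q = query_masks.astype(np.float64)
    g = instance_masks.astype(np.float64)
    inter = q @ g.T
    union = q.sum(axis=1)[:, None] + g.sum(axis=1)[None, :] - inter
    # Guard against empty masks: IoU is defined as 0 when the union is empty.
    return np.where(union > 0, inter / np.maximum(union, 1.0), 0.0)

def match_queries(query_masks, instance_masks, iou_threshold=0.3):
    """Map each instance index to the scene queries counted as positive matches."""
    iou = mask_iou(query_masks, instance_masks)
    return {
        g: np.flatnonzero(iou[:, g] > iou_threshold).tolist()
        for g in range(instance_masks.shape[0])
    }

# Toy scene with 6 points: instance 0 is "the table", instance 1 is "the chair".
instances = np.array([[1, 1, 1, 0, 0, 0],
                      [0, 0, 0, 1, 1, 1]], dtype=bool)
queries = np.array([[1, 1, 0, 0, 0, 0],   # overlaps the table (IoU 2/3)
                    [0, 0, 0, 1, 1, 1],   # covers the chair exactly (IoU 1)
                    [0, 0, 1, 1, 0, 0]],  # straddles both (IoU 1/4 each)
                   dtype=bool)
matches = match_queries(queries, instances)
print(matches)
```

In this toy case each referent ends up with its own set of positive queries, and the query that straddles both objects falls below the threshold for either one.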

Thanks for the reply. I finally understand the mechanism.

Germany321 avatar Aug 28 '24 07:08 Germany321