GroundingDINO
Some questions on the details in the paper
To the Authors
This is very interesting work on visual grounding with a query-based detector. The paper is well written and clear, and the results with GLIGEN are super interesting as well. I do have a few specific questions about the implementation and concepts in the paper.
- On the language-guided query selection: this module makes a lot of sense. You are basically extracting the locations of the image tokens that have the greatest responses to the text tokens, and then using these as the positional queries in the mixed query selection design from DINO. I notice you describe the outer product between text/image tokens as logits. My questions are: (a) Is there any supervision at this level? If not, did you use any pretrained vision-language initialization so that the tokens naturally respond to each other? (b) Would it make more sense to use normalized feature vectors so that the dot product is actually a correlation? (c) What happens if the selected image tokens all respond to the same text token, or to only a few text tokens, and is there a way to separate them out, like the first-stage training in Deformable DETR or DINO?
- On the sub-sentence-level text feature: (a) How is the attention mask produced when dealing with weak annotations such as image-caption pairs (Cap4M)? Did you use a noun-extraction method as described in DetCLIP? As a concrete example, how would you generate the attention mask for a concept like "fruit fly", or a human name such as "Harry Potter", when the detection dataset doesn't have that category? (b) How do you handle the input-length limit, which GLIP describes in their paper, when you have over 1000 categories like LVIS during training/inference? Was there a sparse negative-category sampling strategy?
- Loss function: is the negative class handled similarly to the alignment loss described in GLIP or MDETR? I assume you apply a sigmoid focal loss and the negative object queries simply learn the 0 from the {0, 1} binary target?
- Last but not least, do you think it is possible to leverage other frameworks such as pretrained ALBEF, VLMo, or even BEiT-3 and inject your design into them? If not, what do you think are the limitations of these frameworks?
Thank you.
Thanks for your questions. We will provide the demo with GLIGEN soon.
- (a) Similar to DINO, we calculate a loss after the encoder, which supervises this module. (b) It is a good question. We use the outer product to mimic the linear classification layer, where no normalization is used. I think it is worth measuring the influence of normalization. (c) I have not dealt with the situation you mentioned, but I think a well-trained model can respond correctly to the text tokens. If the selected image tokens all respond to the same text token, it may mean that objects for the other text tokens do not exist in the image. That is still a good question; we will try more corner cases for the model. (A rough sketch of the selection step follows this list.)
- (a) We only use the attention masks for detection data, which have category annotations. For other data, we do not separate phrases within a sentence (see the mask sketch after this list). (b) We simply clip the sentence to ensure it is not too long. We have not tried other sampling strategies. It would be great if you would like to work with us to improve the model.
- The loss function is similar to GLIP's; more specifically, it is a focal loss (a small sketch is included below).
- That is a good question as well. Most of these models are designed for representation learning; we could simply add a Grounding DINO decoder on top of them for open-set detection.
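A minimal PyTorch sketch of the language-guided query selection discussed in points (a)-(c) above: compute the unnormalized image-text dot products (the "logits"), take each image token's maximum response over the text tokens, and keep the top-k positions as query locations. Shapes and names are illustrative, not the repository's actual API.

```python
import torch

def language_guided_query_selection(image_features, text_features, num_queries=900):
    """Select the image-token positions with the strongest text responses.

    image_features: (bs, num_img_tokens, d)  flattened multi-scale image tokens
    text_features:  (bs, num_text_tokens, d) encoded text tokens
    Returns indices of shape (bs, num_queries) into the image-token dimension,
    which can then initialize the decoder's positional (anchor) queries,
    as in DINO's mixed query selection.
    """
    # Unnormalized dot products between every image token and every text token.
    logits = torch.einsum("bid,btd->bit", image_features, text_features)

    # Each image token keeps its largest response over all text tokens ...
    max_logit_per_img_token = logits.max(dim=-1).values  # (bs, num_img_tokens)

    # ... and the top-k image tokens become the selected query positions.
    return torch.topk(max_logit_per_img_token, num_queries, dim=1).indices
```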
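Similarly, a small sketch of the sub-sentence attention mask mentioned in the second point, assuming the token spans of each category phrase are already known (e.g. category names concatenated with "." separators); how special tokens are handled in the actual implementation may differ.

```python
import torch

def build_subsentence_attention_mask(phrase_spans, seq_len):
    """Block-diagonal self-attention mask: tokens of one category phrase attend
    only to tokens of the same phrase, so unrelated category names do not
    interact in the text encoder.

    phrase_spans: list of (start, end) token-index pairs, end exclusive,
                  e.g. [(1, 3), (4, 5)] for a prompt like ". fruit fly . person ."
    Returns a bool mask of shape (seq_len, seq_len); True = attention allowed.
    """
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    for start, end in phrase_spans:
        mask[start:end, start:end] = True
    # Let every token (including separators and special tokens) attend to itself.
    mask |= torch.eye(seq_len, dtype=torch.bool)
    return mask
```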
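And for the third point, a sketch of a token-level sigmoid focal loss in the spirit of the GLIP-style alignment loss referred to above; how the binary targets are built (the matching, and which text tokens count as positives) is assumed here rather than taken from the codebase.

```python
import torch
import torch.nn.functional as F

def token_sigmoid_focal_loss(pred_logits, targets, alpha=0.25, gamma=2.0):
    """Sigmoid focal loss over query-to-token alignment logits.

    pred_logits: (num_queries, num_text_tokens) alignment scores
    targets:     (num_queries, num_text_tokens) float {0, 1} targets; rows for
                 unmatched (negative) queries are all zeros, so those queries
                 simply learn to predict 0 everywhere.
    """
    prob = pred_logits.sigmoid()
    ce = F.binary_cross_entropy_with_logits(pred_logits, targets, reduction="none")
    p_t = prob * targets + (1 - prob) * (1 - targets)
    loss = ce * (1 - p_t) ** gamma
    if alpha >= 0:
        alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
        loss = alpha_t * loss
    return loss.mean()
```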
Thanks for your answer, I'm still looking through your code. I think this work is amazing.
Are the pseudo labels on CC3M and SBU the same as the ones used in GLIP/GLIPv2, or did you generate them yourselves? To my knowledge, Microsoft hasn't released their pseudo labels yet. Will you be releasing them?
Nice work! Regarding your answer to question 2(b): when evaluating on the LVIS dataset, do you mean that you concatenate a subset of the category names to stay within the 256-text-token limit, forward an image sample multiple times with different subsets of category names, and then merge the results into the final detections over the 1000+ LVIS categories? Thanks!
I am not the author, but I think they followed GLIP, which uses the same inference process you described.
Thanks for your questions. We use the GLIP-annotated data for training.
Yes, for LVIS evaluation we follow GLIP. We first run the model multiple times with different category names and then merge the outputs.
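A rough sketch of this chunked inference, assuming a hypothetical `model(image, caption)` call that returns boxes, scores, and per-chunk label indices; the chunk size is illustrative and should be chosen so each prompt stays within the text-token limit.

```python
def detect_over_many_categories(model, image, category_names, chunk_size=80):
    """Run the detector several times, each with a subset of category names
    concatenated into one prompt, then merge the per-chunk detections."""
    all_boxes, all_scores, all_labels = [], [], []
    for start in range(0, len(category_names), chunk_size):
        chunk = category_names[start:start + chunk_size]
        caption = " . ".join(chunk) + " ."
        boxes, scores, labels = model(image, caption)  # labels index into `chunk`
        all_boxes.extend(boxes)
        all_scores.extend(scores)
        all_labels.extend(start + label for label in labels)  # map back to full list
    # In practice one would keep only the top-scoring detections per image afterwards.
    return all_boxes, all_scores, all_labels
```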
Do you know where to download the GLIP-annotated data? I don't think they have released it publicly.
@SlongLiu Very nice work! Regarding your answer to question 3, the loss function is similar to GLIP's. I notice that GLIP assigns the negative category (background) to the last token of the sentence (maybe the [EOS] token). Am I right? Does Grounding DINO use the same strategy?