Difference in visualization results between MMGroundingDINO and GroundingDINO on a single image
I ran detection on the image below with both GroundingDINO and MMGroundingDINO, using the following commands.
GroundingDINO:
python demo/image_demo.py 000000002299.jpg configs/grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_cap4m.py --weights groundingdino_swint_ogc_mmdet-822d7e9d.pth --texts 'person'
MMGroundingDINO:
python demo/image_demo.py 000000002299.jpg configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365.py --weights grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth --texts 'persons'
As the results show, the scores predicted by MMGroundingDINO are generally lower, which leads to some missed detections, whereas GroundingDINO gives relatively higher scores.
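To make clear what I mean by missed detections, here is a minimal sketch of how a score threshold hides lower-scored boxes. The scores are hypothetical placeholders, not the actual outputs of either model, and `visible_detections` is a helper written only for this illustration. The 0.3 default is an assumption based on the `--pred-score-thr` option of the demo script.

```python
def visible_detections(scores, score_thr=0.3):
    """Return the scores that survive a visualization threshold."""
    return [s for s in scores if s >= score_thr]


# Hypothetical scores for the same three boxes from two models.
# These are NOT measured values; they only illustrate the effect.
higher_scores = [0.62, 0.45, 0.35]
lower_scores = [0.41, 0.28, 0.22]

# With the same threshold, the lower-scoring model shows fewer boxes.
print(len(visible_detections(higher_scores)))
print(len(visible_detections(lower_scores)))

# Lowering the threshold brings the hidden boxes back.
print(len(visible_detections(lower_scores, score_thr=0.2)))
```

If this is what is happening, lowering the threshold should recover the missing boxes, although it may also let more false positives through.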
Previously, I trained a prompt-based detector built on the DINO codebase and observed the same problem of low predicted scores. When I printed the DINO baseline's prediction scores on this image, they were already quite low.
Have you encountered a similar issue?
Looking forward to your reply~
Note: This image is from the COCO validation set, with the path: val2017/000000002299.jpg.