Chuofan Ma
Hi there, thanks for your interest in our work. Here are some tips you can follow to fine-tune the model on custom datasets: 1. Format your data. There are various...
Thanks for the feedback. The bug occurs because the program looks for a local DINOv2-L checkpoint to initialize CustomDDETRModel. This is not expected behavior. A quick fix is...
We inherited the mmcv folder from GPT4ROI. I think it originated from mmcv==1.4.7.
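If you also have a separately installed mmcv and want to compare it against the version mentioned above, a minimal sketch along these lines may help. It only inspects pip-installed distributions, not the bundled mmcv folder, and it assumes the distribution is named either `mmcv` or `mmcv-full`; the helper names are hypothetical.

```python
from importlib import metadata

# Version mentioned in the reply above; the author states it tentatively.
EXPECTED_MMCV = "1.4.7"


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_expected(found, expected=EXPECTED_MMCV):
    """True only when a version was found and equals the expected one."""
    return found is not None and found == expected


if __name__ == "__main__":
    # Assumption: a pip-installed mmcv is published as `mmcv` or `mmcv-full`.
    for name in ("mmcv", "mmcv-full"):
        found = installed_version(name)
        if found is None:
            print(f"{name}: not installed")
        elif matches_expected(found):
            print(f"{name}: {found} (matches {EXPECTED_MMCV})")
        else:
            print(f"{name}: {found} (differs from {EXPECTED_MMCV})")
```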
Sorry, that's a typo. It should be `conv_temp='llava'`. Thanks for your feedback.
You can simply use `Locate a person wearing a yellow hat in the image.` or something similar. Just remember to enclose the referring expression in `` and ``.
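As a rough illustration of wrapping the referring expression, here is a small sketch. The function name is hypothetical, and the opening and closing markers are passed in as arguments: use whatever special tokens the model defines. The `[S]` and `[E]` strings in the usage line are stand-ins only, not the real markers.

```python
def build_grounding_prompt(expression, open_marker, close_marker):
    """Wrap a referring expression in the given markers inside a 'Locate' query.

    `open_marker` and `close_marker` must be the special tokens the model
    expects; this helper makes no assumption about what they are.
    """
    cleaned = expression.strip()
    if not cleaned:
        raise ValueError("referring expression must not be empty")
    return f"Locate {open_marker}{cleaned}{close_marker} in the image."


if __name__ == "__main__":
    # "[S]" and "[E]" are placeholder markers for illustration only.
    print(build_grounding_prompt("a person wearing a yellow hat", "[S]", "[E]"))
```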
Yes, this kind of hallucination is probably caused by the training data - for grounding training, we only have positive QA pairs, i.e., the object mentioned in the question is guaranteed to occur...
Thanks for your interest in our work. For referring queries, you do not need to replace `` or `` with other text. Actually, they are special placeholders (as defined in...
Yes, we do not use text guidance for the OV-LVIS setting by default, because we observe text guidance has little impact in this case. As discussed in the paper, COCO-Caption...
Actually, I only got 2.8M images for the CC3M dataset. But the LVIS APr metric typically has high variance. Maybe you can try again to see if the results...
Hi @kinredon, I can't remember the exact number, but training takes roughly 3-5 days on 8 V100s. The transfer experiments are based on the R50 backbone.