
How to finetune the RAM++ using object detection dataset without image caption data

Open jwwangchn opened this issue 1 year ago • 4 comments

I have an object detection dataset that only contains bbox and class annotations. How can I use this dataset to train RAM++?

jwwangchn avatar Nov 22 '23 07:11 jwwangchn

Maybe you can use LLaVA to generate captions for the images in your object detection dataset first, then set the class annotations as the union_tag and use them to generate the tag descriptions. After that you have all the information needed.
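The "set the class annotations as the union_tag" step can be sketched as follows, assuming COCO-style detection annotations (the field names and helper below are illustrative, not part of the RAM++ codebase):

```python
# Hedged sketch: collapse bbox annotations into a deduplicated tag list
# per image, i.e. the union of class names seen in that image.

def boxes_to_tags(annotations, categories):
    """annotations: list of {"image_id": ..., "category_id": ...}
    categories:  {category_id: class_name}
    Returns {image_id: sorted list of class names}."""
    tags = {}
    for ann in annotations:
        tags.setdefault(ann["image_id"], set()).add(categories[ann["category_id"]])
    # sorted lists give a stable per-image tag set (the bbox coordinates
    # themselves are not needed for tagging)
    return {img: sorted(names) for img, names in tags.items()}

# toy example
anns = [
    {"image_id": 1, "category_id": 1},
    {"image_id": 1, "category_id": 1},   # duplicate boxes collapse to one tag
    {"image_id": 1, "category_id": 2},
    {"image_id": 2, "category_id": 2},
]
cats = {1: "person", 2: "dog"}
print(boxes_to_tags(anns, cats))  # {1: ['dog', 'person'], 2: ['dog']}
```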

chaochen1998 avatar Nov 22 '23 09:11 chaochen1998

Inference with LLaVA is time-consuming and may produce hallucinated captions. Compared to LLaVA, I recommend trying Tag2Text, which is efficient and produces detailed captions.

Actually, a simpler approach is to directly finetune RAM++ on image tagging only (without image-text alignment), using your class annotations.
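For tagging-only fine-tuning, the class annotations become a multi-hot target vector over the model's tag vocabulary. A minimal sketch, assuming a plain list of vocabulary strings (RAM++ ships its own tag list; the names and helper here are illustrative):

```python
# Hedged sketch: convert a per-image tag list into the multi-hot target
# that a multi-label tagging loss (e.g. ASL/BCE) expects.

def multi_hot(image_tags, tag_vocab):
    index = {tag: i for i, tag in enumerate(tag_vocab)}
    target = [0.0] * len(tag_vocab)
    for tag in image_tags:
        if tag in index:          # classes outside the vocabulary are skipped
            target[index[tag]] = 1.0
    return target

vocab = ["person", "dog", "car", "tree"]     # illustrative vocabulary
print(multi_hot(["dog", "person"], vocab))   # [1.0, 1.0, 0.0, 0.0]
```

If your detection classes are not all in the RAM++ vocabulary, you would either extend the tag list (open-set tagging via tag descriptions) or map them to the closest existing tags.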

xinyu1205 avatar Nov 22 '23 12:11 xinyu1205

@xinyu1205 Thank you for your reply. Directly finetuning on the image tagging task is exactly what I want. Should I simply set the loss of the generation branch to 0? Can I train on my dataset by just running the finetune.py file?

jwwangchn avatar Nov 23 '23 01:11 jwwangchn

Hi, the RAM_plus model has an image tagging branch and an image-text alignment branch (both within a shared tagging_head); you can use only the image tagging branch.
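One simple way to use only the tagging branch without restructuring the training loop is to weight the other branch's loss to zero. A minimal sketch, with illustrative variable names rather than the repo's actual ones:

```python
# Hedged sketch: combine branch losses so that setting the alignment
# weight to 0.0 disables the image-text alignment branch entirely,
# leaving only the image tagging loss to drive the gradients.

def total_loss(loss_tagging, loss_alignment, alignment_weight=0.0):
    return loss_tagging + alignment_weight * loss_alignment

# tagging-only fine-tuning: alignment term contributes nothing
print(total_loss(0.7, 1.3))       # 0.7
# full training would keep both terms
print(total_loss(0.7, 1.3, 1.0))  # 2.0
```

With autograd frameworks such as PyTorch, a zero weight also stops gradients from flowing through the alignment branch's loss term, so its parameters that are not shared with the tagging head stay effectively untrained.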

xinyu1205 avatar Nov 24 '23 01:11 xinyu1205