recognize-anything
How to finetune RAM++ using an object detection dataset without image caption data
I have an object detection dataset that contains only bbox and class annotations. How can I use this dataset to train RAM++?
Maybe you can use LLaVA to get captions for the images in your object detection dataset first, then set the class annotations as the union_tag and use them to generate the tag descriptions. After that you have all the information you need.
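As a minimal sketch of the union_tag step suggested above: the per-image class annotations from the detection dataset can be deduplicated into one tag list per image. The annotation format below is hypothetical, not the repo's actual schema.

```python
def union_tags(annotations):
    """Collapse per-box class labels into a deduplicated, sorted tag list.

    annotations: list of dicts like {"bbox": [x1, y1, x2, y2], "class": "dog"}
    (illustrative format, not RAM++'s actual annotation schema).
    """
    return sorted({a["class"] for a in annotations})


anns = [
    {"bbox": [0, 0, 10, 10], "class": "dog"},
    {"bbox": [5, 5, 20, 20], "class": "person"},
    {"bbox": [8, 2, 12, 9], "class": "dog"},
]
print(union_tags(anns))  # → ['dog', 'person']
```

Each image's tag list could then serve as the union_tag field when generating tag descriptions.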
LLaVA inference is time-consuming and may generate hallucinated captions. Compared to LLaVA, I recommend trying Tag2Text, which is efficient and produces detailed captions.
Actually, a simpler method is to directly finetune RAM++ only on image tagging (without image-text alignment) using your class annotations.
@xinyu1205 Thank you for your reply. Directly finetuning the image tagging task is exactly what I want. Should I set the loss of the generation branch to 0? Can I train on my dataset by simply running the finetune.py file?
Hi, the RAM_plus model has an image tagging branch and an image-text alignment branch (both within a shared tagging_head); you can use only the image tagging branch.
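A tagging-only finetuning step along these lines could look like the sketch below. This is a toy stand-in, not the repo's actual API: `ToyTagger` and `forward_tagging` are hypothetical, the input is a pre-extracted feature tensor rather than images through RAM++'s vision encoder, and the multi-hot targets are the union of class annotations per image. The alignment/generation losses are simply never computed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyTagger(nn.Module):
    """Stand-in for the image tagging branch: features -> per-tag logits."""

    def __init__(self, feat_dim=16, num_tags=8):
        super().__init__()
        self.head = nn.Linear(feat_dim, num_tags)

    def forward_tagging(self, feats):
        # Real RAM++ would run images through its vision encoder and
        # tagging_head; here a feature tensor stands in.
        return self.head(feats)


def train_step(model, optimizer, feats, tag_targets):
    """One step using only the tagging loss (no image-text alignment).

    tag_targets: multi-hot tensor built from the detection dataset's
    class annotations, one bit per tag in the vocabulary.
    """
    logits = model.forward_tagging(feats)
    loss = F.binary_cross_entropy_with_logits(logits, tag_targets.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The key point is that the objective is a standard multi-label binary cross-entropy over the tag vocabulary, so no caption data is required.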