Clarification on Precision Error Bars in POPE Experiments

Open: lowestbuaaer opened this issue 1 year ago • 1 comment

I noticed that the precision error bars seem significantly large in the POPE experiments in your paper. Could you provide some insight into why the error bars are so pronounced? Thank you for your clarification, and I appreciate your work. @BillChan226

Oct 07 '24 13:10 lowestbuaaer

Hi, thanks for your interest in HALC, and we really appreciate this insightful question!

First, we would like to point out that the results in Table 2 are for the OPOPE (offline POPE) benchmark, not the original POPE. We propose OPOPE to address two limitations of CHAIR and POPE: (1) CHAIR fails to evaluate false negatives, and (2) POPE only asks yes/no questions, so reasoning-required or post-correction-based methods have no opportunity to correct anything when they simply answer yes or no.

However, the OPOPE benchmark has its own issues, including the large error bars you mentioned. While HALC shows a seemingly large deviation, all the other methods show even larger ones. Thus, rather than attributing the deviation to HALC itself, we believe it originates from the OPOPE benchmark. Unlike POPE, which is a simple VQA task that directly queries the targeted object, OPOPE caters specifically to post-correction decoding methods (e.g. LURE, Woodpecker, HALC) by evaluating the generated caption in an offline, untargeted manner.

Because of this untargeted nature, the false positive and false negative statistics can be unstable, which results in a large deviation in precision. For example, OPOPE counts an object as a hallucination if it is not included in the ground-truth object list given by COCO, yet that list covers only a subset of the objects that actually exist in the image. This phenomenon is consistent across all decoding methods, and HALC still outperforms the other methods with the smallest deviation for both accuracy and precision, which demonstrates its robustness and effectiveness.
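To make the counting effect described above concrete, here is a minimal Python sketch. It is not the actual OPOPE implementation: the function names, the object lists, and the simplified set-based matching are all assumptions made for illustration. It only shows how an incomplete annotation list can turn correctly described objects into false positives and pull precision down.

```python
def count_tp_fp(mentioned, annotated):
    """Count true and false positives for one caption (hypothetical helper).

    An object mentioned in the caption is a true positive if it appears in
    the annotated list, and a false positive otherwise.
    """
    mentioned, annotated = set(mentioned), set(annotated)
    tp = len(mentioned & annotated)
    fp = len(mentioned - annotated)
    return tp, fp


def precision(tp, fp):
    """Precision = TP / (TP + FP); defined as 0.0 when nothing is mentioned."""
    total = tp + fp
    return tp / total if total else 0.0


# Made-up example: all four objects are assumed to be really in the image.
caption_objects = ["dog", "frisbee", "tree", "fence"]

# The annotation list covers only a subset of what is in the image.
partial_annotations = ["dog", "frisbee"]
# A hypothetical exhaustive annotation of the same image.
full_annotations = ["dog", "frisbee", "tree", "fence"]

p_partial = precision(*count_tp_fp(caption_objects, partial_annotations))
p_full = precision(*count_tp_fp(caption_objects, full_annotations))

print(p_partial)  # 0.5
print(p_full)     # 1.0
```

In this toy case the same caption scores 0.5 or 1.0 depending only on how complete the annotation list is, so per-image precision can swing widely across images whose annotations differ in coverage.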

Oct 08 '24 06:10 BillChan226