Inquiry Regarding CIDEr Scores on Flickr30k Evaluation
Thank you very much for this outstanding work! I recently used the adversarially fine-tuned ViT-L/14 CLIP model ($FARE^4$) that you provide as the vision_encoder_pretrained model and ran an evaluation with the apgd attack on the Flickr30k dataset using llava_eval.sh.
However, the reported CIDEr score differs significantly from the results presented in Table 1. This discrepancy has me somewhat puzzled, and I would appreciate any insight into which factors might contribute to it.
Looking forward to your response. Thank you for your time and assistance!
Hi,
What score do you get in the clean evaluation? And could you please share the full output?
Also note that the CIDEr score that you marked in the image above is before multiplication by 100, so it corresponds to 43.6. It is probably higher than reported in the paper, as you use --attack apgd instead of --attack ensemble. The latter is a much stronger attack.
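For context, the CIDEr implementations typically used for such evaluations (e.g. pycocoevalcap) return the score on its raw scale, and tables in papers usually report it multiplied by 100. Below is a minimal toy sketch of that convention only, with made-up captions; it is not the evaluation code of this repo:

```python
# Toy illustration of the x100 reporting convention (assumes pycocoevalcap is installed).
# Captions are made up; the real evaluation uses the full Flickr30k reference set
# and proper tokenization.
from pycocoevalcap.cider.cider import Cider

gts = {  # reference captions per image id (already lowercased/tokenized)
    "1": ["a dog runs across the grass", "a brown dog running outside"],
    "2": ["a man rides a bicycle down the street"],
}
res = {  # exactly one generated caption per image id
    "1": ["a dog running on grass"],
    "2": ["a person on a bike"],
}

raw_cider, per_image = Cider().compute_score(gts, res)
print(raw_cider)        # raw scale; your output shows a value like 0.436 here
print(100 * raw_cider)  # the scale used in the paper's tables (0.436 -> 43.6)
```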
Clean evaluation (--attack none):
In the paper, you run APGD attacks to evaluate the robustness of large vision-language models built on different CLIP models and report the CIDEr score for captioning.
However, the CIDEr score I obtain still differs significantly from the results presented in Table 1, which is what puzzles me.
I think ensemble runs the apgd attack at different precisions, along with some other changes, as mentioned in the experiments section. This is a much stronger attack, which is why the score drops significantly, as reported in Table 1. You can also check the codebase to see the difference between the apgd and ensemble attacks.
Indeed, --attack ensemble activates the apgd attack pipeline described in the screenshot from the paper above. This is a much stronger attack than a single apgd run (which is activated with --attack apgd).
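To make the difference concrete, here is a simplified sketch of the general idea of taking the worst case over several attack configurations. It is only an illustration, not the implementation in this codebase; `run_apgd`, `score_fn`, and `configs` are placeholders:

```python
# Simplified sketch: evaluate several attack configurations and keep the worst case.
# Not the actual --attack ensemble implementation; run_apgd and score_fn are stand-ins.
from typing import Callable, Sequence
import torch

def worst_case_attack(
    image: torch.Tensor,
    run_apgd: Callable[..., torch.Tensor],      # one APGD run with the given settings
    score_fn: Callable[[torch.Tensor], float],  # e.g. CIDEr of the caption generated for the image
    configs: Sequence[dict],                    # e.g. different precisions, losses, iteration budgets
) -> torch.Tensor:
    """Return the adversarial image with the lowest score across all configurations."""
    best_adv, best_score = image, score_fn(image)
    for cfg in configs:
        adv = run_apgd(image, **cfg)
        score = score_fn(adv)
        if score < best_score:  # lower score = stronger adversarial example
            best_adv, best_score = adv, score
    return best_adv
```

Roughly speaking, a single --attack apgd run corresponds to one such configuration, so taking the worst case over several of them can only keep the score the same or push it further down.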