Difference between paper and implementation in GradCAM calculation

Open dengmengjie opened this issue 10 months ago • 0 comments

Hi, thank you for your wonderful work.

I've noticed that in the paper, the relevance scores between image patches and tokens are calculated with a min function that sets the positive gradient values to 0, leaving only the negative values. The stated reason for doing so is:

Inspired by GradCAM, we filter out uninformative attention scores by multiplication with the gradient which could cause an increase in the image-text similarity.

But in your code implementation, clamp(0) is applied to the gradients, which assigns 0 to the negative values and keeps the positive ones. Isn't that actually a max function instead of min?

    grads = (
        grads[:, :, :, 1:].clamp(0).reshape(visual_input.size(0), 12, -1, 24, 24)
        * mask
    )
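To make the question concrete, here is a minimal sketch of the two behaviours, assuming PyTorch and a toy 1-D tensor of made-up gradient values (nothing below is taken from the repository). It only shows what clamp(0) does compared with a min-based filter; it does not say which one the authors intended.

```python
import torch

# Toy gradient values (hypothetical, not taken from the repository).
grads = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

# Tensor.clamp(0) passes 0 as `min`, i.e. a lower bound: elementwise max(g, 0).
clamp_lower = grads.clamp(0)

# The same operation written as an explicit max, and as ReLU.
explicit_max = torch.maximum(grads, torch.zeros_like(grads))
relu = torch.relu(grads)

# A min-based filter as described above: elementwise min(g, 0).
clamp_upper = grads.clamp(max=0)

print(clamp_lower)  # negatives zeroed, positives kept
print(clamp_upper)  # positives zeroed, negatives kept
```

If this sketch is right, clamp(0) keeps only the positive gradients, which is the opposite of what a min with 0 would keep.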

Could anyone provide an explanation? Thanks a lot!

Feb 25 '25 09:02 dengmengjie