Types of gradients computed by GradNormAttack
Hello, and thanks for your valuable work on mimir!
If I understand correctly, GradNormAttack computes the average (across layers) of the gradient norm w.r.t. the model weights.
https://github.com/iamgroot42/mimir/blob/6c611099c669a5a6fd504e55178ef98733f8bb6d/mimir/attacks/gradnorm.py#L41
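For concreteness, here is a minimal sketch of that computation with a toy model standing in for the target LM (the function name and the placeholder loss are illustrative, not mimir's actual API):

```python
import torch
import torch.nn as nn

def gradnorm_weights(model: nn.Module, loss: torch.Tensor, p: float = 2.0) -> float:
    """Average, over parameter tensors, of the p-norm of the gradient
    of `loss` w.r.t. the model weights."""
    params = [w for w in model.parameters() if w.requires_grad]
    grads = torch.autograd.grad(loss, params)
    return torch.stack([g.norm(p=p) for g in grads]).mean().item()

# Toy stand-in for the target LM: two linear "layers".
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
x = torch.randn(3, 8)
loss = model(x).square().mean()  # placeholder for the LM loss on a candidate sample
score = gradnorm_weights(model, loss)
```

A lower score would then be taken as evidence of membership, as in the paper.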
But the docstring indicates that the gradients are computed w.r.t. input tokens.
https://github.com/iamgroot42/mimir/blob/6c611099c669a5a6fd504e55178ef98733f8bb6d/mimir/attacks/gradnorm.py#L18
Since the original paper proposes both, I think there are two solutions:
- Simply fix the docstring and keep the current implementation.
- Or implement both gradient norms. I guess that computing gradients w.r.t. input tokens would require modifying `Model.get_probabilities()`.
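Since tokens are discrete, I imagine the input-side variant would take the gradient at the embedding layer rather than at the token IDs themselves. A rough sketch with a toy model (all names and the placeholder loss are illustrative):

```python
import torch
import torch.nn as nn

def gradnorm_inputs(embed: nn.Embedding, body: nn.Module,
                    token_ids: torch.Tensor, p: float = 2.0) -> float:
    """Norm of the gradient of the loss w.r.t. the input embeddings.

    Token IDs are discrete, so the gradient is taken at the output of
    the embedding layer instead."""
    emb = embed(token_ids).detach().requires_grad_(True)
    loss = body(emb).square().mean()  # placeholder for the LM loss
    (grad,) = torch.autograd.grad(loss, emb)
    return grad.norm(p=p).item()

# Toy stand-in: an embedding table plus a linear "body".
embed = nn.Embedding(100, 8)
body = nn.Linear(8, 4)
score = gradnorm_inputs(embed, body, torch.tensor([[1, 2, 3]]))
```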
The results in Appendix C.1 suggest that which gradient type performs better depends on the setting: in some settings one outperforms the other, and in others the reverse holds.
What do you think?
Hey @Framartin,
The gradnorm attack is under construction (I should have mentioned it somewhere; my bad!). We started working on it thinking it would be a nice addition, so we pasted some placeholder code and docstrings (hence the mix-up).
Gradients with respect to input tokens would indeed require modification. A good solution would be to fix the docstring for now and add the other variant as a TODO (we can pick it up later when we get the time, but you're more than welcome to submit a PR if you want).
Gradient-norm attacks can be tricky for the very reason you mentioned; apart from this behavior (one variant may work better than the other), the choice of parameters (e.g. which layer's parameters to use) could also have some impact. Perhaps a simple addition strategy (taking gradient norms for both weights and input tokens) could help?
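Something like this toy sketch, where both norms fall out of a single backward pass and are simply added (the equal weighting and all names are assumptions for illustration, not a tested recipe):

```python
import torch
import torch.nn as nn

# Toy stand-in for the LM: an embedding table plus a linear "body".
embed = nn.Embedding(100, 8)
body = nn.Linear(8, 4)

token_ids = torch.tensor([[1, 2, 3]])
emb = embed(token_ids).detach().requires_grad_(True)
loss = body(emb).square().mean()  # placeholder for the LM loss
loss.backward()

# Weight-side signal: mean parameter-gradient norm of the body.
weight_score = torch.stack([p.grad.norm() for p in body.parameters()]).mean().item()
# Input-side signal: gradient norm at the embeddings.
input_score = emb.grad.norm().item()

# "Simple addition" combination; equal weighting is an assumption.
combined = weight_score + input_score
```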
Fixed the docstring and closing this issue for now. We might add a token-based gradient attack in a future version, but please feel free to submit a PR in the meanwhile if you have a working implementation!