Types of gradients computed by GradNormAttack
Hello, and thanks for your valuable work on mimir!
If I understand correctly, GradNormAttack computes the average (across layers) of the gradient norm w.r.t. the model weights.
https://github.com/iamgroot42/mimir/blob/6c611099c669a5a6fd504e55178ef98733f8bb6d/mimir/attacks/gradnorm.py#L41
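For concreteness, here is a minimal sketch of that computation with a toy model standing in for the target LM (the function name and the placeholder loss are illustrative, not mimir's actual API):

```python
import torch
import torch.nn as nn

def gradnorm_weights(model: nn.Module, loss: torch.Tensor, p: float = 2.0) -> float:
    """Average, over parameter tensors, of the p-norm of the gradient
    of `loss` w.r.t. the model weights."""
    params = [w for w in model.parameters() if w.requires_grad]
    grads = torch.autograd.grad(loss, params)
    return torch.stack([g.norm(p=p) for g in grads]).mean().item()

# Toy stand-in for the target LM: two linear "layers".
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
x = torch.randn(3, 8)
loss = model(x).square().mean()  # placeholder for the LM loss on a candidate sample
score = gradnorm_weights(model, loss)
```

A lower score would then be taken as evidence of membership, as in the paper.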
But the docstring indicates that the gradients are computed w.r.t. input tokens.
https://github.com/iamgroot42/mimir/blob/6c611099c669a5a6fd504e55178ef98733f8bb6d/mimir/attacks/gradnorm.py#L18
Since the original paper proposes both, I think there are two solutions:
- Simply fix the docstring and keep the current implementation.
- Or implement both gradient norms. I guess that computing gradients w.r.t. input tokens would require modifying `Model.get_probabilities()`.
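Since tokens are discrete, I imagine the input-side variant would take the gradient at the embedding layer rather than at the token IDs themselves. A rough sketch with a toy model (all names and the placeholder loss are illustrative):

```python
import torch
import torch.nn as nn

def gradnorm_inputs(embed: nn.Embedding, body: nn.Module,
                    token_ids: torch.Tensor, p: float = 2.0) -> float:
    """Norm of the gradient of the loss w.r.t. the input embeddings.

    Token IDs are discrete, so the gradient is taken at the output of
    the embedding layer instead."""
    emb = embed(token_ids).detach().requires_grad_(True)
    loss = body(emb).square().mean()  # placeholder for the LM loss
    (grad,) = torch.autograd.grad(loss, emb)
    return grad.norm(p=p).item()

# Toy stand-in: an embedding table plus a linear "body".
embed = nn.Embedding(100, 8)
body = nn.Linear(8, 4)
score = gradnorm_inputs(embed, body, torch.tensor([[1, 2, 3]]))
```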
The results in Appendix C.1 suggest that which gradient type performs better depends on the setting: in some settings one outperforms the other, and in others the reverse holds.
What do you think?
Hey @Framartin,
The gradnorm attack is under construction (I should have mentioned it somewhere; my bad!). We started working on it thinking it would be a nice addition, so we pasted some placeholder code and docstrings (hence the mix-up).
Gradients with respect to input tokens would indeed require modification. A good solution would be to fix the docstring for now and add the other variant as a TODO (we can pick it up later when we get the time, but you're more than welcome to submit a PR if you want).
Gradient-norm attacks can be tricky for the very reason you mentioned; apart from this behavior (one variant may work better than the other), the choice of parameters (e.g. which layer's parameters to use) could also have some impact. Perhaps a simple addition strategy (taking gradient norms for both weights and input tokens) could help?
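Something like this toy sketch, where both norms fall out of a single backward pass and are simply added (the equal weighting and all names are assumptions for illustration, not a tested recipe):

```python
import torch
import torch.nn as nn

# Toy stand-in for the LM: an embedding table plus a linear "body".
embed = nn.Embedding(100, 8)
body = nn.Linear(8, 4)

token_ids = torch.tensor([[1, 2, 3]])
emb = embed(token_ids).detach().requires_grad_(True)
loss = body(emb).square().mean()  # placeholder for the LM loss
loss.backward()

# Weight-side signal: mean parameter-gradient norm of the body.
weight_score = torch.stack([p.grad.norm() for p in body.parameters()]).mean().item()
# Input-side signal: gradient norm at the embeddings.
input_score = emb.grad.norm().item()

# "Simple addition" combination; equal weighting is an assumption.
combined = weight_score + input_score
```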
Fixed the docstring and closing this issue for now. We might add a token-based gradient attack in a future version, but please feel free to submit a PR in the meanwhile if you have a working implementation!