
Attention supervision for multiple heads: average or summation?

Open · lucasresck opened this issue 1 year ago · 0 comments

Dear authors,

The paper states that the final attention-supervision loss is the average of the cross-entropy losses of the attention weights over the attention heads. However, in https://github.com/hate-alert/HateXplain/blob/01d742279dac941981f53806154481c0e15ee686/Models/bertModels.py#L57 it does not seem to be an average: the per-head losses are summed, and there is no division by the number of heads.
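For reference, these are the two readings I have in mind, writing $H$ for the number of attention heads and $\mathcal{L}_h^{\text{att}}$ for the cross-entropy loss of head $h$ (the symbols are mine, not from the paper):

$$
\mathcal{L}_{\text{att}}^{\text{mean}} = \frac{1}{H}\sum_{h=1}^{H} \mathcal{L}_h^{\text{att}}
\qquad \text{vs.} \qquad
\mathcal{L}_{\text{att}}^{\text{sum}} = \sum_{h=1}^{H} \mathcal{L}_h^{\text{att}}
$$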

I am concerned about this detail because of the $\lambda$ hyperparameter. If one implements the loss as an average (as the paper describes), the effective $\lambda$ is divided by the number of heads, e.g., 12, relative to the summed version, which may affect the reproducibility of the hyperparameters reported in the paper. A small sketch of this is below.
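To make the concern concrete, here is a minimal sketch (not the repository's code; `attention_loss_per_head`, `num_heads`, and `lam` are hypothetical names) of how the two reductions relate:

```python
import torch

# Hypothetical per-head attention-supervision losses for a 12-head model.
num_heads = 12
attention_loss_per_head = torch.rand(num_heads)  # stand-in for per-head cross entropy
lam = 100.0  # stand-in for the lambda hyperparameter

# Reduction as described in the paper: average over heads.
loss_mean = lam * attention_loss_per_head.mean()

# Reduction as it appears in bertModels.py: summation over heads.
loss_sum = lam * attention_loss_per_head.sum()

# The summed loss is num_heads times larger, so a lambda tuned against the
# summed version corresponds to lambda * num_heads for the averaged version.
assert torch.allclose(loss_sum, num_heads * loss_mean)
```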

Did I get it right? I would appreciate any clarification on this matter.

Thank you very much! 😊

lucasresck · May 25 '23 22:05