bert_document_classification
Micro/Macro F1 score calculation is over-optimistic
Hi!
Thanks for this neat tool! I've found it quite useful. However, I noticed that the micro- and macro-averaged F1 scores seem to be calculated incorrectly in the evaluation script and may be giving overly optimistic numbers:
micro_f1 = f1_score(correct_output.reshape(-1).numpy(), predictions.reshape(-1).numpy(), average='micro')
Jump to code: https://github.com/AndriyMulyar/bert_document_classification/blob/060e9034a8c41bfb34b8762c8e1612321015c076/bert_document_classification/document_bert.py#L265
f1_score supports "1d array-like, or label indicator array / sparse matrix". As I understand it, the code above flattens the matrices as if this were a multi-class setting, whereas in a multi-label setting the original matrix should be passed unflattened, just transposed:
micro_f1 = f1_score(correct_output.T.numpy(), predictions.T.numpy(), average='micro')
With the data I'm testing on, I got a micro F1 of 0.88 with your version and 0.35 with mine. I was able to verify with other software that the latter is correct.
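A small, self-contained example (with made-up toy labels) illustrates the difference: when the matrices are flattened, every matched zero counts as a correct prediction, which inflates the micro F1, whereas with the label-indicator matrices only positive labels contribute.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy multi-label data: 3 documents, 4 labels, mostly zeros (sparse labels).
y_true = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [0, 0, 0, 0]])
y_pred = np.array([[0, 0, 0, 0],
                   [0, 1, 0, 0],
                   [0, 0, 1, 0]])

# Flattened (multi-class style): the 9 matched zeros all count as correct,
# so micro F1 reduces to plain accuracy, 10/12 ~= 0.83.
flat = f1_score(y_true.reshape(-1), y_pred.reshape(-1), average='micro')

# Label-indicator matrices (multi-label style): TP=1, FP=1, FN=1, so F1 = 0.5.
multilabel = f1_score(y_true, y_pred, average='micro')

print(flat, multilabel)
```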
You may want to check the numbers in the paper (https://arxiv.org/pdf/1910.13664.pdf) as well.
Kind regards, Samuel
Samuel,
Thank you for pointing this out. In the original datasets considered for evaluation, performance was calculated in the multi-class manner. I will be pushing out an update that addresses this (offering an option between the two), alongside a fix for another implementation bug in this public code release.
Thanks! Andriy
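As a sketch of what such an option could look like (the helper name and the `multilabel` flag here are illustrative, not the repository's actual API):

```python
import numpy as np
from sklearn.metrics import f1_score

def evaluation_f1(correct_output, predictions, multilabel=True, average='micro'):
    """Hypothetical helper: score either the label-indicator matrices
    (multi-label) or the flattened arrays (multi-class)."""
    y_true = np.asarray(correct_output)
    y_pred = np.asarray(predictions)
    if multilabel:
        # Only positive labels contribute to TP/FP/FN.
        return f1_score(y_true, y_pred, average=average)
    # Every entry is scored, including matched zeros.
    return f1_score(y_true.reshape(-1), y_pred.reshape(-1), average=average)
```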