bert_document_classification
Micro/Macro F1 score calculation is over-optimistic
Hi!
Thanks for this neat tool! I've found it quite useful. However, I noticed that the micro- and macro-averaged F1 scores seem to be calculated incorrectly in the evaluation script and may be giving overly optimistic numbers:
micro_f1 = f1_score(correct_output.reshape(-1).numpy(), predictions.reshape(-1).numpy(), average='micro')
Jump to code: https://github.com/AndriyMulyar/bert_document_classification/blob/060e9034a8c41bfb34b8762c8e1612321015c076/bert_document_classification/document_bert.py#L265
f1_score supports "1d array-like, or label indicator array / sparse matrix". As I understand it, the code above flattens the matrices as if this were a multi-class setting, whereas in a multi-label setting the original matrix should be passed unflattened, just transposed:
micro_f1 = f1_score(correct_output.T.numpy(), predictions.T.numpy(), average='micro')
With the data I'm testing on, I got a micro F1 of 0.88 with your version and 0.35 with mine. I was able to verify with other software that the latter is correct.
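A small, self-contained example (with made-up toy labels) illustrates the difference: when the matrices are flattened, every matched zero counts as a correct prediction, which inflates the micro F1, whereas with the label-indicator matrices only positive labels contribute.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy multi-label data: 3 documents, 4 labels, mostly zeros (sparse labels).
y_true = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [0, 0, 0, 0]])
y_pred = np.array([[0, 0, 0, 0],
                   [0, 1, 0, 0],
                   [0, 0, 1, 0]])

# Flattened (multi-class style): the 9 matched zeros all count as correct,
# so micro F1 reduces to plain accuracy, 10/12 ~= 0.83.
flat = f1_score(y_true.reshape(-1), y_pred.reshape(-1), average='micro')

# Label-indicator matrices (multi-label style): TP=1, FP=1, FN=1, so F1 = 0.5.
multilabel = f1_score(y_true, y_pred, average='micro')

print(flat, multilabel)
```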
You may want to check the numbers in the paper (https://arxiv.org/pdf/1910.13664.pdf) as well.
Kind regards, Samuel
Samuel,
Thank you for pointing this out. In the original datasets considered for evaluation, performance was calculated in the multi-class manner. I will be pushing out an update that addresses this (offering an option between the two), alongside a fix for another implementation bug in this public code release.
Thanks! Andriy
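As a sketch of what such an option could look like (the helper name and the `multilabel` flag here are illustrative, not the repository's actual API):

```python
import numpy as np
from sklearn.metrics import f1_score

def evaluation_f1(correct_output, predictions, multilabel=True, average='micro'):
    """Hypothetical helper: score either the label-indicator matrices
    (multi-label) or the flattened arrays (multi-class)."""
    y_true = np.asarray(correct_output)
    y_pred = np.asarray(predictions)
    if multilabel:
        # Only positive labels contribute to TP/FP/FN.
        return f1_score(y_true, y_pred, average=average)
    # Every entry is scored, including matched zeros.
    return f1_score(y_true.reshape(-1), y_pred.reshape(-1), average=average)
```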