Reproducibility issue with ChemProt results for relation extraction using SciBERT
Hello, I'm currently in the midst of replicating the results for the relation extraction task on the ChemProt dataset using SciBERT, but so far I have been unable to reach the F1 score reported in your paper. Using the hyperparameters described in the paper, I get an F1 score of 0.51 on the test set, as reported by the script provided in your codebase. Please advise whether any further hyperparameter tuning or other tweaking is required to reproduce the F1 score from the paper. Thanks in advance!
Hello, I encountered the same issue with the F1 score. @pg427 Did you manage to fix it?
@laleye No. I'm still awaiting a reply from the authors about this. I tried hyperparameter tuning but this is the maximum score I've gotten so far.
Sorry for missing this. I believe the issue you're seeing is a metric mismatch. For the ChemProt result, the standard metric is micro-F1 (which, in this single-label setting, is computationally equivalent to accuracy), not macro-F1. In our experiments, our macro-F1 was also around 0.5, while micro-F1 is the number reported in the paper. We note this in the Table 1 caption:
Keeping with past work, we report macro F1 scores for NER (span-level), macro F1 scores for REL and CLS (sentence-level), and micro F1 scores for PICO and ChemProt (token-level).
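In case it's useful for checking your own evaluation script, here's a minimal sketch (not the evaluation code from this repo) of how the two averages diverge, using scikit-learn's `f1_score`; the gold/predicted labels below are made-up illustrative examples, not real ChemProt outputs:

```python
# Minimal illustration of micro- vs macro-F1 on single-label multi-class data.
# Labels are hypothetical examples for illustration only.
from sklearn.metrics import accuracy_score, f1_score

gold = ["CPR:3", "CPR:4", "CPR:4", "CPR:9", "false", "false", "false"]
pred = ["CPR:3", "CPR:4", "false", "CPR:4", "false", "false", "false"]

# Micro-F1 pools all decisions, so for single-label multi-class
# classification it equals plain accuracy.
print("micro-F1:", f1_score(gold, pred, average="micro"))
print("accuracy:", accuracy_score(gold, pred))

# Macro-F1 averages per-class F1, so rare or poorly predicted classes
# pull the score down, which is why it can sit well below micro-F1.
print("macro-F1:", f1_score(gold, pred, average="macro"))
```

If your script is reporting the macro average, switching it to the micro average should bring the number much closer to the paper's ChemProt result.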
See this screenshot of some experimental results (right-most column is the macro-F1 result you're seeing):
Hope that helps!