All settings the same, but using "make_label_dictionary" vs. "make_tag_dictionary" to create the "tag_dictionary" for an NER model gives completely different results
All other settings are the same (same data, same code, etc.), but using "make_label_dictionary" instead of "make_tag_dictionary" to create the "tag_dictionary" for an NER model gives completely different results.
I am training a Swedish NER model. I first trained it using "make_tag_dictionary" and got the results, but then noticed that "make_tag_dictionary" will be deprecated and that "make_label_dictionary" is recommended instead, so I switched to "make_label_dictionary". But the two produce completely different results. The code and output are below. Is this a bug, or am I misunderstanding something? Thanks!
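In outline, the training script looks like this (a simplified sketch: the data folder, column map, embedding model, and hyperparameters are placeholders rather than my exact setup):

```python
from flair.datasets import ColumnCorpus
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# placeholder data folder and CoNLL-style column format
corpus = ColumnCorpus("data/", {0: "text", 1: "ner"})

# old way (to be deprecated):
# tag_dictionary = corpus.make_tag_dictionary(tag_type="ner")
# new recommended way -- the only line changed between the two runs:
tag_dictionary = corpus.make_label_dictionary(label_type="ner")

tagger = SequenceTagger(
    hidden_size=256,
    embeddings=TransformerWordEmbeddings("KB/bert-base-swedish-cased"),  # placeholder embedding
    tag_dictionary=tag_dictionary,
    tag_type="ner",
)
ModelTrainer(tagger, corpus).train("resources/taggers/swedish-ner", max_epochs=10)
```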
With "make_tag_dictionary":
Results:
- F-score (micro) 0.4605
- F-score (macro) 0.4535
- Accuracy 0.3011
By class:
```
               precision    recall  f1-score   support
B                 0.5789    0.3894    0.4656       999
I                 0.6849    0.3257    0.4415       307
micro avg         0.5978    0.3744    0.4605      1306
macro avg         0.6319    0.3576    0.4535      1306
weighted avg      0.6038    0.3744    0.4599      1306
samples avg       0.3011    0.3011    0.3011      1306
```
With "make_label_dictionary":
Results:
- F-score (micro) 0.0302
- F-score (macro) 0.2986
- Accuracy 0.0156
By class:
```
               precision    recall  f1-score   support
<unk>             0.0000    0.0000    0.0000         0
B                 0.4382    0.4965    0.4655       999
I                 0.5000    0.3779    0.4304       307
micro avg         0.0156    0.4686    0.0302      1306
macro avg         0.3127    0.2914    0.2986      1306
weighted avg      0.4527    0.4686    0.4573      1306
```
Hello @lixiu911, thanks for reporting this - it looks to be the same problem as in #2761, reported by @NicolasVautier.
Could one of you provide some minimal sample data plus a script so we can reproduce the issue? On my datasets it does not appear.
Thanks for your reply, Alan. How much data do you need? My data is commercial, so if you want to test it I would have to send you some samples privately rather than publicly. (Swedish; 26,149 sentences in total, of which 25% are annotated with at least one named entity; split 70%/20%/10% into training, validation, and test.)
Thanks @lixiu911 - even a very small amount of data would be enough if we can reproduce the error with it, or different public data that shows the error. We just need some way to reproduce it so we can debug.
@alanakbik - Hi Alan.
Sample data - I have crafted some for replicating the issue; I just need to know how to get it to you.
Tag format - I suspect the issue is the tag format. In my own generated training data, I was inverting the entity class with the token position, e.g. LOC-B, LOC-I. In Flair 0.10 and earlier this had no impact, but I have noticed that Flair 0.11.3 now infers the tag set even when some parts of it do not exist in the data - that is, it makes a stricter assumption of either IOB (1 or 2) or BIOES. I have not been able to trace it in the code yet, but I suspect you are converting everything to BIOES. For example, when training NER on CoNLL-03 from scratch, the "E-" tags do NOT exist in the data, but the SequenceTagger still lists them in the predicted set, e.g.: SequenceTagger predicts: Dictionary with 19 tags: O, S-LOC, B-LOC, E-LOC, I-LOC, S-PER, B-PER, E-PER, I-PER, S-ORG, B-ORG, E-ORG, I-ORG, S-MISC, B-MISC, E-MISC, I-MISC, <START>, <STOP>
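A minimal sketch that reproduces the log line quoted above (assuming the built-in CONLL_03 loader and GloVe embeddings, not my actual generated data):

```python
from flair.datasets import CONLL_03
from flair.embeddings import WordEmbeddings
from flair.models import SequenceTagger

# the CONLL_03 loader expects the (licensed) dataset files to be available locally
corpus = CONLL_03()
tag_dictionary = corpus.make_label_dictionary(label_type="ner")

tagger = SequenceTagger(
    hidden_size=256,
    embeddings=WordEmbeddings("glove"),
    tag_dictionary=tag_dictionary,
    tag_type="ner",
)
# on 0.11.x, initialization logs the inferred tag set, including E-/S- tags
# even though CoNLL-03 itself is annotated in IOB:
#   SequenceTagger predicts: Dictionary with 19 tags: O, S-LOC, B-LOC, E-LOC, ...
```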
Related Questions:
- Tag Format - is there any performance improvement in using BIOES over IOB2?
- <unk>, <START>, <STOP> tags - these now appear by default. What is their relevance in NER when we already have 'O'? Should we remove them if they do not occur in our tagged data?
- multi_label & span_labels - these default to true; should we be setting them to false?
Hello @i4meta - yes, the new version of the SequenceTagger will now always infer the full BIO/BIOES tag set for any tag in the training data. Regarding the data: best via mail, or paste a temporary download link. To your questions:
- Some works have shown that BIOES slightly outperforms IOB2, and this is consistent with my observations.
- START and STOP are used only by the CRF layer; the UNK tag is for tags that do not appear in the training data, but we are currently debating taking it out again. We just merged a PR that removes UNK for sequence tagging (#2839) - it could be that this PR already fixes your issue (a quick way to check your dictionary is sketched after this list).
- multi_label should only be True if one data point can have more than one label (rare in NER). span_labels is True for NER.
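For example, you can inspect the dictionary before training to see whether <unk> was added (a sketch; the add_unk flag is an assumption about newer releases, so verify it exists in your installed version):

```python
from flair.datasets import ColumnCorpus

# placeholder corpus; substitute your own data folder and column map
corpus = ColumnCorpus("data/", {0: "text", 1: "ner"})

# print the items in the label dictionary and look for '<unk>'
label_dict = corpus.make_label_dictionary(label_type="ner")
print(label_dict.get_items())

# on releases that expose the flag, the dictionary can be built without UNK:
# label_dict = corpus.make_label_dictionary(label_type="ner", add_unk=False)
```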
Hi @alanakbik,
Thanks for the answers - some follow-on questions below:
1. START/STOP tags - You mentioned that the <START> and <STOP> tags are used by the CRF layer only. Should we be adding them, or does Flair inject them as needed?
2. Sample data - the sample data I have is CoNLL-03 converted to have inverted IOB tags (position after entity type, e.g. LOC-B, LOC-I). It helped me understand my issue, but may not be so useful for you ;) . However, if you confirm you still want it, I'll hit the send button =) .
3. Detected NER tag format - just as food for thought, it might be prudent to log, as part of processing, the detected NER tag format and what it is being converted to... at the very least this could help avoid some future tickets by informing the user.
4. BIOES slightly outperforms IOB2 - can you share any links to works that investigated this?
About 4: there is this and, more generally, this.
About 1: this is done by the tagger itself; you don't need to add them.
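In other words, training data stays in plain column format with only the O/B-/I- (or BIOES) tags; a generic example (not data from this thread):

```
George      B-PER
Washington  I-PER
went        O
to          O
Washington  B-LOC
```

Flair's CRF handles the <START>/<STOP> transitions internally, so they never appear in the data files.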
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.