All settings the same, but using "make_label_dictionary" vs. "make_tag_dictionary" to create the "tag_dictionary" for an NER model gives completely different results
All other settings are the same (same data, same code, etc.), but using "make_label_dictionary" instead of "make_tag_dictionary" to create the "tag_dictionary" for an NER model gives completely different results.
I am training a Swedish NER model. I first trained it using "make_tag_dictionary" and got the results, but then noticed that "make_tag_dictionary" will be deprecated and that "make_label_dictionary" is recommended instead, so I switched to "make_label_dictionary". But the two produce completely different results. The code and output are below. Is this a bug, or am I misunderstanding something? Thanks!
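In outline, the training script looks like this (a simplified sketch: the data folder, column map, embedding model, and hyperparameters are placeholders rather than my exact setup):

```python
from flair.datasets import ColumnCorpus
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# placeholder data folder and CoNLL-style column format
corpus = ColumnCorpus("data/", {0: "text", 1: "ner"})

# old way (to be deprecated):
# tag_dictionary = corpus.make_tag_dictionary(tag_type="ner")
# new recommended way -- the only line changed between the two runs:
tag_dictionary = corpus.make_label_dictionary(label_type="ner")

tagger = SequenceTagger(
    hidden_size=256,
    embeddings=TransformerWordEmbeddings("KB/bert-base-swedish-cased"),  # placeholder embedding
    tag_dictionary=tag_dictionary,
    tag_type="ner",
)
ModelTrainer(tagger, corpus).train("resources/taggers/swedish-ner", max_epochs=10)
```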
With "make_tag_dictionary":
Results:
- F-score (micro) 0.4605
- F-score (macro) 0.4535
- Accuracy 0.3011
By class:
```
               precision    recall  f1-score   support
B                 0.5789    0.3894    0.4656       999
I                 0.6849    0.3257    0.4415       307
micro avg         0.5978    0.3744    0.4605      1306
macro avg         0.6319    0.3576    0.4535      1306
weighted avg      0.6038    0.3744    0.4599      1306
samples avg       0.3011    0.3011    0.3011      1306
```
With "make_label_dictionary":
Results:
- F-score (micro) 0.0302
- F-score (macro) 0.2986
- Accuracy 0.0156
By class:
```
               precision    recall  f1-score   support
<unk>             0.0000    0.0000    0.0000         0
B                 0.4382    0.4965    0.4655       999
I                 0.5000    0.3779    0.4304       307
micro avg         0.0156    0.4686    0.0302      1306
macro avg         0.3127    0.2914    0.2986      1306
weighted avg      0.4527    0.4686    0.4573      1306
```
Hello @lixiu911, thanks for reporting this - it looks to be the same problem as in #2761, reported by @NicolasVautier.
Could one of you provide some minimal sample data plus a script so we can reproduce the issue? On my datasets it does not appear.
Thanks for your reply, Alan. How much data do you need? My data is commercial, so if you want to test it I would have to send you some samples privately rather than publicly. (Swedish; 26,149 sentences in total, of which 25% are annotated with at least one named entity; split 70%/20%/10% into training, validation, and test.)
Thanks @lixiu911 - even a very small amount of data would be enough if we can reproduce the error with it, or different public data that shows the error. We just need some way to reproduce it so we can debug.
@alanakbik - Hi Alan.
Sample data - I have crafted some for replicating the issue; I just need to know how to get it to you.
Tag format - I suspect the issue is the tag format. In my own generated training data, I was inverting the entity class with the token position, e.g. LOC-B, LOC-I. In Flair 0.10 and earlier this had no impact, but I have noticed that Flair 0.11.3 now infers the tag set even when some parts of it do not exist in the data - that is, it makes a stricter assumption of either IOB (1 or 2) or BIOES. I have not been able to trace it in the code yet, but I suspect you are converting everything to BIOES. For example, when training NER on CoNLL-03 from scratch, the "E-" tags do NOT exist in the data, but the SequenceTagger still lists them in the predicted set, e.g.: SequenceTagger predicts: Dictionary with 19 tags: O, S-LOC, B-LOC, E-LOC, I-LOC, S-PER, B-PER, E-PER, I-PER, S-ORG, B-ORG, E-ORG, I-ORG, S-MISC, B-MISC, E-MISC, I-MISC, <START>, <STOP>
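A minimal sketch that reproduces the log line quoted above (assuming the built-in CONLL_03 loader and GloVe embeddings, not my actual generated data):

```python
from flair.datasets import CONLL_03
from flair.embeddings import WordEmbeddings
from flair.models import SequenceTagger

# the CONLL_03 loader expects the (licensed) dataset files to be available locally
corpus = CONLL_03()
tag_dictionary = corpus.make_label_dictionary(label_type="ner")

tagger = SequenceTagger(
    hidden_size=256,
    embeddings=WordEmbeddings("glove"),
    tag_dictionary=tag_dictionary,
    tag_type="ner",
)
# on 0.11.x, initialization logs the inferred tag set, including E-/S- tags
# even though CoNLL-03 itself is annotated in IOB:
#   SequenceTagger predicts: Dictionary with 19 tags: O, S-LOC, B-LOC, E-LOC, ...
```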
Related Questions:
- Tag Format - is there any performance improvement in using BIOES over IOB2?
- <unk>, <START>, <STOP> tags - these now appear by default. What is their relevance in NER when we already have 'O'? Should we remove them if they do not occur in our tagged data?
- multi_label & span_labels - these default to true; should we be setting them to false?
Hello @i4meta - yes, the new version of the SequenceTagger will now always infer the full BIO/BIOES tag set for any tag in the training data. Regarding the data: best via mail, or paste a temporary download link. To your questions:
- Some works have shown that BIOES slightly outperforms IOB2, and this is consistent with my observations.
- START and STOP are used only by the CRF layer; the UNK tag is for tags that do not appear in the training data, but we are currently debating taking it out again. We just merged a PR that removes UNK for sequence tagging (#2839) - it could be that this PR already fixes your issue (a quick way to check your dictionary is sketched after this list).
- multi_label should only be True if one data point can have more than one label (rare in NER). span_labels is True for NER.
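For example, you can inspect the dictionary before training to see whether <unk> was added (a sketch; the add_unk flag is an assumption about newer releases, so verify it exists in your installed version):

```python
from flair.datasets import ColumnCorpus

# placeholder corpus; substitute your own data folder and column map
corpus = ColumnCorpus("data/", {0: "text", 1: "ner"})

# print the items in the label dictionary and look for '<unk>'
label_dict = corpus.make_label_dictionary(label_type="ner")
print(label_dict.get_items())

# on releases that expose the flag, the dictionary can be built without UNK:
# label_dict = corpus.make_label_dictionary(label_type="ner", add_unk=False)
```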
Hi @alanakbik,
Thanks for the answers - some follow-on questions below:
1. START/STOP tags - You mentioned that the <START> and <STOP> tags are used by the CRF layer only. Should we be adding them, or does Flair inject them as needed?
2. Sample data - the sample data I have is CoNLL-03 converted to have inverted IOB tags (position after entity type, e.g. LOC-B, LOC-I). It helped me understand my issue, but may not be so useful for you ;) . However, if you confirm you still want it, I'll hit the send button =) .
3. Detected NER tag format - just as food for thought, it might be prudent to log, as part of processing, the detected NER tag format and what it is being converted to... at the very least this could help avoid some future tickets by informing the user.
4. BIOES slightly outperforms IOB2 - can you share any links to works that investigated this?
About 4: there is this and, more generally, this.
About 1: this is done by the tagger itself; you don't need to add them.
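In other words, training data stays in plain column format with only the O/B-/I- (or BIOES) tags; a generic example (not data from this thread):

```
George      B-PER
Washington  I-PER
went        O
to          O
Washington  B-LOC
```

Flair's CRF handles the <START>/<STOP> transitions internally, so they never appear in the data files.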
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.