[Question]: SpanClassifier For Overlapping NER Tasks?

Open zrjohnnyl opened this issue 5 months ago • 2 comments

Question

Is the SpanClassifier the correct model for training a Named Entity Recognition (NER) model with overlapping entities? I trained a SpanClassifier using only NER labels, where some labels overlap within a single column. After one training epoch, the model achieved a micro average F1 score of over 95%. However, when attempting predictions, the model returns null.

I know the SpanClassifier is the replacement for the EntityLinkingModel, but I saw in another thread that predicting overlapping entities requires a SpanTagger model, which inherits from EntityLinkingModel. Below is the code I used for the SpanClassifier. Previously I was using SequenceTagger, but I had to filter out overlapping entities.

from flair.embeddings import TransformerWordEmbeddings
from flair.models import SpanClassifier

# `corpus` is assumed to be a flair Corpus that was loaded earlier.
embedding = TransformerWordEmbeddings(
    model="xlm-roberta-base",
    layers="-1",
    subtoken_pooling="first",
    fine_tune=True,
    use_context=True,
)
label_type = "ner"
label_dict = corpus.make_label_dictionary(label_type=label_type, add_unk=True)

spanner = SpanClassifier(
    embeddings=embedding,
    label_dictionary=label_dict,
    label_type=label_type,
)

zrjohnnyl avatar Mar 15 '24 17:03 zrjohnnyl

Hi @zrjohnnyl, the SpanClassifier is there to further classify existing spans. It cannot perform NER by itself; rather, it refines existing named entities into finer-grained classes, e.g. classifying a person further as politician, musician, actor, ...

When you want to train overlapping NER models, you can consider training one NER model per entity type. That way a person can overlap with an organization. However, this won't solve overlaps within the same entity type (a person cannot overlap with another person).
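A minimal sketch of that per-type split in plain Python, independent of flair. It assumes a hypothetical span format of `(start_token, end_token_exclusive, label)`; the helper names `split_spans_by_type` and `same_type_overlaps` are made up for illustration. It groups spans by entity type (one group per model) and reports the same-type overlaps that this approach cannot represent:

```python
from collections import defaultdict


def split_spans_by_type(spans):
    """Group (start, end, label) spans by label, sorted by position.

    Each group could then feed the training data of one per-type NER model.
    """
    by_type = defaultdict(list)
    for start, end, label in spans:
        by_type[label].append((start, end))
    return {label: sorted(items) for label, items in by_type.items()}


def same_type_overlaps(by_type):
    """Return (label, earlier_span, later_span) for overlaps within one type."""
    conflicts = []
    for label, items in by_type.items():
        furthest = None  # earlier span that reaches furthest to the right
        for span in items:
            if furthest is not None and span[0] < furthest[1]:
                conflicts.append((label, furthest, span))
            if furthest is None or span[1] > furthest[1]:
                furthest = span
    return conflicts


# Hypothetical annotations: PER (0, 2) overlaps ORG (1, 4), which is fine
# across models; PER (6, 8) overlaps PER (7, 9), which is not.
spans = [(0, 2, "PER"), (1, 4, "ORG"), (6, 8, "PER"), (7, 9, "PER")]
by_type = split_spans_by_type(spans)
conflicts = same_type_overlaps(by_type)
```

Spans listed in `conflicts` would still need to be filtered or merged before training, since a single per-type tagger assigns at most one label sequence per token.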

helpmefindaname avatar Mar 22 '24 10:03 helpmefindaname

I guess that means SpanTagger will never make it to the main repo. I was hoping to avoid the two-model approach because my dataset is quite large and I don't want to duplicate my data to train two models. Are you allowed to pass a MultitaskModel into make_multitask_model_and_corpus? I ask because there are other tasks and datasets besides that one.

Can I do something like this?

multitask_corpus = Corpus(train=[parse_annotations(annotation) for annotation in train_annotations], dev=...., test=...)
multitask_model = MultitaskModel([model_1, model_2], use_all_tasks=True) 

multitask_model, multicorpus = make_multitask_model_and_corpus([
    (multitask_model, multitask_corpus),
    (model_3, corpus_3),
])

zrjohnnyl avatar Apr 04 '24 06:04 zrjohnnyl