[Question]: Extending NER tags of Hunflair
Question
I want to fine-tune the hunflair-gene model and extend the tags of the original model. Calling "previous_tag_dictionary.span_labels()" raises "AttributeError: 'Dictionary' object has no attribute 'span_labels'", but listing the items of the original tag dictionary with get_items() works:
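from flair.models import SequenceTagger

# Load the pretrained tagger and inspect its label dictionary
previous_tagger = SequenceTagger.load("hunflair-gene")
previous_tag_dictionary = previous_tagger.label_dictionary
previous_tag_dictionary.get_items()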
outputs ['<unk>', 'O', 'S-Gene', 'B-Gene', 'I-Gene', 'E-Gene', '<START>', '<STOP>'].
I have my annotated corpus which contains 2 tags - LIG and REC. I have converted them to a column-corpus and created a new tag dictionary from it.
from flair.datasets import ColumnCorpus

# Column 0 holds the token, column 1 the NER tag
columns = {0: 'text', 1: 'ner'}
corpus = ColumnCorpus(config["data_folder"], columns, train_file='train.txt', dev_file='val.txt', test_file="test.txt")
new_tag_dictionary = corpus.make_label_dictionary(label_type='ner', add_unk=False)
new_tag_dictionary.get_items()
Which outputs
2024-04-26 16:16:18,169 Dictionary created for label 'ner' with 2 values: LIG (seen 719 times), REC (seen 296 times)
['LIG', 'REC']
I want to fine-tune hunflair-gene on the new dataset. As I understand it, I need to create a new tag dictionary. When I try the following
for old_tag in previous_tag_dictionary.get_items():
    new_tag_dictionary.add_item(str(old_tag))
print(f"Updated tag dictionary : {new_tag_dictionary}")
it outputs
Updated tag dictionary : Dictionary with 10 tags: LIG, REC,
However, when I do
tagger_new = SequenceTagger(
    hidden_size=256,
    embeddings=previous_tagger.embeddings,
    tag_dictionary=new_tag_dictionary,
    tag_type='ner',
)
it outputs
2024-04-26 16:16:31,545 SequenceTagger predicts: Dictionary with 37 tags: O, S-LIG, B-LIG, E-LIG, I-LIG, S-REC, B-REC, E-REC, I-REC, S-O, B-O, E-O, I-O, S-S-Gene, B-S-Gene, E-S-Gene, I-S-Gene, S-B-Gene, B-B-Gene, E-B-Gene, I-B-Gene, S-I-Gene, B-I-Gene, E-I-Gene, I-I-Gene, S-E-Gene, B-E-Gene, E-E-Gene, I-E-Gene, S-<START>, B-<START>, E-<START>, I-<START>, S-<STOP>, B-<STOP>, E-<STOP>, I-<STOP>
These are too many tags. Any help will be appreciated.
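One possible reading of the log above, offered as a guess rather than a confirmed explanation: the tagger appears to treat every dictionary item except '<unk>' as a bare span label and to generate 'O' plus S-/B-/E-/I- variants for each one. With 9 such items that gives 1 + 4 * 9 = 37 tags, which matches the output. Under that assumption, the already-prefixed items ('S-Gene', 'B-Gene', ...) would need to be reduced to the bare label 'Gene' before merging. The sketch below is plain Python with no flair calls; the helper names `span_labels_from_bioes` and `expected_tag_count` are hypothetical and only illustrate the arithmetic.

```python
# Sketch only: plain Python, no flair import. Helper names are hypothetical.
BIOES_PREFIXES = ("B-", "I-", "E-", "S-")
SPECIAL_ITEMS = {"<unk>", "O", "<START>", "<STOP>"}


def span_labels_from_bioes(items):
    """Recover bare span labels (e.g. 'Gene') from BIOES-encoded tags."""
    labels = []
    for item in items:
        if item in SPECIAL_ITEMS:
            continue  # not an entity label
        if item.startswith(BIOES_PREFIXES):
            item = item[2:]  # drop the two-character prefix
        if item not in labels:
            labels.append(item)
    return labels


def expected_tag_count(labels):
    """Assumed expansion: 'O' plus S-/B-/E-/I- for each label."""
    return 1 + 4 * len(labels)


old_items = ['<unk>', 'O', 'S-Gene', 'B-Gene', 'I-Gene', 'E-Gene', '<START>', '<STOP>']
new_labels = ['LIG', 'REC']

# What seems to have happened: every old item except '<unk>' became a label.
naive_labels = new_labels + [item for item in old_items if item != '<unk>']
print(expected_tag_count(naive_labels))

# Merging bare span labels instead keeps the tag set small.
merged_labels = new_labels + span_labels_from_bioes(old_items)
print(merged_labels)
print(expected_tag_count(merged_labels))
```

If this reading is right, adding only the items of `merged_labels` to a fresh dictionary (instead of the raw output of `get_items()`) should avoid the doubled prefixes. Whether the resulting tagger can reuse the weights of the original prediction layer is a separate question that this sketch does not address.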