[Question]: Subtoken Labeling?
Question
I am working with a Named Entity Recognition (NER) dataset in offset format, where each label is defined by its start_index, end_index, and entity_type. My code converts each label from character-based indices to token-based indices and then adds it with sentence[start_token_index:end_token_index].add_label.
The problem is that when I tokenize a sentence, some labels only partially cover a token. Token-based indices don't work in such cases because a label boundary can fall inside a token. Here is an example.
"2008-2009"
2008 is the start_year, 2009 is the end_year.
The tokenizer might split this into the tokens ["2008-", "2009"]. I can't do sentence[0].add_label("start_year") since that token also contains "-". Is there a way to do subtoken labeling, or a better way to create Flair sentences from offset format?
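To make the problem concrete, here is roughly what my conversion code does (token.start_position and token.end_position hold each token's character offsets; the exact split depends on the tokenizer):

from flair.data import Sentence

text = "2008-2009"
labels = [(0, 4, "start_year"), (5, 9, "end_year")]  # offset format

sentence = Sentence(text)
for start, end, entity_type in labels:
    # find the tokens that fall entirely inside the character span
    covered = [i for i, token in enumerate(sentence)
               if token.start_position >= start and token.end_position <= end]
    if covered:
        sentence[covered[0]:covered[-1] + 1].add_label("ner", entity_type)
    # with a split like ["2008-", "2009"], no token lies fully inside
    # (0, 4), so covered is empty and the start_year label is lost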
Hello @quantarb, we've had such issues before. In this case, I first use a regular tokenizer and then additionally split all tokens on the offset positions to get the final tokenization. There is no helper function in Flair for this, so you would need to write your own tokenization code.
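Roughly like this (an untested sketch; split_on_offsets is my own helper here, not part of Flair, so adapt the boundary handling to your data):

from flair.data import Sentence

def split_on_offsets(text, offsets):
    # offsets is a list of (start_index, end_index) pairs from the labels;
    # every such position becomes a forced token boundary
    boundaries = {pos for start, end in offsets for pos in (start, end)}
    tokens, token_start = [], None
    for i, char in enumerate(text):
        if char.isspace() or i in boundaries:
            if token_start is not None and i > token_start:
                tokens.append(text[token_start:i])
            token_start = None if char.isspace() else i
        elif token_start is None:
            token_start = i
    if token_start is not None:
        tokens.append(text[token_start:])
    return tokens

# "2008-2009" with spans (0, 4) and (5, 9) yields ["2008", "-", "2009"],
# so each label now maps onto whole tokens:
tokens = split_on_offsets("2008-2009", [(0, 4), (5, 9)])
sentence = Sentence(tokens)  # build the sentence from the pre-split tokens
sentence[0:1].add_label("ner", "start_year")
sentence[2:3].add_label("ner", "end_year")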
Hi @alanakbik, thank you for your quick response. I tried to split up the tokens based on the offset positions, but I'm having problems reconstructing my original Flair sentence from tokens. What is the best way to reconstruct a Flair sentence from tokens?
I tried several different approaches, but my new_sentence never matches the original sentence.
text = """ BLAH BLAH BLAH BLAH"""
old_sentence = Sentence(text)
tokens = [Token(token.text) for token in sentence]
new_sentence = Sentence(tokens)
text = """ BLAH BLAH BLAH BLAH"""
old_sentence = Sentence(text)
tokens = [Token(text[token.start_position:token.end_position]) for token in sentence]
new_sentence = Sentence(tokens)
It might have something to do with the fact that some (all?) tokenizers are lossy. You can try with a different tokenizer:
# your_tokenizer, raw and tagger stand in for your own tokenizer, input text and model
tokenized = your_tokenizer.tokenize(raw)  # a plain list of token strings
#print(tokenized)
sentence = Sentence(tokenized)  # Sentence also accepts a pre-tokenized list
tagger.predict(sentence)
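One concrete source of lossiness in your reconstruction attempts: Token(token.text) drops the whitespace information, so Flair assumes a space after every rebuilt token. A sketch that also copies whitespace_after when rebuilding (note this attribute is a character count in recent Flair releases but a boolean in older ones, and tokens you split yourself need it set to 0 / False for the inner pieces):

from flair.data import Sentence, Token

old_sentence = Sentence("The years 2008-2009 were great")
tokens = []
for old_token in old_sentence:
    # carry over the whitespace info, not just the surface text
    new_token = Token(old_token.text, whitespace_after=old_token.whitespace_after)
    tokens.append(new_token)
new_sentence = Sentence(tokens)

print(old_sentence.to_original_text())
print(new_sentence.to_original_text())  # should now match the original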