
[Question]: Subtoken Labeling?

Open quantarb opened this issue 1 year ago • 3 comments

Question

I am working with a Named Entity Recognition (NER) dataset in offset format, where each label is defined by its start_index, end_index, and entity_type. My code converts each label from character-based indices to token-based indices, and then attaches each label with sentence[start_token_index:end_token_index].add_label.

The problem is that when I tokenize a sentence, some labels span only part of a token. Token-based indices don't work in such cases because the label covers only a portion of the token. Here is an example.

"2008-2009"

2008 is the start_year 2009 is the end_year

The tokenizer might split the text into the tokens ["2008-", "2009"]. I can't do sentence[0].add_label("start_year") since that token also contains "-". Is there a way to do subtoken labeling, or a better way to create flair sentences from offset format?

quantarb avatar Oct 26 '23 04:10 quantarb

Hello @quantarb we've had such issues before. In this case, I first use a regular tokenizer, and then additionally split all tokens on the offset positions to get the final tokenization. There is no helper function in Flair for this, so you would need to write your own tokenization code.
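The splitting step described above can be sketched in plain Python. This is not a Flair API, just an illustration of the idea: collect all character offsets that appear as label boundaries, then cut any token span that a boundary falls inside. The function name and the (start, end) span representation are my own choices for the sketch.

```python
def split_tokens_on_offsets(token_spans, label_offsets):
    """Split (start, end) token spans so that every label boundary
    coincides with a token boundary. Illustrative sketch only."""
    # every label start/end is a position where a token may need to be cut
    boundaries = sorted({p for start, end in label_offsets for p in (start, end)})
    result = []
    for start, end in token_spans:
        # boundaries strictly inside this token force a split
        cuts = [b for b in boundaries if start < b < end]
        pos = start
        for cut in cuts:
            result.append((pos, cut))
            pos = cut
        result.append((pos, end))
    return result


text = "2008-2009"
# suppose the regular tokenizer produced ["2008-", "2009"]:
token_spans = [(0, 5), (5, 9)]
# character-offset labels: (start, end, entity_type)
labels = [(0, 4, "start_year"), (5, 9, "end_year")]

new_spans = split_tokens_on_offsets(token_spans, [(s, e) for s, e, _ in labels])
# new_spans == [(0, 4), (4, 5), (5, 9)], i.e. ["2008", "-", "2009"],
# so each label now aligns exactly with whole tokens
```

With the spans aligned like this, each label maps cleanly onto a contiguous run of tokens and can be attached with the usual span-based add_label call.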

alanakbik avatar Oct 26 '23 12:10 alanakbik

Hi @alanakbik , thank you for your quick response. I tried to split up the tokens based on the offset positions, but I'm having problems reconstructing my original flair sentence from tokens. What is the best way to reconstruct a flair sentence from tokens?

I tried several different approaches but my new_sentence never matches the original sentence.

from flair.data import Sentence, Token

# attempt 1: rebuild the sentence from the token texts
text = """ BLAH BLAH BLAH BLAH"""
old_sentence = Sentence(text)
tokens = [Token(token.text) for token in old_sentence]
new_sentence = Sentence(tokens)

# attempt 2: rebuild the sentence from character offsets into the raw text
text = """ BLAH BLAH BLAH BLAH"""
old_sentence = Sentence(text)
tokens = [Token(text[token.start_position:token.end_position]) for token in old_sentence]
new_sentence = Sentence(tokens)

quantarb avatar Oct 26 '23 21:10 quantarb

It might have something to do with the fact that some (all?) tokenizers are lossy. You can try with a different tokenizer:

tokenized = your_tokenizer.tokenize(raw)
#print(tokenized)
sentence = Sentence(tokenized)
tagger.predict(sentence)
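The lossiness mentioned above can be demonstrated without Flair at all: a plain whitespace tokenizer discards the exact spacing, so a string rebuilt from token texts need not match the original byte for byte. A minimal illustration (the example text is made up):

```python
text = "2008-2009  BLAH BLAH"  # note the double space

# a whitespace tokenizer keeps the token texts but drops the exact whitespace
tokens = text.split()

# rejoining with single spaces cannot recover the original spacing
reconstructed = " ".join(tokens)
print(reconstructed == text)  # False: the double space is lost
```

This is why a new_sentence built only from token texts can never be guaranteed to equal the original sentence; the whitespace information has to be carried along separately (for example via each token's character offsets into the raw text).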

MostHumble avatar Oct 27 '23 10:10 MostHumble