[Bug]: Exception raised when creating Sentence from text with apostrophes when using SegtokTokenizer
Describe the bug
A ValueError (substring not found) is raised when creating a Sentence from the text "John Oʼneill’s construction site".
The issue originates from SegtokTokenizer.tokenize("John Oʼneill’s construction site"), which returns ['John', 'Oʼneill', 'O', 'ʼneill’s', 'construction', 'site']. This does not seem correct: the token 'Oʼneill' is followed by the overlapping fragments 'O' and 'ʼneill’s'.
To Reproduce
from flair.data import Sentence
text = "John Oʼneill’s construction site"
sentence = Sentence(text)
Expected behavior
A Sentence object is created successfully from the text.
Logs and Stack traces
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[5], line 4
1 from flair.data import Sentence
3 text = "John Oʼneill’s construction site"
----> 4 sentence = Sentence(text)
File ~/.../.venv/lib/python3.12/site-packages/flair/data.py:868, in Sentence.__init__(self, text, use_tokenizer, language_code, start_position)
866 previous_token: Optional[Token] = None
867 for word in words:
--> 868 word_start_position: int = text.index(word, current_offset)
869 delta_offset: int = word_start_position - current_offset
871 token: Token = Token(text=word, start_position=word_start_position)
ValueError: substring not found
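The failing call in the traceback can be reproduced without flair. The sketch below is a simplified stand-in for the offset loop, not the actual flair implementation: `align_tokens` is a hypothetical helper, and the way `current_offset` advances is an assumption. It only illustrates why a token list with overlapping fragments makes `text.index` fail.

```python
def align_tokens(text: str, words: list[str]) -> list[int]:
    """Return the start offset of each word, searching left to right."""
    offsets = []
    current_offset = 0
    for word in words:
        # Same call as in the traceback: search only from current_offset onward.
        start = text.index(word, current_offset)
        offsets.append(start)
        # Assumption: the next search resumes right after the word just found.
        current_offset = start + len(word)
    return offsets


# U+02BC (modifier letter apostrophe) and U+2019 (right single quotation mark)
text = "John O\u02bcneill\u2019s construction site"

# Token list reported in this issue; "O" and "\u02bcneill\u2019s" overlap "O\u02bcneill".
reported_tokens = ["John", "O\u02bcneill", "O", "\u02bcneill\u2019s", "construction", "site"]

try:
    align_tokens(text, reported_tokens)
except ValueError as err:
    # No "O" exists after the end of "O\u02bcneill", so the search fails.
    print(f"alignment failed: {err}")
```

Once 'Oʼneill' has been consumed, the search for 'O' starts past the only capital O in the text, which is consistent with the exception being raised in the loop rather than in the tokenizer itself.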
Environment
Versions:
Flair: 0.15.0
PyTorch: 2.5.1
Transformers: 4.40.2
GPU: False
Hello @dropther, thanks for reporting this. The error seems to be caused by one of the functions in segtok, the library we use for tokenization:
from segtok.tokenizer import word_tokenizer, split_contractions
text = "John Oʼneill’s construction site"
# this part is ok
tokens = word_tokenizer(text)
print(tokens)
# the error happens here
after_split = split_contractions(tokens)
print(after_split)
If you replace ʼ with ' it works, so a quick workaround for now is to make this replacement in your text before creating the Sentence.
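A minimal sketch of that workaround, assuming the problematic character is U+02BC (modifier letter apostrophe). `normalize_apostrophes` is a hypothetical helper name, not part of flair or segtok.

```python
MODIFIER_APOSTROPHE = "\u02bc"  # the character that trips split_contractions here


def normalize_apostrophes(text: str) -> str:
    """Replace U+02BC with a plain ASCII apostrophe before tokenization."""
    return text.replace(MODIFIER_APOSTROPHE, "'")


text = "John O\u02bcneill\u2019s construction site"
cleaned = normalize_apostrophes(text)
print(cleaned)

# With flair installed, the cleaned text would then be passed on:
# from flair.data import Sentence
# sentence = Sentence(cleaned)
```

Because one character is swapped for exactly one other character, offsets in the cleaned text still line up with offsets in the original text.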
For me, this also happens when \r appears in the text, for example near the beginning of the sentence:
s = 'O-\rBEG, sopros\rABD.'
sent = Sentence(s)
Result: ValueError: substring not found
s = 'O-\rBEG, sopros\rABD.'.replace('\r','')
sent = Sentence(s)
Result: OK
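The same cleanup can be wrapped in a small helper so it is applied consistently before building a Sentence. This is only a sketch: `strip_carriage_returns` and its `replacement` parameter are hypothetical, and only the empty-string replacement was reported as working in this thread.

```python
def strip_carriage_returns(text: str, replacement: str = "") -> str:
    """Remove (or replace) carriage returns before tokenization."""
    return text.replace("\r", replacement)


s = "O-\rBEG, sopros\rABD."
cleaned = strip_carriage_returns(s)
print(repr(cleaned))

# With flair installed, the cleaned text would then be passed on:
# from flair.data import Sentence
# sent = Sentence(cleaned)
```

Note that deleting \r shortens the string, so token positions in the cleaned text no longer match positions in the original. It also joins the words on either side of each \r. Passing a space as `replacement` would avoid both effects, but whether that also avoids the error was not tested here.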