[Bug]: Exception raised when creating Sentence from text with apostrophes when using SegtokTokenizer

Open dropther opened this issue 11 months ago • 2 comments

Describe the bug

A ValueError: substring not found exception is raised when trying to create a Sentence from the text "John Oʼneill’s construction site".

The issue originates from SegtokTokenizer.tokenize("John Oʼneill’s construction site"), which returns ['John', 'Oʼneill', 'O', 'ʼneill’s', 'construction', 'site']. This does not seem correct.

To Reproduce

from flair.data import Sentence

text = "John Oʼneill’s construction site"
sentence = Sentence(text)

Expected behavior

Creating a sentence object successfully from the text.

Logs and Stack traces

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[5], line 4
      1 from flair.data import Sentence
      3 text = "John Oʼneill’s construction site"
----> 4 sentence = Sentence(text)

File ~/.../.venv/lib/python3.12/site-packages/flair/data.py:868, in Sentence.__init__(self, text, use_tokenizer, language_code, start_position)
    866 previous_token: Optional[Token] = None
    867 for word in words:
--> 868     word_start_position: int = text.index(word, current_offset)
    869     delta_offset: int = word_start_position - current_offset
    871     token: Token = Token(text=word, start_position=word_start_position)

ValueError: substring not found
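
To illustrate why that token list triggers the exception, here is a minimal sketch of the offset search shown in the traceback. It is a simplified stand-in, not flair's actual code: the helper name first_unalignable is hypothetical, and it assumes the search offset advances to the end of each matched token.

```python
# Simplified stand-in for the offset loop in the traceback; not flair's actual code.
from typing import List, Optional


def first_unalignable(text: str, words: List[str]) -> Optional[str]:
    """Return the first token that cannot be found at or after the running offset."""
    current_offset = 0
    for word in words:
        try:
            start = text.index(word, current_offset)
        except ValueError:
            return word
        # Assumption: the search resumes right after the token just matched.
        current_offset = start + len(word)
    return None


text = "John Oʼneill’s construction site"
# Token list exactly as reported in this issue.
reported_tokens = ["John", "Oʼneill", "O", "ʼneill’s", "construction", "site"]

print(first_unalignable(text, reported_tokens))
```

Under these assumptions, 'Oʼneill' is matched first, so the following token 'O' has to be found in the remaining text "’s construction site", which contains no uppercase O. A token list without the overlapping fragments, such as ['John', 'Oʼneill’s', 'construction', 'site'], aligns without error.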

Environment

Versions:

Flair: 0.15.0
PyTorch: 2.5.1
Transformers: 4.40.2
GPU: False

dropther, Jan 06 '25 13:01

Hello @dropther, thanks for reporting this. It seems the error is caused by one of the functions in segtok, the library we use for tokenization:

from segtok.tokenizer import word_tokenizer, split_contractions

text = "John Oʼneill’s construction site"

# this part is ok
tokens = word_tokenizer(text)
print(tokens)

# the error happens here
after_split = split_contractions(tokens)
print(after_split)

If you replace ʼ with ' it works. So a quick workaround for now would be to make this replacement on your text.
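
A minimal sketch of that replacement, assuming the only problematic character is the ʼ from the example (it appears to be the modifier letter apostrophe, not the ASCII one). The helper name normalize_apostrophes is hypothetical and not part of flair or segtok.

```python
def normalize_apostrophes(text: str) -> str:
    """Replace the apostrophe variant from this issue with a plain ASCII apostrophe."""
    # One-for-one replacement, so the string length does not change.
    return text.replace("ʼ", "'")


text = "John Oʼneill’s construction site"
cleaned = normalize_apostrophes(text)
print(cleaned)

# With flair installed, the cleaned text would then be passed on:
# from flair.data import Sentence
# sentence = Sentence(cleaned)
```

Because each character is replaced by exactly one character, token start positions computed on the cleaned text still line up with the original string. The ’ in "’s" is left untouched, since the comment above only mentions replacing ʼ.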

alanakbik, Jan 07 '25 12:01

For me, it happens when \r appears near the beginning of the sentence:

s = 'O-\rBEG, sopros\rABD.'
sent = Sentence(s)

Result: ValueError: substring not found

s = 'O-\rBEG, sopros\rABD.'.replace('\r','')
sent = Sentence(s)

Result: OK

heukirne, Jan 22 '25 19:01

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot], Jun 27 '25 04:06