[Bug]: splitter.split() `ValueError: substring not found` for specific character combination

Open TimBMK opened this issue 1 year ago • 4 comments

Describe the bug

I have the following text string producing the ValueError when calling splitter.split() on it:

"RT @gruenethl: #GRÜNundWichtig im Juli-#PlenumTH\n\nAuf dem Programm: \n🌻Grünes #KlimaKonjunkturPrgramm\u2029\n🗄️ Aufbewahrung der Akten der #NSU-Untersuchungs-\u2028ausschüsse\u2029\n🧑‍💻#Digitalisierung in Schulen \n\nMehr Infos & Livestream \n📺 https://t.co/UU3mfRQ77t https://t.co/PYCBL7LUYh"

After some testing (removing emojis, etc.), I traced the error to the very specific string "s-\u2028ausschüsse". When exactly this combination is passed to splitter.split(), it raises the ValueError.

To Reproduce

from flair.splitter import SegtokSentenceSplitter

splitter = SegtokSentenceSplitter()

# the first half of the string works:
splitter.split("RT @gruenethl: #GRÜNundWichtig im Juli-#PlenumTH\n\nAuf dem Programm: \nGrünes #KlimaKonjunkturPrgramm\u2029\n Aufbewahrung der Akten der #NSU-Untersuchung")

# this produces the error:
splitter.split("s-\u2028ausschüsse")

# other combinations of this string are fine:
splitter.split("s-\u2028")
splitter.split("\u2028ausschüsse")

Expected behavior

I'm not sure why this specific string causes the error. It's easy enough to remove it in this one instance, but since I'm processing very large amounts of text, it is practically impossible to anticipate other problematic strings beforehand.

Logs and Stack traces

ValueError: substring not found

Screenshots

No response

Additional Context

No response

Environment

Versions:

Flair

0.13.1

Pytorch

2.2.0+cu121

Transformers

4.37.2

GPU

False

TimBMK avatar Feb 05 '24 12:02 TimBMK

Hi @TimBMK, I take the original tweet as reference: [image]

I assume that the characters \u2028 (LINE SEPARATOR) and \u2029 (PARAGRAPH SEPARATOR) are there only for display reasons, have no semantic meaning, and should therefore be ignored in NLP.

You can test my fix on https://github.com/flairNLP/flair/pull/3404

helpmefindaname avatar Feb 09 '24 14:02 helpmefindaname

Awesome, thanks for the quick fix! Yes, I agree, they can absolutely be ignored, and my workaround was to simply drop them before running the pipeline. My concern was more that the very specific combination of characters (precisely "s-\u2028ausschüsse", while "s-\u2028" and "\u2028ausschüsse" were fine) broke the splitter. I'm not sure if this points to a larger, underlying problem, as separators by themselves do not seem to break it. One way or the other, simply dropping the (semantically meaningless) separators should do the trick!

TimBMK avatar Feb 09 '24 14:02 TimBMK

The algorithm works fine if such a symbol is at the start or end of a token, but breaks if it is in the middle of one.

In the example, the sentence is s-\u2028ausschüsse, while the SegtokTokenizer removes that symbol and returns ['s-ausschüsse'] as tokens. The ValueError then occurs because s-ausschüsse is not a substring of s-\u2028ausschüsse.

helpmefindaname avatar Feb 09 '24 14:02 helpmefindaname
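
The failure mode described above can be reproduced without flair. The following is a minimal sketch, not flair's actual code: `locate_tokens` is a hypothetical helper that assumes token offsets are recovered by searching for each token in the original text with `str.index`, which is the call that raises `ValueError: substring not found`.

```python
def locate_tokens(text, tokens):
    """Find each token's start offset in text, scanning left to right.

    Hypothetical stand-in for an offset lookup; not taken from flair.
    """
    offsets = []
    cursor = 0
    for token in tokens:
        # str.index raises ValueError if the token is not in text
        start = text.index(token, cursor)
        offsets.append(start)
        cursor = start + len(token)
    return offsets


# Separator in the middle of a token: the tokenizer output (as reported
# in this thread) no longer matches any substring of the original text.
try:
    locate_tokens("s-\u2028ausschüsse", ["s-ausschüsse"])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Separator at the edge of a token: the token is still a substring.
edge_offsets = locate_tokens("s-\u2028", ["s-"])

print(error_message)
print(edge_offsets)
```

This is consistent with the observation that "s-\u2028" and "\u2028ausschüsse" work: once the separator is stripped from an edge, the remaining token still occurs verbatim in the input.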

Ah that makes sense. Thanks for the explanation

TimBMK avatar Feb 09 '24 14:02 TimBMK

I've found a similar problem. The string "\r" also seems to cause a ValueError: substring not found. Removing it beforehand fixes the problem.

TimBMK avatar Feb 29 '24 15:02 TimBMK
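
The workaround mentioned in this thread (dropping the offending characters before running the pipeline) could look roughly like the sketch below. The helper name `strip_problem_chars` and the character table are illustrative assumptions; the table covers only the three characters reported here, and other characters may trigger the same error.

```python
# Characters reported in this thread as triggering the error.
# Mapping a code point to None makes str.translate delete it.
PROBLEM_CHARS = {
    ord("\u2028"): None,  # LINE SEPARATOR
    ord("\u2029"): None,  # PARAGRAPH SEPARATOR
    ord("\r"): None,      # carriage return
}


def strip_problem_chars(text: str) -> str:
    """Return text with the reported problem characters removed."""
    return text.translate(PROBLEM_CHARS)


cleaned = strip_problem_chars("s-\u2028ausschüsse")
print(cleaned)
```

The cleaned string would then be passed to splitter.split() in place of the raw text. Deleting the characters follows the "simply drop them" approach described above; mapping them to a space instead is an alternative if words on either side of a separator should not be joined.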