flair
[Bug]: splitter.split() `ValueError: substring not found` for specific character combination
Describe the bug
I have the following text string, which produces a ValueError when splitter.split() is called on it:
"RT @gruenethl: #GRÜNundWichtig im Juli-#PlenumTH\n\nAuf dem Programm: \n🌻Grünes #KlimaKonjunkturPrgramm\u2029\n🗄️ Aufbewahrung der Akten der #NSU-Untersuchungs-\u2028ausschüsse\u2029\n🧑💻#Digitalisierung in Schulen \n\nMehr Infos & Livestream \n📺 https://t.co/UU3mfRQ77t https://t.co/PYCBL7LUYh"
After some testing (removing emojis etc.), I could trace the error to the very specific string "s-\u2028ausschüsse". When this specific combination gets passed to splitter.split(), it taps out for some reason.
To Reproduce
from flair.splitter import SegtokSentenceSplitter
splitter = SegtokSentenceSplitter()
# the first half of the string works:
splitter.split("RT @gruenethl: #GRÜNundWichtig im Juli-#PlenumTH\n\nAuf dem Programm: \nGrünes #KlimaKonjunkturPrgramm\u2029\n Aufbewahrung der Akten der #NSU-Untersuchung")
# this produces the error:
splitter.split("s-\u2028ausschüsse")
# other combinations of this string are fine:
splitter.split("s-\u2028")
splitter.split("\u2028ausschüsse")
Expected behavior
I'm not sure why this specific string causes the error. It's easy enough to remove it in this one instance, but since I'm processing very large amounts of text, it is practically impossible to anticipate other problematic strings beforehand.
Logs and Stack traces
ValueError: substring not found
Screenshots
No response
Additional Context
No response
Environment
Versions:
Flair: 0.13.1
Pytorch: 2.2.0+cu121
Transformers: 4.37.2
GPU: False
hi @TimBMK
Taking the original tweet as reference, I assume that the characters \u2028 (LINE SEPARATOR) and \u2029 (PARAGRAPH SEPARATOR) are there only for display reasons, have no semantic meaning, and should therefore be ignored in NLP.
You can test my fix on https://github.com/flairNLP/flair/pull/3404
Awesome, thanks for the quick fix! Yes, I agree, they can absolutely be ignored, and my workaround was to simply drop them before running the pipeline. My concern was more that the very specific combination of characters (precisely "s-\u2028ausschüsse", while "s-\u2028" and "\u2028ausschüsse" were fine) broke the splitter. I'm not sure if this may point to a larger, underlying problem, as separators in themselves do not seem to break it. One way or the other, simply dropping the (semantically meaningless) separators should do the trick!
The algorithm works fine if such symbols are at the start or end of a token, but breaks if one is in the middle of a token.
In the example, the sentence is s-\u2028ausschüsse, while the SegtokTokenizer removes that symbol and returns ['s-ausschüsse'] as tokens. The error from the index lookup then occurs because s-ausschüsse is not a substring of s-\u2028ausschüsse.
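To illustrate the explanation above, here is a minimal sketch of the failure mode in plain Python. It is not Flair's actual code: `locate_tokens` is a hypothetical helper that looks up each token's offset in the original text with `str.index`, which is assumed to be roughly what happens internally. The token lists are written by hand to mirror what the tokenizer is described as returning.

```python
def locate_tokens(text, tokens):
    """Find each token's start offset by scanning the text left to right.

    Hypothetical helper for illustration only, not part of Flair.
    """
    offsets = []
    cursor = 0
    for token in tokens:
        # str.index raises ValueError if the token is not in the text
        start = text.index(token, cursor)
        offsets.append(start)
        cursor = start + len(token)
    return offsets


# Separator at the end or start of a token: the token is still a substring.
print(locate_tokens("s-\u2028", ["s-"]))
print(locate_tokens("\u2028ausschüsse", ["ausschüsse"]))

# Separator in the middle: the cleaned token no longer occurs in the text.
try:
    locate_tokens("s-\u2028ausschüsse", ["s-ausschüsse"])
except ValueError as err:
    print(f"ValueError: {err}")
```

This would also explain why "s-\u2028" and "\u2028ausschüsse" pass on their own while the combined string fails.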
Ah that makes sense. Thanks for the explanation
I've found a similar problem. The string "\r" equally seems to cause a ValueError: substring not found. Removing it beforehand fixes the problem.
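For anyone needing the workaround described in this thread, a minimal sketch of dropping the problematic characters before splitting could look like this. `drop_separators` is a hypothetical helper name, and the set of characters is only the three reported here (\u2028, \u2029 and \r); other characters may need the same treatment.

```python
# Characters reported in this thread as breaking the splitter.
# Assumption: this list is not exhaustive.
_DROP_TABLE = {ord(ch): None for ch in "\u2028\u2029\r"}


def drop_separators(text: str) -> str:
    """Return the text with the reported separator characters removed."""
    return text.translate(_DROP_TABLE)


cleaned = drop_separators("s-\u2028ausschüsse")
print(repr(cleaned))
```

The cleaned string can then be passed to splitter.split() as before. Note that dropping the characters joins the text on either side of them; mapping them to a space instead is an alternative if that is a concern.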