pySBD

Exception when clean=True in search_for_connected_sentences

Open balazik opened this issue 4 years ago • 2 comments

Describe the bug Segmenter raises "exception: bad escape (end of pattern) at position" when it is initialized with clean=True and encounters a sentence like "etc.Png,Jpg,.\" (a word/token that contains a backslash).

The exception is raised in module cleaner.py, class Cleaner, method search_for_connected_sentences, at this line:

txt = re.sub(re.escape(word), new_word, txt)

To Reproduce Steps to reproduce the behavior:

import pysbd

# This is a simplified example; the original text contained names, so I changed it to image formats.
# The word is an abbreviation with a dot, followed by an upper-case letter and a backslash.
sentencer = pysbd.Segmenter(language="en", clean=True)
txt = "etc.Png,Jpg,.\\"  # Python literal for text ending in a single backslash
sentences = sentencer.segment(txt)  # raises re.error

Expected behavior The output should keep the text as is, but it should not throw an exception. A workaround to see the output is to escape the backslash.

sentencer = pysbd.Segmenter(language="en", clean=True)
txt = "etc.Png,Jpg,.\\\\"
sentences = sentencer.segment(txt)

Expected output:

['etc.', 'Png,Jpg,.', '\\']

Possible solution Replace txt = re.sub(re.escape(word), new_word, txt) with txt = txt.replace(word, new_word). This avoids the pitfalls of regular expressions (such as escaping) and is generally faster.
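The difference between the two calls can be sketched with the standard library alone. The values of word and new_word below are hypothetical stand-ins for what search_for_connected_sentences might hold at that point; the actual values depend on pySBD's cleaning rules. The sketch illustrates that re.sub parses a string replacement as a template, whereas str.replace (or a callable replacement) uses it literally.

```python
import re

# Hypothetical stand-ins for `word` / `new_word`; the real values come from
# pySBD's cleaner and may differ.
word = "Jpg,.\\"       # token ending in a single backslash
new_word = "Jpg,. \\"  # assumed cleaned form, still ending in a backslash
txt = "etc.Png," + word

# re.escape only protects the *pattern*. The replacement string is parsed as a
# template, so a lone trailing backslash is rejected as an incomplete escape.
try:
    re.sub(re.escape(word), new_word, txt)
    error_message = None
except re.error as exc:
    error_message = str(exc)

# str.replace substitutes literally; backslashes have no special meaning.
literal_result = txt.replace(word, new_word)

# A callable replacement is also inserted as is, without template parsing.
callable_result = re.sub(re.escape(word), lambda match: new_word, txt)

print(error_message)
print(literal_result == callable_result)
```

If the regular-expression match has to stay for other reasons, the callable form is one alternative; for a plain literal substitution, str.replace is the simpler choice.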

Additional context Originally we parsed small text files (in Slovak) without any special treatment to build a large sentence-segmented corpus. The example was crafted specifically to reproduce the behavior with the English parser. I know that the backslash combination is rare in English, but it does occur in Slovak articles when you process vast amounts of text.

balazik Feb 16 '21 13:02

Additional Case:

I also ran into this in Spanish text with the string 1.C\ ... I assume it is the same problem:

re.error: bad escape (end of pattern) at position 4
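A minimal sketch of how this position could arise, assuming (this is not confirmed in the thread) that the cleaner rewrites 1.C\ to 1. C\ by inserting a space after the period. Under that assumption the lone backslash sits at index 4 of the replacement string, which would suggest the error comes from the replacement rather than from the escaped pattern.

```python
import re

word = "1.C\\"       # the reported token: 1.C followed by one backslash
new_word = "1. C\\"  # assumption: the cleaner inserts a space after the period

try:
    # Same call shape as in search_for_connected_sentences.
    re.sub(re.escape(word), new_word, word)
    caught = None
except re.error as exc:
    caught = exc

# exc.pos is the index of the offending character in the string being parsed.
print(caught)
print(new_word.index("\\"))
```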

kevmurray May 06 '21 21:05