
Incorrect splitting into sentences

Open antonvasilev52 opened this issue 3 years ago • 0 comments

Hi all! I am talking about this regular expression, which is later used to count sentences:

SENTENCE_REGEX = /[^\.!?\s][^\.!?]*(?:[\.!?](?!['"]?\s|$)[^\.!?]*)*[\.!?]?['"]?(?=\s|$)/
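The expression can be tried directly with `String#scan`. This is only a minimal sketch: the sample strings are made up for illustration, and the way the gem itself calls the expression is not shown here.

```ruby
# The expression quoted above, copied verbatim.
SENTENCE_REGEX = /[^\.!?\s][^\.!?]*(?:[\.!?](?!['"]?\s|$)[^\.!?]*)*[\.!?]?['"]?(?=\s|$)/

# Plain text without abbreviations is split as expected.
p "This is one. This is two!".scan(SENTENCE_REGEX)
# => ["This is one.", "This is two!"]

# A title followed by a period is treated as a sentence of its own.
p "Mr. Smith is a doctor".scan(SENTENCE_REGEX)
# => ["Mr.", "Smith is a doctor"]
```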

For texts like "Mr. Smith is a doctor" this gives two sentences, ["Mr.", "Smith is a doctor"], which results in incorrect readability scores. Maybe there is a way to improve the expression so that it excludes some common titles (such as "Mr" or "Dr")?
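One possible way to exclude common titles, sketched under assumptions: split naively first, then glue a fragment back onto the previous one when that one ends in a known title. The `TITLES` list and the `split_sentences` helper are hypothetical names for illustration and are not part of the gem.

```ruby
# Hypothetical list of titles to protect; extend as needed.
TITLES = %w[Mr Mrs Ms Dr Prof].freeze

# Hypothetical helper: naive split, then merge fragments after a title.
def split_sentences(text)
  # First pass: split on whitespace that follows ".", "!" or "?".
  fragments = text.split(/(?<=[.!?])\s+/)

  fragments.each_with_object([]) do |fragment, sentences|
    # Last whitespace-delimited word of the previous sentence, if any.
    last_word = sentences.last.to_s[/\S+\z/].to_s

    if last_word.end_with?(".") && TITLES.include?(last_word.chomp("."))
      # The previous "sentence" ended in a title, so this fragment continues it.
      sentences[-1] = "#{sentences.last} #{fragment}"
    else
      sentences << fragment
    end
  end
end

p split_sentences("Mr. Smith is a doctor. Mrs. Jones is a lawyer.")
# => ["Mr. Smith is a doctor.", "Mrs. Jones is a lawyer."]
```

A sentence that genuinely ends in one of the listed titles would be merged with the next one, and merged fragments are rejoined with a single space, so this is a trade-off and not a complete fix.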

I am not very good with the scan method, but if we use split instead, an expression similar to this could probably work:

(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|!|(\."))\s

This is far from perfect either: it keeps "Mr." and "Dr." attached to the sentence but still splits after "Mrs." (still better than nothing ☺).
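A quick check of the split expression above, as a sketch with made-up sample strings. The constant name `SPLIT_REGEX` is hypothetical, and the parentheses around `\."` are dropped here on purpose, because `String#split` also returns the text of capture groups.

```ruby
# Non-capturing variant of the split expression (hypothetical constant name).
SPLIT_REGEX = /(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|!|\.")\s/

# Two-letter titles are protected by the second lookbehind.
p "Mr. Smith is a doctor. He works in London.".split(SPLIT_REGEX)
# => ["Mr. Smith is a doctor.", "He works in London."]

p "Dr. Jones arrived. She was late.".split(SPLIT_REGEX)
# => ["Dr. Jones arrived.", "She was late."]

# A three-letter title slips through and is split off.
p "Mrs. Smith is a doctor.".split(SPLIT_REGEX)
# => ["Mrs.", "Smith is a doctor."]
```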

antonvasilev52 · May 13 '21 21:05