odyssey
Incorrect splitting into sentences
Hi all! I am talking about this regular expression, which is later used to count sentences:
SENTENCE_REGEX = /[^\.!?\s][^\.!?]*(?:[\.!?](?!['"]?\s|$)[^\.!?]*)*[\.!?]?['"]?(?=\s|$)/
For texts like "Mr. Smith is a doctor" this will give two sentences: ["Mr.", "Smith is a doctor"], resulting in incorrect readability scores.
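Here is a minimal way to reproduce it, assuming the regular expression is applied with String#scan (the text and sentences variables are just for illustration, not names from the gem):

```ruby
# Regular expression copied verbatim from above.
SENTENCE_REGEX = /[^\.!?\s][^\.!?]*(?:[\.!?](?!['"]?\s|$)[^\.!?]*)*[\.!?]?['"]?(?=\s|$)/

text = "Mr. Smith is a doctor"

# scan returns every non-overlapping match, so the period after "Mr"
# is treated as the end of a sentence.
sentences = text.scan(SENTENCE_REGEX)

p sentences       # ["Mr.", "Smith is a doctor"]
p sentences.size  # 2, although the text contains only one sentence
```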
Maybe there is a way to improve it and exclude some common titles (such as "Mr" or "Dr") from this regular expression?
I am not very good at using the scan method, but if we use split instead we can probably use an expression similar to this:
(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|!|(\."))\s
which is also far from perfect, because it will catch "Mr." and "Dr." but not "Mrs." (still better than nothing ☺).
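One possible way to cover "Mrs." as well might be to list the titles explicitly and build one negative lookbehind per title. This is only a rough sketch: the titles list, the splitter name, and the sample text are my own assumptions, not anything from the gem, and abbreviations outside the list would still be split incorrectly.

```ruby
# Hypothetical list of titles that should not end a sentence.
titles = %w[Mr Mrs Ms Dr Prof]

# One fixed-width negative lookbehind per title, e.g. (?<!Mrs\.).
guards = titles.map { |t| "(?<!#{Regexp.escape(t)}\\.)" }.join

# Split on whitespace that follows ., ! or ? unless a listed title precedes it.
splitter = /#{guards}(?<=[.!?])\s+/

text = "Mr. Smith is a doctor. Mrs. Smith is a lawyer! Are they happy?"
sentences = text.split(splitter)

p sentences
# ["Mr. Smith is a doctor.", "Mrs. Smith is a lawyer!", "Are they happy?"]
p sentences.size  # 3
```

Using a separate lookbehind for each title keeps every lookbehind fixed-width, so titles of different lengths can be mixed in the same list.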