SentenceTokenizer incorrectly processes punctuation marks within words
Explanation
The SentenceTokenizer currently in use produces wrong results when a punctuation mark such as "!", "?", or "." is used within a word (e.g., in a company name).
Examples
Example 1
The following text (see https://github.com/Yoast/wordpress-seo/issues/13726)
The free App FRITZ!App WLAN helps to find the ideal locations when setting up a repeater.
gets incorrectly parsed into the following sentences
0: "The free App FRITZ!"
1: "App WLAN helps to find the ideal locations when setting up a repeater."
Example 2
The same text as in Example 1 but with a "." instead of the "!"
The free App FRITZ.App WLAN helps to find the ideal locations when setting up a repeater.
gets correctly parsed into one sentence
0: "The free App FRITZ.App WLAN helps to find the ideal locations when setting up a repeater."
Example 3
The same text as in Example 2 but with the entire word FRITZ.APP capitalized
The free App FRITZ.APP WLAN helps to find the ideal locations when setting up a repeater.
gets incorrectly parsed into the following sentences
0: "The free App FRITZ."
1: "APP WLAN helps to find the ideal locations when setting up a repeater."
Why does it happen?
The problem in Example 1 occurs because the SentenceTokenizer splits text on "!", "?", ";" and "..." without checking whether the cut-off part begins as a proper sentence should (e.g., with a space and a capital letter). Here is the rule where this check should take place.
Note that such a check is implemented for the situation when the text is split on a ".". Specifically, the rule checks whether the second character of the cut-off remainder is a capital letter, a number, etc.
However, the SentenceTokenizer does not check that the first character of the remainder is a space, which is one reason why the problem in Example 3 occurs.
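To make the missing check concrete, here is a minimal sketch of a splitter that only breaks on "!", "?" or "." when the remainder starts with whitespace followed by a capital letter or a digit. It is not the actual SentenceTokenizer code: the names looksLikeSentenceStart and splitSentences are hypothetical, and the sketch ignores ";", HTML tags, quotation marks and abbreviations, all of which a real fix would have to handle.

```javascript
// Hypothetical helper, not part of the existing tokenizer.
// Returns true when the text right after a delimiter looks like the start
// of a new sentence: whitespace, then an uppercase letter or a digit.
function looksLikeSentenceStart(remainder) {
  return /^\s+[\p{Lu}\p{N}]/u.test(remainder);
}

// Hypothetical, simplified splitter used only to illustrate the check.
function splitSentences(text) {
  const sentences = [];
  let start = 0;

  for (let i = 0; i < text.length; i++) {
    // Only "!", "?" and "." are treated as candidate sentence delimiters.
    if (!".!?".includes(text[i])) {
      continue;
    }

    const remainder = text.slice(i + 1);

    // Split at the end of the text, or when the remainder passes the check.
    if (remainder.length === 0 || looksLikeSentenceStart(remainder)) {
      sentences.push(text.slice(start, i + 1).trim());
      start = i + 1;
    }
  }

  // Keep any trailing text that has no closing delimiter.
  const tail = text.slice(start).trim();
  if (tail.length > 0) {
    sentences.push(tail);
  }

  return sentences;
}

const example1 = "The free App FRITZ!App WLAN helps to find the ideal locations when setting up a repeater.";
const example3 = "The free App FRITZ.APP WLAN helps to find the ideal locations when setting up a repeater.";

console.log(splitSentences(example1).length); // 1
console.log(splitSentences(example3).length); // 1
console.log(splitSentences("Hello world! How are you? Fine."));
// [ 'Hello world!', 'How are you?', 'Fine.' ]
```

Under these assumptions, requiring whitespace before the capital letter keeps both FRITZ!App and FRITZ.APP intact, while ordinary sentence boundaries are still detected.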
Things to consider
- A fix for both problems seems fairly straightforward to implement.
- A few users complained about these issues.
- The SentenceTokenizer will not be used in its current form once the tree-based text parser is implemented, because it relies on HTML tags.
- We will still need some variant of a sentence tokenizer to work with sentences in researches, so the work on fixing the current sentence tokenizer will not necessarily be lost.