SentenceTokenizer incorrectly processes punctuation marks within words

Open nataliashitova opened this issue 5 years ago • 0 comments

Explanation

The currently used SentenceTokenizer generates wrong results when a punctuation mark such as !, ? or . is used within a word (e.g., in a company name).

Examples

Example 1

The following text (see https://github.com/Yoast/wordpress-seo/issues/13726)

The free App FRITZ!App WLAN helps to find the ideal locations when setting up a repeater.

gets incorrectly parsed into the following sentences

0: "The free App FRITZ!"
1: "App WLAN helps to find the ideal locations when setting up a repeater."

Example 2

The same text as in Example 1 but with a . instead of the !

The free App FRITZ.App WLAN helps to find the ideal locations when setting up a repeater.

gets correctly parsed into one sentence

0: "The free App FRITZ.App WLAN helps to find the ideal locations when setting up a repeater."

Example 3

The same text as in Example 2 but with the entire word FRITZ.APP capitalized

The free App FRITZ.APP WLAN helps to find the ideal locations when setting up a repeater.

gets incorrectly parsed into the following sentences

0: "The free App FRITZ."
1: "APP WLAN helps to find the ideal locations when setting up a repeater."

Why does it happen?

The problem in Example 1 occurs because the SentenceTokenizer splits text on !, ?, ; and ... without checking if the cut-off part begins as a proper sentence should (e.g., with a space and a capital letter). Here is the rule where this check should take place.
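
To make the behaviour in Example 1 concrete, here is a minimal sketch that reproduces it. This is not the actual SentenceTokenizer code: `naiveSplit` is a hypothetical, heavily simplified function that only assumes what is described above, namely that the text is cut after !, ? or ; without any look at what follows.

```javascript
// Hypothetical, simplified splitter. NOT the actual Yoast SentenceTokenizer.
// It cuts after every "!", "?" or ";" without inspecting the remainder.
function naiveSplit( text ) {
  const sentences = [];
  let current = "";
  for ( const char of text ) {
    current += char;
    if ( "!?;".includes( char ) ) {
      sentences.push( current.trim() );
      current = "";
    }
  }
  // Keep whatever is left after the last delimiter.
  if ( current.trim() !== "" ) {
    sentences.push( current.trim() );
  }
  return sentences;
}

const example1 = "The free App FRITZ!App WLAN helps to find the ideal locations when setting up a repeater.";
console.log( naiveSplit( example1 ) );
// Logs two sentences: "The free App FRITZ!" and "App WLAN helps to find ... a repeater."
```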

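Note that such a check is implemented for the situation when the text is split on a period (.). Specifically, the rule checks if the second character of the cut-off remainder text is a capital letter, or a number, etc. However, the SentenceTokenizer does not check that the first character of the remainder text is a space, which is one reason why the problem in Example 3 occurs. This would also explain why Example 2 is parsed correctly: the second character of the remainder "App WLAN ..." is a lowercase p, whereas in Example 3 the second character of "APP WLAN ..." is a capital P.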
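
A possible shape of the missing check, as a rough sketch only. The names `looksLikeSentenceStart` and `splitSentences` are hypothetical and do not exist in the codebase; the sketch assumes that a valid sentence beginning is whitespace followed by a capital letter or a digit, and it ignores the other cases covered by "etc." above (quotes, brackets, abbreviations, HTML tags).

```javascript
// Hypothetical helper, not part of the Yoast codebase.
// Returns true when the text right after a delimiter looks like a new sentence:
// at least one whitespace character first, then a capital letter or a digit.
function looksLikeSentenceStart( remainder ) {
  return /^\s+[\p{Lu}\p{N}]/u.test( remainder );
}

// Simplified splitter that applies the same check to ".", "!", "?" and ";".
function splitSentences( text ) {
  const sentences = [];
  let start = 0;
  for ( let i = 0; i < text.length; i++ ) {
    const isDelimiter = ".!?;".includes( text[ i ] );
    const remainder = text.slice( i + 1 );
    // Split only at the end of the text or before a plausible sentence beginning.
    if ( isDelimiter && ( remainder === "" || looksLikeSentenceStart( remainder ) ) ) {
      sentences.push( text.slice( start, i + 1 ).trim() );
      start = i + 1;
    }
  }
  // Keep trailing text that does not end in a delimiter.
  if ( text.slice( start ).trim() !== "" ) {
    sentences.push( text.slice( start ).trim() );
  }
  return sentences;
}

const tail = " WLAN helps to find the ideal locations when setting up a repeater.";
console.log( splitSentences( "The free App FRITZ!App" + tail ).length ); // Example 1
console.log( splitSentences( "The free App FRITZ.APP" + tail ).length ); // Example 3
```

Under these assumptions, the texts from Example 1 and Example 3 each stay a single sentence, while a text such as "Hello! World." is still split in two. Requiring the whitespace first is the design choice that matters here: it makes the capital-letter check look at the right character.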

Things to consider

A fix for both problems seems to be pretty straightforward to implement.
A few users have complained about these issues.
The currently used SentenceTokenizer will not be used in its current form once the tree-based text parser is implemented, because the tokenizer relies on HTML tags.
We will still need a variant of a sentence tokenizer to be able to work with sentences in researches. Therefore, the work on implementing fixes to the current sentence tokenizer will not necessarily be lost.

Nov 19 '19 14:11 nataliashitova
