Fails to correctly split sentences
As you can see, " Well, maybe cavemen who lived in fear of everything didnt get bored." is supposed to be a separate sentence, but it is not split off.
It turns out that when I replace the character "n" in the word "man" (the last word in the first blue sentence) with any character other than "r", "y", "j" or "n", the text is split correctly.
This is quite strange behaviour.
It seems to me that abbreviations are interfering with the core splitting logic.
Good catch!
"man" is listed as an abbreviation in the English abbreviations list: https://github.com/wikimedia/sentencex/blob/master/src/languages/abbrev/en.txt#L107 I am considering re-reviewing the list and removing abbreviations that are also full words, like 'wash', 'man', 'mass'.
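To make the failure mode concrete, here is a minimal sketch of a splitter that skips a sentence boundary whenever the word before the period is in an abbreviation list. This is not the sentencex implementation: `naive_split`, the small `ABBREVIATIONS` set and the sample sentences are all hypothetical and only illustrate the general technique.

```python
import re

# Hypothetical, tiny abbreviation set; the real list is much longer.
ABBREVIATIONS = {"mr", "dr", "man", "wash", "mass"}


def naive_split(text, abbreviations=ABBREVIATIONS):
    """Split on '.', '!' or '?' followed by whitespace, unless the word
    right before a '.' is a known abbreviation."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]\s+", text):
        # Last word of the candidate sentence, lowercased for lookup.
        words = text[start:match.start()].split()
        last_word = words[-1].lower() if words else ""
        if text[match.start()] == "." and last_word in abbreviations:
            # Treated as an abbreviation: do not end the sentence here.
            continue
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


# Hypothetical input: a sentence that ends with the full word "man".
text = "He was a brave man. Well, maybe cavemen didnt get bored."
print(naive_split(text))                 # "man." is taken as an abbreviation
print(naive_split(text, {"mr", "dr"}))   # without "man" in the list
```

In this sketch, the first call keeps the text as one sentence because "man" is in the list, and the second call splits it in two. The same reduced list would also split after "Mass." in a sentence like "He lives in Boston, Mass. and works there.", which is the flip side of removing such entries.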
@santhoshtr I think this problem is more fundamental than simply removing conflicting abbreviations.
- What if this issue occurs in other languages too?
- Won't removing the conflicting abbreviations reduce the accuracy of splits?
If a word is both a valid full word and an abbreviation, we will need a completely different strategy to classify each occurrence as an abbreviation or a word. That classification is semantic and can't be done by string pattern matching alone, I am afraid. Because of that, we have to consider it a limitation of this library. Semantic disambiguation usually requires ML techniques, with the associated computing and performance costs. In my opinion that is a trade-off that a consumer of the library should be aware of. We can document this limitation too. That is exactly what I meant when I wrote "focus on speed and utility" rather than 100% semantic correctness.
Do you think there are alternative solutions to this problem? Please let me know. Thanks.