sentencex fails to split sentences correctly

As you can see, " Well, maybe cavemen who lived in fear of everything didnt get bored."

is supposed to be a separate sentence.

Nov 14 '25 20:11 LordKayBanks

It turns out that when I replace the character "n" in the word "man" (the last word in the first blue sentence) with any character other than "r", "y", "j", or "n", the text splits correctly.

This is quite strange behaviour.

It seems to me that abbreviations are interfering with the core splitting logic.

Nov 14 '25 20:11 LordKayBanks

Good catch!

Nov 15 '25 04:11 santhoshtr

"man" is listed as an abbreviation in the English abbreviations list: https://github.com/wikimedia/sentencex/blob/master/src/languages/abbrev/en.txt#L107 I am considering re-reviewing the list and removing abbreviations that are also full words, such as 'wash', 'man', and 'mass'.
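To illustrate the effect, here is a rough sketch of how an abbreviation lookup can suppress a sentence boundary. This is not the sentencex implementation: `naive_split`, the three-entry `ABBREVIATIONS` set, and the first sentence of the sample text are all made up for illustration.

```python
import re

# Hypothetical mini abbreviation list; the real one lives in en.txt.
ABBREVIATIONS = {"man", "wash", "mass"}


def naive_split(text, abbreviations=ABBREVIATIONS):
    """Split after '.', '!' or '?' followed by whitespace, unless the
    word before a '.' is in the abbreviation list."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]\s+", text):
        end = match.start() + 1  # include the terminator itself
        words = text[start:end - 1].split()
        last_word = words[-1].lower() if words else ""
        if text[match.start()] == "." and last_word in abbreviations:
            continue  # treated as an abbreviation, so no boundary here
        sentences.append(text[start:end].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


# The first sentence is invented; the second is the one from this report.
text = ("He was a brave man. Well, maybe cavemen who lived in fear "
        "of everything didnt get bored.")

print(len(naive_split(text)))                        # "man." is not split
print(len(naive_split(text.replace("man.", "mat."))))  # "mat." is split
```

With "man" in the list the whole text comes back as one sentence; changing the word to one that is not listed gives two, which matches the behaviour reported above.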

Nov 15 '25 06:11 santhoshtr

@santhoshtr I would like to think this problem is more fundamental than simply removing conflicting abbreviations.

What if this issue occurs in other languages too?
Won't removing the conflicting abbreviations reduce the accuracy of splits?

Nov 20 '25 06:11 LordKayBanks

If a word is both a valid full word and an abbreviation, we will need a completely different strategy to classify each occurrence as an abbreviation or a word. That classification is semantic and can't be done by string pattern matching alone, I am afraid. Because of that, we have to consider it a limitation of this library. Semantic disambiguation usually requires ML techniques, with the associated computing and performance costs. In my opinion that is a trade-off that a consumer of the library should be aware of. We can document this limitation too. However, that is exactly what I meant when I wrote "focus on speed and utility" rather than 100% semantic correctness.
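A small sketch of why a plain lookup cannot win here. This is a toy, not the library's code: `is_boundary` and `split` are hypothetical helpers, both sample sentences are invented, and reading "Man." as an abbreviation of Manitoba is my assumption for the example.

```python
def is_boundary(token, abbreviations):
    """Context-free rule: a trailing '.' ends a sentence unless the
    word in front of it is a listed abbreviation."""
    return token.endswith(".") and token[:-1].lower() not in abbreviations


def split(text, abbreviations):
    """Split whitespace-separated tokens into sentences with is_boundary."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if is_boundary(token, abbreviations):
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


word_case = "He was a brave man. Nobody doubted it."  # should be 2 sentences
abbrev_case = "She moved to Brandon, Man. in 1990."   # should be 1 sentence

for abbreviations in ({"man"}, set()):
    print(sorted(abbreviations),
          len(split(word_case, abbreviations)),
          len(split(abbrev_case, abbreviations)))
```

With "man" in the list the first text is wrongly merged into one sentence, and without it the second text is wrongly cut in two. The rule only ever sees the string "man.", so one of the two cases has to lose.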

Do you think there are alternative solutions to this problem? Please let me know. Thanks

Nov 20 '25 07:11 santhoshtr