sentencex

Fails to correctly split sentences

Open LordKayBanks opened this issue 1 month ago • 5 comments

Image

As you can see, "Well, maybe cavemen who lived in fear of everything didnt get bored." is supposed to be a separate sentence.

LordKayBanks avatar Nov 14 '25 20:11 LordKayBanks

It turns out that when I replace the character "n" in the word "man" (the last word of the first blue sentence) with any character other than "r", "y", "j", or "n", the text is split correctly.

This is quite strange behaviour.

It seems to me that abbreviations are interfering with the core splitting logic.

LordKayBanks avatar Nov 14 '25 20:11 LordKayBanks

Good catch!

Image

santhoshtr avatar Nov 15 '25 04:11 santhoshtr

"man" is listed as an abbreviation in the English abbreviations list: https://github.com/wikimedia/sentencex/blob/master/src/languages/abbrev/en.txt#L107 I am considering re-reviewing the list and removing abbreviations that are also full words, like 'wash', 'man', and 'mass'.
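To illustrate the mechanism, here is a toy sketch of how an abbreviation list can suppress a sentence boundary. It is not sentencex's actual code: `naive_split`, the `ABBREVIATIONS` set, and the first sentence of the sample text are all hypothetical, chosen only to mirror the report above.

```python
import re

# Hypothetical abbreviation list; "man" mirrors the entry discussed above.
ABBREVIATIONS = {"dr", "mr", "man"}


def naive_split(text, abbreviations=ABBREVIATIONS):
    """Split after '.', '!' or '?' followed by whitespace, unless the word
    before a '.' appears in the abbreviation list."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]\s+", text):
        words = text[start:match.start()].split()
        last_word = words[-1].lower() if words else ""
        if match.group()[0] == "." and last_word in abbreviations:
            continue  # looks like an abbreviation, so no boundary here
        sentences.append(text[start:match.start() + 1])
        start = match.end()
    if start < len(text):
        sentences.append(text[start:])  # whatever remains is the last sentence
    return sentences


# Made-up first sentence ending in "man"; the second is from the report.
text = ("He was a brave man. Well, maybe cavemen who lived in fear "
        "of everything didnt get bored.")

print(len(naive_split(text)))                           # "man." is not split
print(len(naive_split(text, ABBREVIATIONS - {"man"})))  # split after "man."
```

In this sketch, dropping "man" from the set restores the split, while entries such as "dr" keep protecting "Dr. Smith". The cost is that a genuine use of "man." as an abbreviation would then be split.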

santhoshtr avatar Nov 15 '25 06:11 santhoshtr

@santhoshtr I think this problem is more fundamental than simply removing conflicting abbreviations.

  1. What if this issue occurs in other languages too?
  2. Won't removing the conflicting abbreviations reduce the accuracy of splits?

LordKayBanks avatar Nov 20 '25 06:11 LordKayBanks

If a word is both a valid full word and an abbreviation, we will need a completely different strategy to classify each occurrence as an abbreviation or a word. That classification is semantic and can't be done by string pattern matching alone, I am afraid. Because of that, we have to treat it as a limitation of this library. Semantic disambiguation usually requires ML techniques, with the associated computing and performance costs. In my opinion that is a trade-off that a consumer of the library should be aware of. We can document this limitation too. That is exactly what I meant when I wrote "focus on speed and utility" rather than 100% semantic correctness.

Do you think there are alternative solutions to this problem? Please let me know. Thanks

santhoshtr avatar Nov 20 '25 07:11 santhoshtr