LibLangly Word Boundary detection

Word Boundary detection

Open Entomy opened this issue 4 years ago • 2 comments

Methods like Words() are supposed to split text into words, but they don't. They split on spaces, which aren't the only kind of word boundary. Words() should also be removing non-word components, and it isn't.

Doing this requires a proper implementation of word boundary detection. UAX #29, section 4 (Word Boundaries) describes this.

Mar 29 '20 13:03 Entomy

this and this describe an issue with ZWSP, along with the debate around it. I've settled on a solution that keeps its Cf classification rather than Zs, while also ensuring it is detected as a word boundary. So ZWSP (U+200B) absolutely must be recognized that way.
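
To make the intended behaviour concrete, here is a rough Python sketch. It is not LibLangly code and not a full UAX #29 implementation; the `words` function and its rules are assumptions made up for illustration. It treats letters, marks, and numbers as word content, breaks on everything else, handles ZWSP (U+200B) explicitly as a boundary even though its general category is Cf, and skips other Cf characters without breaking the word.

```python
import unicodedata

ZWSP = "\u200b"  # ZERO WIDTH SPACE: general category Cf, but must act as a boundary


def words(text):
    """Split text into words (hypothetical helper, simplified rules).

    Assumptions for this sketch only:
      * categories L*, M*, N* are word content
      * ZWSP always ends the current word
      * any other Cf character is skipped without ending the word
      * everything else (spaces, punctuation, symbols) ends the word
        and is dropped from the output
    """
    result = []
    current = []
    for ch in text:
        category = unicodedata.category(ch)
        if ch == ZWSP:
            is_boundary = True
        elif category == "Cf":
            # Other format characters (e.g. soft hyphen) neither break nor appear.
            continue
        else:
            is_boundary = category[0] not in ("L", "M", "N")

        if not is_boundary:
            current.append(ch)
        elif current:
            result.append("".join(current))
            current = []
    if current:
        result.append("".join(current))
    return result


if __name__ == "__main__":
    print(words("hello, world"))
    print(words("foo\u200bbar"))
```

One known gap: this sketch splits contractions such as "don't" at the apostrophe, which the full word boundary rules in UAX #29 would keep together. A real implementation would need the Word_Break property data rather than general categories alone.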

Apr 03 '20 11:04 Entomy

Apologies for the transfer spam. This definitely belongs here now.

Sep 14 '20 16:09 Entomy
