Tokenizer treats an alphabetic character as a word-delimiter

Open martindholmes opened this issue 10 months ago • 6 comments

The codepoint U+A78F:

https://util.unicode.org/UnicodeJsps/character.jsp?a=A78F

(LATIN LETTER SINOLOGICAL DOT) is in the Latin Extended-D block, is Alphabetic, and is in the Other_Letter (Lo) category; Wikipedia explains "A middot may be used as a consonant or modifier letter, rather than as punctuation, in transcription systems and in language orthographies. For such uses Unicode provides the code point U+A78F ꞏ LATIN LETTER SINOLOGICAL DOT."

It's being proposed for use in this way (as a consonant to signal length) in Wendat orthography. However, our tokenizer currently treats it as a word-break character; I think this is a bug. It could be a bug in the regex in the tokenizer, or in the Java Unicode regex handling; the character is new enough in Unicode (2015) that the problem could just be that the code hasn't caught up. If so, I think we should special-case it.

martindholmes avatar May 01 '24 03:05 martindholmes

This seems to be a bug in Saxon or Java, because both of these test false:

matches('ꞏ', '\p{L}')
matches('ꞏ', '\p{L}')

I think the best thing to do for now is to add this character explicitly to the regex for alphanumerics.
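For anyone following along, here is a minimal sketch of that kind of special-casing, written against `java.util.regex` rather than Saxon's own XPath regex engine (which carries its own Unicode tables). The class name, pattern names, and the sample string are all hypothetical, and the character classes are not the actual staticSearch regex, which is not shown in this thread. It only illustrates the idea: list U+A78F explicitly next to `\p{L}` and `\p{N}` so that tokenization no longer depends on how recent the engine's Unicode data is.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SinologicalDotTokenizer {

    // U+A78F LATIN LETTER SINOLOGICAL DOT, kept as an escape so the source stays ASCII.
    static final String SINOLOGICAL_DOT = "\uA78F";

    // Hypothetical word pattern: letters and digits only.
    // Whether U+A78F is included depends on the engine's Unicode tables.
    static final Pattern WORD_BY_PROPERTY = Pattern.compile("[\\p{L}\\p{N}]+");

    // Same class with U+A78F added explicitly (the "hack").
    // This no longer depends on the engine's Unicode tables.
    static final Pattern WORD_WITH_HACK = Pattern.compile("[\\p{L}\\p{N}\\uA78F]+");

    // Collect every maximal run of word characters as one token.
    static List<String> tokenize(Pattern word, String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = word.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Probe: does this engine already classify U+A78F as a letter?
    // If it returns true, the explicit special case is redundant.
    static boolean engineKnowsSinologicalDot() {
        return Pattern.matches("\\p{L}", SINOLOGICAL_DOT);
    }

    public static void main(String[] args) {
        // Placeholder string, not a real Wendat word.
        String sample = "a" + SINOLOGICAL_DOT + "b cd";
        System.out.println("engine knows U+A78F: " + engineKnowsSinologicalDot());
        System.out.println("by property: " + tokenize(WORD_BY_PROPERTY, sample));
        System.out.println("with hack:   " + tokenize(WORD_WITH_HACK, sample));
    }
}
```

With the explicit character in the class, the placeholder string always yields two tokens (the dotted word stays whole). The probe and the property-only pattern give different results depending on the Unicode version behind the regex engine, so a check like `engineKnowsSinologicalDot()` is one way to find out when the special case can be dropped.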

martindholmes avatar May 06 '24 15:05 martindholmes

Fix and test for it committed in branch iss-300-sindot. PR #301 created.

martindholmes avatar May 06 '24 16:05 martindholmes

Martin Honnen pointed me at the Saxon documentation which says that it's still using Unicode 6 tables:

https://www.saxonica.com/html/documentation12/conformance/xpath31.html

So that would explain it, if the documentation is up to date.

martindholmes avatar May 06 '24 19:05 martindholmes

I think this issue is complete, but only through the ad-hoc hack of adding the specific character concerned into the regex. Somehow or other, we should keep this around to remind ourselves that when Saxon 12.5 comes out, we need to move to it, and remove the hack.

martindholmes avatar May 21 '24 22:05 martindholmes

Note: Saxon 12.5 was released in July, so I'll add a ticket for upgrading to it, and link it to this ticket. If the upgrade goes smoothly we should be able to test the removal of this hack.

martindholmes avatar Sep 05 '24 22:09 martindholmes

Saxon 12.5 now merged, so this can be tested and the hack removed if no longer required.

martindholmes avatar Sep 23 '24 23:09 martindholmes

Tested this today, and unfortunately the hack is still required. I'm waiting to find out if there is an ETA for a release of Saxon with updated Unicode data.

martindholmes avatar Oct 07 '24 22:10 martindholmes

The answer is possibly 12.6, but definitely by 13.0. When that happens, we can test and update.

martindholmes avatar Oct 23 '24 03:10 martindholmes