icu4x
Last segment of Thai script is always marked as not word-like
The last segment of the following strings is always marked as not word-like:
- ขนบนอก
- พนักงานนําโคลงเรือสามตัว
- หมอหุงขาวสวยด
- หนังสือรวมบทความทางวิชาการในการประชุมสัมมนา
ICU4C, by contrast, marks the last segment of all four strings as word-like.
CC: @aethanyc and @makotokato
Hi there, do I need to be assigned to the issue, or can I start working on this?
Consider the last part of the Thai script as a separate character. Check defined rules for what makes something "word-like." Ask the community for feedback and test the solution in a controlled environment for confirmation.
The bug is likely in the interface between the (rule-based) break iterator and the LSTM.
I think anyone can open a pull request to add a test case and fix the bug.
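A test case along those lines might look roughly like the sketch below. It is only a sketch: the segmenter is passed in as a closure returning `(end_byte_offset, is_word_like)` pairs, which is an assumed shape for illustration and not ICU4X's actual API. The names `failing_cases`, `stub_ok`, and `stub_buggy` are hypothetical, and the stubs do not reflect real segmentation output. In a real pull request the closure would presumably wrap the ICU4X word segmenter and its word-like flag.

```rust
/// The four strings from the report, copied verbatim.
const CASES: [&str; 4] = [
    "ขนบนอก",
    "พนักงานนําโคลงเรือสามตัว",
    "หมอหุงขาวสวยด",
    "หนังสือรวมบทความทางวิชาการในการประชุมสัมมนา",
];

/// Returns every input whose final segment is missing, does not end at the
/// end of the string, or is not flagged as word-like.
///
/// `segment` is an assumed adapter: for each segment, in order, it yields
/// the segment's end byte offset and whether it was marked word-like.
fn failing_cases<'a, F>(cases: &[&'a str], segment: F) -> Vec<&'a str>
where
    F: Fn(&str) -> Vec<(usize, bool)>,
{
    let mut failing = Vec::new();
    for &s in cases {
        let segs = segment(s);
        // The last segment must reach the end of the input and be word-like.
        let ok = matches!(segs.last(), Some(&(end, true)) if end == s.len());
        if !ok {
            failing.push(s);
        }
    }
    failing
}

/// Hypothetical stub: one word-like segment spanning the whole input.
/// This is NOT real segmenter output; it only exercises the check.
fn stub_ok(s: &str) -> Vec<(usize, bool)> {
    vec![(s.len(), true)]
}

/// Hypothetical stub mimicking the reported symptom: the last segment is
/// flagged as not word-like.
fn stub_buggy(s: &str) -> Vec<(usize, bool)> {
    vec![(s.len(), false)]
}
```

Calling `failing_cases(&CASES, stub_buggy)` returns all four strings, while `failing_cases(&CASES, stub_ok)` returns an empty list. Keeping the segmenter behind a closure means the same check can be run against different segmenter configurations (for example, LSTM-based versus dictionary-based) without changing the test itself.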