Last segment of Thai script is always marked as not word-like

Open anba opened this issue 1 year ago • 3 comments

The last segment of the following strings is always marked as not word-like:

  • ขนบนอก
  • พนักงานนําโคลงเรือสามตัว
  • หมอหุงขาวสวยด
  • หนังสือรวมบทความทางวิชาการในการประชุมสัมมนา

ICU4C, by contrast, marks the last segment of all four strings as word-like.

CC: @aethanyc and @makotokato

anba, Dec 12 '23 13:12

Hi there, do I need to be assigned the issue or can I start working on this?

Harsh1s, Mar 01 '24 14:03

Consider the last part of the Thai script as a separate character. Check defined rules for what makes something "word-like." Ask the community for feedback and test the solution in a controlled environment for confirmation.

hiralkhatik123, Mar 20 '24 13:03

The bug is likely in the interface between the (rule-based) break iterator and the LSTM.

I think anyone can open a pull request to add a test case and fix the bug.
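
For anyone picking this up, a regression test could take roughly the following shape. This is only a std-only sketch: the `Boundary` pair and the `last_segment_is_word_like` helper are hypothetical names, not part of the ICU4X API. In a real test the boundaries would come from the word segmenter's iterator rather than being passed in by hand.

```rust
/// Hypothetical representation of one boundary reported by a word segmenter:
/// the byte offset of the boundary, and whether the segment that ends at this
/// boundary is word-like. This is NOT an ICU4X type.
type Boundary = (usize, bool);

/// Returns the word-like flag of the final segment of `text`.
///
/// Returns `None` when the last reported boundary does not sit at the end of
/// the text, since in that case there is no final segment to inspect.
fn last_segment_is_word_like(text: &str, boundaries: &[Boundary]) -> Option<bool> {
    match boundaries.last() {
        Some(&(offset, word_like)) if offset == text.len() => Some(word_like),
        _ => None,
    }
}
```

A test would then assert that this returns `Some(true)` for each of the four strings in the report, which is the behaviour ICU4C is described as having.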

sffc, Mar 26 '24 01:03