Fix line breaking for Khmer text (issue #7218)

Open kenton-r opened this issue 1 month ago • 2 comments

Problem

Spaces in Khmer text create isolated segments with breaks before AND after each space:

  • Input: "អស់ នឹង មាន"
  • Output: ['អស់', ' ', 'នឹង', ' ', 'មាន'] ❌ (isolated spaces)

Solution Overview

Fix two issues that prevented proper space handling:

  1. language.rs: Spaces were split into separate chunks → complex segmenter never saw them
  2. line.rs: Complex segmenter wasn't triggered for SA×SPACE×SA sequences

Changes

components/segmenter/src/complex/language.rs

What changed: Don't split text on whitespace characters

Lines ~62 & ~105: Modified both UTF-8 and UTF-16 iterators to skip whitespace when checking for language changes

Effect: Khmer phrases with spaces stay together as one chunk: "អស់ នឹង" instead of "អស់", " ", "នឹង"
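The chunking change can be illustrated with a minimal, self-contained sketch. This is not the code in language.rs: `is_khmer` and `chunks` are hypothetical names, the language check is reduced to a single Khmer block range, and the `skip_whitespace` flag exists only to contrast the old and new behavior.

```rust
/// Hypothetical stand-in for the real per-language check: true for code
/// points in the Khmer block (U+1780..=U+17FF).
fn is_khmer(c: char) -> bool {
    ('\u{1780}'..='\u{17FF}').contains(&c)
}

/// Splits `text` into chunks, ending a chunk whenever the Khmer/non-Khmer
/// classification changes. When `skip_whitespace` is true, whitespace is
/// ignored by the check and therefore stays inside the current chunk.
fn chunks(text: &str, skip_whitespace: bool) -> Vec<&str> {
    let mut out = Vec::new();
    let mut start = 0;
    let mut current: Option<bool> = None;
    for (i, c) in text.char_indices() {
        if skip_whitespace && c.is_whitespace() {
            continue; // whitespace never counts as a language change
        }
        let class = is_khmer(c);
        match current {
            Some(prev) if prev != class => {
                // Classification changed: close the chunk that ends here.
                out.push(&text[start..i]);
                start = i;
                current = Some(class);
            }
            None => current = Some(class),
            _ => {}
        }
    }
    if start < text.len() {
        out.push(&text[start..]);
    }
    out
}
```

With `skip_whitespace` set to false, "អស់ នឹង" comes out as three chunks ("អស់", " ", "នឹង"); with it set to true, the phrase stays together as one chunk.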


components/segmenter/src/line.rs

What changed: Handle SA×SPACE×SA (complex script + space + complex script) sequences

4 changes:

  1. ~Line 1070: Add peek_past_spaces_for_sa() helper
    • Looks ahead past consecutive spaces to check if SA continues

  2. ~Line 880: Extend complex breaking trigger
    • Changed from: only trigger for SA × SA
    • Changed to: trigger for SA × SA OR SA × SPACE × SA

  3. ~Line 908: Suppress UAX#14 breaks
    • Don't break at SA × SP if SA continues after space(s)

  4. ~Lines 1165 & 1198: Include spaces in text collection
    • Complex segmenter sees full phrases with spaces: "អស់ នឹង"

Effect: Complex segmenter (LSTM/dictionary) handles the entire SA×SPACE×SA sequence intelligently
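The lookahead and the extended trigger can be sketched as follows. Only the helper name `peek_past_spaces_for_sa` comes from this PR; its signature here, the `triggers_complex_breaking` function, and the `is_sa` approximation (whole Thai, Lao, Myanmar, and Khmer blocks rather than the actual UAX #14 property data) are assumptions made for illustration.

```rust
/// Rough stand-in for the UAX #14 SA (complex context) class. Assumption:
/// the whole Thai, Lao, Myanmar, and Khmer blocks are treated as SA, which
/// is coarser than the real property data.
fn is_sa(c: char) -> bool {
    matches!(c,
        '\u{0E00}'..='\u{0E7F}'   // Thai
        | '\u{0E80}'..='\u{0EFF}' // Lao
        | '\u{1000}'..='\u{109F}' // Myanmar
        | '\u{1780}'..='\u{17FF}' // Khmer
    )
}

/// Skips consecutive U+0020 spaces at the start of `rest` and reports
/// whether the first non-space character is SA.
fn peek_past_spaces_for_sa(rest: &str) -> bool {
    rest.chars().find(|&c| c != ' ').map_or(false, is_sa)
}

/// Decides whether the complex segmenter should take over, given the
/// character to the left of the current position and the remaining text.
fn triggers_complex_breaking(left: char, rest: &str) -> bool {
    if !is_sa(left) {
        return false;
    }
    match rest.chars().next() {
        Some(c) if is_sa(c) => true,                // SA × SA
        Some(' ') => peek_past_spaces_for_sa(rest), // SA × SPACE × SA
        _ => false,
    }
}
```

The same lookahead result would also serve change 3: when it returns true at an SA × SP position, the regular UAX#14 break is not emitted and the decision is left to the complex segmenter.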


Result

  • Before: [0, 9, 10, 19, 20, ...] (double breaks)
  • After: [0, 9, 19, 29, ...] (single breaks)
  • Spaces properly included with words: ['អស់', ' នឹង', ' មាន']
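To make the offsets concrete, the sketch below slices the sample string from the Problem section at each list of break positions. It assumes the offsets are UTF-8 byte offsets, which fits the numbers: every Khmer character in the sample occupies three bytes, so each three-character word is 9 bytes long. The `segments` helper and the `SAMPLE` constant are hypothetical names, and the final offset 29 in the "before" list is filled in from the byte length of the sample.

```rust
/// "អស់ នឹង មាន", written with escapes so the code points are unambiguous.
const SAMPLE: &str =
    "\u{17A2}\u{179F}\u{17CB} \u{1793}\u{17B9}\u{1784} \u{1798}\u{17B6}\u{1793}";

/// Turns a sorted list of break offsets (in bytes) into the text segments
/// between each pair of neighbouring offsets.
fn segments<'a>(text: &'a str, breaks: &[usize]) -> Vec<&'a str> {
    breaks.windows(2).map(|w| &text[w[0]..w[1]]).collect()
}
```

Calling `segments(SAMPLE, &[0, 9, 10, 19, 20, 29])` yields five pieces with each space isolated, while `segments(SAMPLE, &[0, 9, 19, 29])` yields three pieces in which each space stays attached to the word that follows it.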

Impact

Fixes line breaking for Khmer, and possibly also for Thai, Lao, and Myanmar scripts. Matches ICU4C behavior.

kenton-r, Nov 09 '25 07:11

CLA assistant check
All committers have signed the CLA.

CLAassistant, Nov 09 '25 07:11

Thanks for the contribution!

Please add tests for this behavior. Also, please assert that word break continues to break around spaces, and only line break gets the new behavior.

sffc, Nov 10 '25 21:11