icu4x Fix line breaking for Khmer text (issue #7218)

Problem

Spaces in Khmer text create isolated segments with breaks before AND after each space:

Input: "អស់ នឹង មាន"
Output: ['អស់', ' ', 'នឹង', ' ', 'មាន'] ❌ (isolated spaces)

Solution Overview

Fix two issues that prevented proper space handling:

- `language.rs`: Spaces were split into separate chunks, so the complex segmenter never saw them
- `line.rs`: The complex segmenter wasn't triggered for SA×SPACE×SA sequences

Changes

`components/segmenter/src/complex/language.rs`

What changed: Don't split text on whitespace characters

Lines ~62 & ~105: Modified both UTF-8 and UTF-16 iterators to skip whitespace when checking for language changes

Effect: Khmer phrases with spaces stay together as one chunk: "អស់ នឹង" instead of "អស់", " ", "នឹង"
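
To illustrate the chunking idea, here is a minimal standalone sketch. It is not the actual `language.rs` code: the `Lang` enum, `lang_of`, and `chunks` are hypothetical names, and the classification only recognizes the Khmer Unicode block. It assumes whitespace is treated as neutral, so it never starts a new chunk.

```rust
/// Toy language tag (hypothetical; not the ICU4X type).
#[derive(Clone, Copy, PartialEq, Debug)]
enum Lang {
    Khmer,
    Other,
}

/// Classify by Unicode block. Only the Khmer block (U+1780..=U+17FF) is
/// recognized; this is an assumption made for the sketch.
fn lang_of(c: char) -> Lang {
    if ('\u{1780}'..='\u{17FF}').contains(&c) {
        Lang::Khmer
    } else {
        Lang::Other
    }
}

/// Split `text` into runs of a single language. Whitespace is skipped when
/// checking for a language change, so it stays inside the current run.
fn chunks(text: &str) -> Vec<&str> {
    let mut out = Vec::new();
    let mut start = 0;
    let mut current: Option<Lang> = None;
    for (i, c) in text.char_indices() {
        if c.is_whitespace() {
            continue; // neutral: never triggers a split
        }
        let lang = lang_of(c);
        if let Some(prev) = current {
            if prev != lang {
                // Language changed: close the run that ends just before `i`.
                out.push(&text[start..i]);
                start = i;
            }
        }
        current = Some(lang);
    }
    if start < text.len() {
        out.push(&text[start..]);
    }
    out
}

fn main() {
    // The Khmer phrase stays together as one chunk, space included.
    assert_eq!(chunks("អស់ នឹង"), vec!["អស់ នឹង"]);
    // A real language change still splits.
    assert_eq!(chunks("អស់ នឹង abc").len(), 2);
    println!("ok");
}
```

In this sketch, whitespace before a language change stays attached to the preceding chunk; whether the real iterators do the same is not stated here.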

`components/segmenter/src/line.rs`

What changed: Handle SA×SPACE×SA (complex script + space + complex script) sequences

4 changes:

1. ~Line 1070: Add `peek_past_spaces_for_sa()` helper
   - Looks ahead past consecutive spaces to check if SA continues
2. ~Line 880: Extend complex breaking trigger
   - Changed from: only trigger for SA × SA
   - Changed to: trigger for SA × SA OR SA × SPACE × SA
3. ~Line 908: Suppress UAX#14 breaks
   - Don't break at SA × SP if SA continues after space(s)
4. ~Lines 1165 & 1198: Include spaces in text collection
   - Complex segmenter sees full phrases with spaces: "អស់ នឹង"

Effect: The complex segmenter (LSTM/dictionary) receives the entire SA×SPACE×SA sequence and decides where to break within it
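
The look-ahead and the extended trigger can be sketched roughly as follows. This is a simplified standalone model, not the code in `line.rs`: the `Class` enum, `class_of`, and `triggers_complex` are hypothetical, the signature of `peek_past_spaces_for_sa` is assumed, and treating the whole Khmer block as SA is an approximation (some characters in that block have other line break classes).

```rust
/// Tiny subset of UAX #14 line break classes, enough for this sketch.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Class {
    SA, // complex-context (South East Asian) characters
    SP, // space
    AL, // stand-in for everything else
}

/// Rough classification (assumption: every char in the Khmer block is SA).
fn class_of(c: char) -> Class {
    match c {
        ' ' => Class::SP,
        '\u{1780}'..='\u{17FF}' => Class::SA,
        _ => Class::AL,
    }
}

fn classes(text: &str) -> Vec<Class> {
    text.chars().map(class_of).collect()
}

/// Starting at index `pos`, skip consecutive spaces and report whether the
/// first non-space class is SA. Returns false if the input ends first.
fn peek_past_spaces_for_sa(classes: &[Class], pos: usize) -> bool {
    classes[pos..]
        .iter()
        .find(|&&c| c != Class::SP)
        .map_or(false, |&c| c == Class::SA)
}

/// Should the boundary after `classes[left]` be handed to the complex
/// segmenter? True for SA × SA and for SA × SP+ × SA.
fn triggers_complex(classes: &[Class], left: usize) -> bool {
    if classes[left] != Class::SA {
        return false;
    }
    match classes.get(left + 1) {
        Some(Class::SA) => true,
        Some(Class::SP) => peek_past_spaces_for_sa(classes, left + 1),
        _ => false,
    }
}

fn main() {
    // Index 2 is the last character of the first Khmer word.
    assert!(triggers_complex(&classes("អស់ នឹង"), 2)); // SA × SP × SA
    assert!(!triggers_complex(&classes("អស់ abc"), 2)); // SA × SP × AL
    println!("ok");
}
```

When the look-ahead returns false, the sketch falls through to `false`, which corresponds to leaving the position to the regular UAX #14 rules.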

Result

Before: [0, 9, 10, 19, 20, ...] (UTF-8 byte offsets; a break on both sides of each space)
After: [0, 9, 19, 29, ...] (one break per space)
Spaces properly included with words: ['អស់', ' នឹង', ' មាន'] ✅
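
As a sanity check on the offsets above, this small standalone snippet recomputes the break positions from the two segment lists. It does not call the segmenter; `break_offsets` is a hypothetical helper that accumulates UTF-8 byte lengths (each Khmer code point here is 3 bytes, each space 1 byte).

```rust
/// Turn a list of segments into cumulative UTF-8 byte offsets, starting at 0.
fn break_offsets(segments: &[&str]) -> Vec<usize> {
    let mut offsets = vec![0];
    let mut pos = 0;
    for s in segments {
        pos += s.len(); // `str::len` is the length in bytes
        offsets.push(pos);
    }
    offsets
}

fn main() {
    // Segments as reported before the fix: spaces are isolated.
    let before = break_offsets(&["អស់", " ", "នឹង", " ", "មាន"]);
    // Segments as reported after the fix: each space travels with a word.
    let after = break_offsets(&["អស់", " នឹង", " មាន"]);

    assert_eq!(before, vec![0, 9, 10, 19, 20, 29]);
    assert_eq!(after, vec![0, 9, 19, 29]);
    println!("before: {:?}", before);
    println!("after:  {:?}", after);
}
```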

Impact

Fixes line breaking for Khmer, and possibly also for Thai, Lao, and Myanmar scripts. Matches ICU4C behavior.

Nov 09 '25 07:11 kenton-r

All committers have signed the CLA.

Nov 09 '25 07:11 CLAassistant

Thanks for the contribution!

Please add tests for this behavior. Also, please assert that word break continues to break around spaces, and only line break gets the new behavior.

Nov 10 '25 21:11 sffc