Fix line breaking for Khmer text (issue #7218)
Problem
Spaces in Khmer text create isolated segments with breaks before AND after each space:
- Input: "អស់ នឹង មាន"
- Output: ['អស់', ' ', 'នឹង', ' ', 'មាន'] ❌ (isolated spaces)
Solution Overview
Fix two issues that prevented proper space handling:
- language.rs: Spaces were split into separate chunks → complex segmenter never saw them
- line.rs: Complex segmenter wasn't triggered for SA×SPACE×SA sequences
Changes
components/segmenter/src/complex/language.rs
What changed: Don't split text on whitespace characters
Lines ~62 & ~105: Modified both UTF-8 and UTF-16 iterators to skip whitespace when checking for language changes
Effect: Khmer phrases with spaces stay together as one chunk: "អស់ នឹង" instead of "អស់", " ", "នឹង"
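To illustrate the idea, here is a minimal sketch of language chunking that ignores whitespace when looking for a language change. It is not the actual code in language.rs: `Lang`, `lang_of`, and `chunk_by_language` are hypothetical names, and the code point ranges are a crude stand-in for the real language lookup.

```rust
/// Hypothetical language tag; the real segmenter has its own type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Lang {
    Khmer,
    Thai,
    Other,
}

/// Crude block-based lookup, assumed for illustration only.
fn lang_of(c: char) -> Lang {
    match c as u32 {
        0x1780..=0x17FF => Lang::Khmer,
        0x0E00..=0x0E7F => Lang::Thai,
        _ => Lang::Other,
    }
}

/// Splits `text` wherever the language changes. Whitespace is skipped when
/// checking for a change, so it stays attached to the chunk it follows.
fn chunk_by_language(text: &str) -> Vec<(&str, Lang)> {
    let mut chunks = Vec::new();
    let mut start = 0;
    let mut current: Option<Lang> = None;
    for (i, c) in text.char_indices() {
        if c.is_whitespace() {
            continue; // whitespace never starts a new chunk
        }
        let lang = lang_of(c);
        match current {
            Some(cur) if cur != lang => {
                chunks.push((&text[start..i], cur));
                start = i;
                current = Some(lang);
            }
            _ => current = Some(lang),
        }
    }
    if start < text.len() {
        chunks.push((&text[start..], current.unwrap_or(Lang::Other)));
    }
    chunks
}
```

With this sketch, "អស់ នឹង" comes back as a single Khmer chunk, while "អស់ abc" is still split where the script actually changes.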
components/segmenter/src/line.rs
What changed: Handle SA×SPACE×SA (complex script + space + complex script) sequences
4 changes:
- ~Line 1070: Add peek_past_spaces_for_sa() helper
  - Looks ahead past consecutive spaces to check if SA continues
- ~Line 880: Extend complex breaking trigger
  - Changed from: only trigger for SA × SA
  - Changed to: trigger for SA × SA OR SA × SPACE × SA
- ~Line 908: Suppress UAX#14 breaks
  - Don't break at SA × SP if SA continues after space(s)
- ~Lines 1165 & 1198: Include spaces in text collection
  - Complex segmenter sees full phrases with spaces: "អស់ នឹង"
Effect: Complex segmenter (LSTM/dictionary) handles the entire SA×SPACE×SA sequence intelligently
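The lookahead and the extended trigger can be sketched as follows. This is an assumption about the shape of the logic, not the code in line.rs: the signature of `peek_past_spaces_for_sa` shown here, the `Class` enum, `classify`, and `should_use_complex` are all hypothetical, and treating whole Unicode blocks as SA is a simplification of the real UAX#14 property data.

```rust
/// Simplified line break classes, for illustration only.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Class {
    Sa,
    Sp,
    Other,
}

/// Crude classification: the real property data is finer grained
/// (digits in these blocks, for example, are not SA).
fn classify(c: char) -> Class {
    match c {
        ' ' => Class::Sp,
        '\u{0E00}'..='\u{0E7F}' | '\u{1780}'..='\u{17FF}' => Class::Sa,
        _ => Class::Other,
    }
}

/// Returns true if `rest` starts with one or more spaces followed by SA.
fn peek_past_spaces_for_sa(rest: &str) -> bool {
    let mut saw_space = false;
    for c in rest.chars() {
        match classify(c) {
            Class::Sp => saw_space = true,
            Class::Sa => return saw_space,
            Class::Other => return false,
        }
    }
    false // ran out of text without reaching SA
}

/// Trigger for the complex segmenter: SA × SA or SA × SPACE × SA.
fn should_use_complex(prev: char, rest: &str) -> bool {
    if classify(prev) != Class::Sa {
        return false;
    }
    match rest.chars().next().map(classify) {
        Some(Class::Sa) => true,
        Some(Class::Sp) => peek_past_spaces_for_sa(rest),
        _ => false,
    }
}
```

In this sketch a Khmer character followed by " នឹង" triggers the complex path, while the same character followed by " abc" falls back to the normal UAX#14 rules.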
Result
- Before: [0, 9, 10, 19, 20, ...] (double breaks)
- After: [0, 9, 19, 29, ...] (single breaks)
- Spaces properly included with words: ['អស់', ' នឹង', ' មាន'] ✅
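To see how the offsets map to segments, the sketch below slices the three-word input from the Problem section at each pair of adjacent break positions. It assumes the offsets are UTF-8 byte offsets (each Khmer word here is 9 bytes, each space 1 byte); `segments_from_breaks` is a hypothetical helper, not part of the segmenter API.

```rust
/// Turns a sorted list of break offsets into the text segments between them.
/// Offsets are assumed to be UTF-8 byte offsets on character boundaries;
/// slicing panics otherwise.
fn segments_from_breaks<'a>(text: &'a str, breaks: &[usize]) -> Vec<&'a str> {
    breaks
        .windows(2)
        .map(|pair| &text[pair[0]..pair[1]])
        .collect()
}
```

For "អស់ នឹង មាន", the old offsets [0, 9, 10, 19, 20, 29] yield five segments with the spaces isolated, and the new offsets [0, 9, 19, 29] yield three segments with each space attached to the following word.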
Impact
Fixes line breaking for Khmer, and possibly for Thai, Lao, and Myanmar scripts as well. Matches ICU4C behavior.
Thanks for the contribution!
Please add tests for this behavior. Also, please assert that word break continues to break around spaces, and only line break gets the new behavior.