icu4x FYI: Proposals for changes to rules in Unicode Standard Annexes 14 and 29

The Properties and Algorithms Group plans to recommend the following proposals to Unicode Technical Committee #‌175 later this month. If they are accepted, the changes would be published as part of Unicode Version 15.1, in September.

UAX #‌14:

L2/23-063, Line breaking around quotation marks.
L2/23-072, Proposed changes for line breaking on orthographic syllables.
- Note that this involves new property values for the Line_Break property.

UAX #‌29:

(No proposal paper; this will be part of L2/23-079.) Upstream the CLDR root tailoring for grapheme clusters, that is, add a new rule GB9c: LinkingConsonant ExtCccZwj* Virama ExtCccZwj* × LinkingConsonant, where:
- Virama=[\p{Gujr}\p{sc=Telu}\p{sc=Mlym}\p{sc=Orya}\p{sc=Beng}\p{sc=Deva}&\p{Indic_Syllabic_Category=Virama}]
- LinkingConsonant=[\p{Gujr}\p{sc=Telu}\p{sc=Mlym}\p{sc=Orya}\p{sc=Beng}\p{sc=Deva}&\p{Indic_Syllabic_Category=Consonant}]
- ExtCccZwj=[[\p{gcb=Extend}-\p{ccc=0}] \p{gcb=ZWJ}]
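
To make the shape of the proposed rule concrete, here is a minimal Python sketch. It assumes every code point has already been mapped to one of four hypothetical single-letter labels (C = LinkingConsonant, V = Virama, E = ExtCccZwj, O = anything else). It further assumes that a code point matching both Virama and ExtCccZwj is labeled V, so V is accepted wherever the rule allows ExtCccZwj. The labels, the function name, and the regex encoding are illustrative only and are not part of CLDR or UAX #29.

```python
import re

# Hypothetical one-letter labels for pre-classified code points:
#   C = LinkingConsonant, V = Virama, E = ExtCccZwj, O = anything else.
# Proposed rule:
#   LinkingConsonant ExtCccZwj* Virama ExtCccZwj* × LinkingConsonant
# Assumption of this sketch: a V-labeled code point also counts as ExtCccZwj,
# hence [EV] in the two starred positions.
_GB9C_LEFT = re.compile(r"C[EV]*V[EV]*$")


def gb9c_forbids_break(labels: str, pos: int) -> bool:
    """Return True if the sketched rule forbids a break before labels[pos]."""
    if pos <= 0 or pos >= len(labels) or labels[pos] != "C":
        return False
    # The left context must end exactly at the candidate break position.
    return _GB9C_LEFT.search(labels[:pos]) is not None
```

For a label string such as `"CVC"` (consonant, virama, consonant), calling `gb9c_forbids_break("CVC", 2)` asks whether the conjunct should stay in one grapheme cluster.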

Apr 04 '23 13:04 eggrobin

@makotokato @aethanyc

Apr 04 '23 23:04 sffc

@aethanyc or @makotokato can you take this issue? Probably for 1.x Priority.

Apr 20 '23 18:04 sffc

Discussion: Longer term, we would like the upstreamed TOML files to be updated along with the specification, so that ICU4X needs to do nothing more than pull in updates from upstream.

May 11 '23 18:05 sffc

Looking at the TOML files, my impression is that they define a state machine whose transitions are driven by code points: a [[tables]] record defines a transition from its left state to its name state when the next code point has the class right. The break at each step is then determined by the [[rules]] record with a matching left state, looking ahead one code point that matches the class right.
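
The reading described above can be modelled with a toy Python sketch. It is not the ICU4X data format or implementation: the state names, the dictionary layout, and the sample entries are all invented for illustration. Only the roles of left, right, and name are taken from the description.

```python
# Toy model of the reading described above; all data here is invented.
# "tables": (left state, right class) -> name (the state entered next).
TABLES = {
    ("AL", "AL"): "AL",
    ("AL", "SP"): "AL_SP",
    ("AL_SP", "AL"): "AL",
    ("AL_SP", "SP"): "AL_SP",
}
# "rules": (left state, right class) -> True if a break is allowed between them.
RULES = {
    ("AL", "AL"): False,
    ("AL", "SP"): False,
    ("AL_SP", "AL"): True,
    ("AL_SP", "SP"): False,
}


def break_positions(classes):
    """Return every index i such that a break is allowed before classes[i]."""
    if not classes:
        return []
    state = classes[0]  # assumption: the first class doubles as the start state
    breaks = []
    for i in range(1, len(classes)):
        right = classes[i]
        # The decision uses only the current state and ONE class of lookahead.
        if RULES[(state, right)]:
            breaks.append(i)
        state = TABLES[(state, right)]
    return breaks
```

In this model the break decision at each step consults only the current state and the class of the next code point, so a rule that depends on the code point after that one has nowhere to look.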

The following new line breaking rules require more lookahead than that:

× [\p{Pf}&QU] ( SP | GL | WJ | CL | QU | CP | EX | IS | SY | BK | CR | LF | NL | eot)
(AK | ◌ | AS) × (AK | ◌ | AS) VF

These require looking at two code points to the right of the (non-)break, plus any intervening CM (since these are after LB9).
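
As an illustration of that extra lookahead, the following Python sketch checks the second rule against a list of pre-assigned line break classes. It is a simplification under stated assumptions: `DC` is a hypothetical label standing in for the dotted circle (◌), the function names are invented, and the interaction with other rules (for example LB10) is ignored.

```python
AKSARA_START = {"AK", "DC", "AS"}  # DC: hypothetical label for U+25CC (◌)
COMBINING = {"CM", "ZWJ"}          # absorbed into the preceding base by LB9


def next_base(classes, i):
    """Index of the first non-CM/ZWJ class at or after i, or None at end of text."""
    while i < len(classes) and classes[i] in COMBINING:
        i += 1
    return i if i < len(classes) else None


def no_break_before_aksara_vf(classes, pos):
    """Sketch of (AK | ◌ | AS) × (AK | ◌ | AS) VF for a break before classes[pos]."""
    if pos <= 0 or pos >= len(classes) or classes[pos] not in AKSARA_START:
        return False
    # Left context: walk back over any CM/ZWJ attached to the previous base.
    left = pos - 1
    while left > 0 and classes[left] in COMBINING:
        left -= 1
    if classes[left] not in AKSARA_START:
        return False
    # Right context: the SECOND base after the candidate break, skipping CM/ZWJ.
    second = next_base(classes, pos + 1)
    return second is not None and classes[second] == "VF"
```

The call to `next_base(classes, pos + 1)` is the part a one-class lookahead cannot express: it has to step past `classes[pos]` and any combining marks before it can see whether a VF follows.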

May 17 '23 14:05 eggrobin

Gecko bug

Oct 19 '23 06:10 hsivonen

Henri, this is interesting.

In your comment you correctly identified what LB15a and LB15b are trying to do, and why they need to do that (instead of treating Pi as LB=OP and Pf as LB=CL: that would mess with German, Finnish, etc. usage of Pf initially or Pi finally).

However, these new rules do not help with the Chinese issue at hand, since there are no spaces (there may visually appear to be space, but that is because U+2018 etc. have ambiguous width; here they are wide). This has recently come to the attention of the Properties and Algorithms Group of the UTC; it may be possible to do something about it in the ID QU ID case. I will mention that issue in that discussion. Nothing will happen on that front before Unicode 16.0 in September 2024 though.

Oct 19 '23 15:10 eggrobin

We still need to update the line segmenter to Unicode 15.1. @makotokato is working on it.

May 17 '24 21:05 aethanyc

I am experimenting with moving LB8a and LB9 into the code of the line segmenter, as

- the combination of these rules makes the state table extraordinarily painful to maintain (and it makes it large), as every state needs to be replicated: X ZWJ is different from X for most X since there is no break after ZWJ per LB8a, but X ZWJ CM brings you back to the X state, so the X ZWJ states cannot be merged;
- these rules cannot be tailored (so there is no reason to allow for custom data to change their behaviour), and are in practice reasonably stable: they last changed in Unicode 11 (2018), following up on some earlier Unicode 9 (2016) changes for emoji ZWJ sequences; contrast the other rules that have been changing wildly every year.
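
A rough Python sketch of what handling these two rules in code could look like, so that the table only ever sees base classes. This is not the ICU4X implementation: `table_allows_break` is a hypothetical callback standing in for the state table, the function name is invented, and LB10 is approximated by treating unattached CM/ZWJ as AL.

```python
COMBINING = {"CM", "ZWJ"}
# Classes after which LB9 does not attach a following CM or ZWJ.
NO_ATTACH = {"BK", "CR", "LF", "NL", "SP", "ZW"}


def break_positions(classes, table_allows_break):
    """Return break indices, with LB8a and LB9 applied outside the table.

    table_allows_break(left, right) is a hypothetical stand-in for the
    data-driven state table; it only ever receives base classes.
    """
    breaks = []
    base = None  # class of the cluster to the left, as the table sees it
    for i, cls in enumerate(classes):
        attaches = cls in COMBINING and base is not None and base not in NO_ATTACH
        if attaches:
            continue  # LB9: X (CM | ZWJ)* is treated as X, so base is unchanged
        effective = "AL" if cls in COMBINING else cls  # LB10 approximation
        if i > 0:
            after_zwj = classes[i - 1] == "ZWJ"  # LB8a: no break after ZWJ
            if not after_zwj and table_allows_break(base, effective):
                breaks.append(i)
        base = effective
    return breaks
```

Because the ZWJ check reads the previous code point directly, X ZWJ and X ZWJ CM need no states of their own: the table sees X in both cases, and only the LB8a test differs.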

Jun 04 '24 12:06 eggrobin
