ICU-22707 Unicode 16 beta jun04
- new short aliases ID_Status, ID_Type
- Unicode 16 beta data as of 2024-jun-04, including
- https://github.com/unicode-org/cldr/pull/3783
Checklist
- [x] Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22707
- [x] Required: The PR title must be prefixed with a JIRA Issue number.
- [x] Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
- [x] Required: Each commit message must be prefixed with a JIRA Issue number.
- [x] Issue accepted (done by Technical Committee after discussion)
- [x] Tests included, if applicable
- [x] API docs and/or User Guide docs changed or added, if applicable
ALLOW_MANY_COMMITS=true
@eggrobin I have the latest Unicode 16 data here. Locally, tests pass except for intltest rbbi and intltest idna. I will probably disable the failing idna (UTS46) tests for a while. Can you please update the segmentation code & data as needed?
@echeran FYI
> Locally, tests pass except for intltest rbbi and intltest idna. I will probably disable the failing idna (UTS46) tests for a while.
Done. Locally, only intltest rbbi fails now.
> Can you please update the segmentation code & data as needed?
In this branch, or in a separate PR? (As discussed, I will want to do that with several commits, both to separate the proposals and because I want to keep a record of the steps of the LB25 derivation.)
> > Can you please update the segmentation code & data as needed?
>
> In this branch, or in a separate PR?
This pull request here is set up to allow multiple commits, and when it's done I will rebase-and-merge them, not squash them.
I assume that it would be easiest for you to add commits here directly for segmentation. Otherwise we would need to disable the failing rbbi tests as well before merging this PR into main. It feels like the risk from disabling tests is higher for rbbi than it is for idna.
Sounds reasonable, I will add commits into this one then.
Hooray! The files in the branch are the same across the force-push. 😃
~ Your Friendly Jira-GitHub PR Checker Bot
Notice: the branch changed across the force-push!
- line.txt is no longer changed in the branch
~ Your Friendly Jira-GitHub PR Checker Bot
Oh, this is fun:
createRuleBasedBreakIterator: ICU Error "U_BRK_RULE_EMPTY_SET" at line 292, column 5
This is the set [$IS & [\p{ea=F}\p{ea=W}\p{ea=H}]] which got emptied by UTC-179-C30:
[179-C30] Consensus: Change the Line_Break assignment of U+FE10 ︐ PRESENTATION FORM FOR VERTICAL COMMA to Close_Punctuation (CL), and that of U+FE13 ︓ PRESENTATION FORM FOR VERTICAL COLON and U+FE14 ︔ PRESENTATION FORM FOR VERTICAL SEMICOLON to Nonstarter (NS), to match their FULLWIDTH counterparts U+FF0C, U+FF1A, and U+FF1B. For Unicode Version 16.0. See document L2/24-064 item 5.7.
The set previously contained exactly these three characters: https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%5Cp%7BU15.1%3Alb%3DIS%7D+%26+%5B%5Cp%7BU15.1%3Aea%3DF%7D%5Cp%7BU15.1%3Aea%3DW%7D%5Cp%7BU15.1%3Aea%3DH%7D%5D%5D&g=&i=.
That item in the PAG report reads:
I only spotted that because of extremely obscure interactions between line breaking rules in the optimized ICU implementation.
So I will now have to remove those extremely obscure lines from the rules, a welcome change from my usual routine of adding extremely obscure lines.
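To make the emptied set concrete, here is a toy C++ sketch of the intersection before and after 179-C30. It is only an illustration: the sets are small hand-picked subsets held in plain `std::set` containers rather than ICU's `UnicodeSet`, and the names `intersect`, `kEastAsian`, `kIS_15_1`, and `kIS_16` are made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>

// Toy stand-in for a set of code points; only a handful are listed.
using CodePointSet = std::set<char32_t>;

CodePointSet intersect(const CodePointSet& a, const CodePointSet& b) {
    CodePointSet result;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::inserter(result, result.begin()));
    return result;
}

// Assumed subset of ea=F/W/H: the three vertical forms (ea=W) and
// their FULLWIDTH counterparts U+FF0C, U+FF1A, U+FF1B (ea=F).
const CodePointSet kEastAsian = {0xFE10, 0xFE13, 0xFE14, 0xFF0C, 0xFF1A, 0xFF1B};

// Assumed subset of lb=IS in Unicode 15.1: ASCII , . : ; plus the vertical forms.
const CodePointSet kIS_15_1 = {0x002C, 0x002E, 0x003A, 0x003B, 0xFE10, 0xFE13, 0xFE14};

// Same subset after 179-C30 moved U+FE10 to CL and U+FE13, U+FE14 to NS.
const CodePointSet kIS_16 = {0x002C, 0x002E, 0x003A, 0x003B};
```

With these tables, `intersect(kIS_15_1, kEastAsian)` holds exactly the three vertical forms, while `intersect(kIS_16, kEastAsian)` is empty, which is the situation reported as `U_BRK_RULE_EMPTY_SET`.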
Hi @eggrobin, thanks for making progress here! It sounds like this is still WIP, and I see that a number of the CI checks are unhappy. Are you going to consolidate the commits into fewer/chunkier ones?
> It sounds like this is still WIP, and I see that a number of the CI checks are unhappy.
Yes; I have brought in all the work that was already done, but as expected I need to appease the new monkeys. (And some clang warnings, etc.)
> Are you going to consolidate the commits into fewer/chunkier ones?
Mostly, no: things have already been consolidated (compare https://github.com/eggrobin/icu/compare/unicode-org%3Aicu%3Amain...uax14-integration). What remains is split by UTC decision, and, e.g., the work on UTC-179-C35 is in turn split into the steps documented in the background section of item 5.15 of the report, plus the post UTC correction; I want to retain these steps in the history of line.txt and friends.
I expect that I will coalesce whatever additional work remains to be done into one or two commits though.
Hi @eggrobin FYI @echeran now has two pending PRs that add support for new properties, which want to go in after this PR here...
Yes, I somehow got distracted from ICU4[CJ] matters last week and dropped this ball. I intend to get back to this on Monday, please poke me with a sharp stick if I don’t.
Exciting Development: While testing the new monkeys, I came across a string which exposes a bug in my rules for LB19a. Somehow the old monkeys never came up with such a string over days of testing.
This seems completely tractable in ICU, and should not require a change on the UAX14 side, so this is not an all-hands-on-deck emergency. But it is still uncomfortably exciting.
The string in question is ︷ \U00016FF1\u302B⸠ᅛᆅ, where \U00016FF1\u302B are East_Asian_Width=Wide combining marks.
That \U00016FF1, lb=CM and ea=W, being after a space, gets treated as lb=AL, but remains ea=W, so LB19a should not apply.
In ICU, LB19a was implemented in a slightly strange way: LB19 was unchanged, and the complement of LB19a is given break rules (this is to avoid having to add a profusion of rules for overlapping context spanning more than two code points). For a lb=CM following a break, the lb=CM-as-AL and ea=W case is handled by the rule
^[$CM & [\p{ea=F}\p{ea=W}\p{ea=H}]] / [\p{Pi} & $QU] $CM* [ [\p{ea=F}\p{ea=W}\p{ea=H}] - $CM];
^[$CM & [\p{ea=F}\p{ea=W}\p{ea=H}]] $CM* $CMX / [\p{Pi} & $QU] $CM* [ [\p{ea=F}\p{ea=W}\p{ea=H}] - $CM];
But in this case, the lb=CM-as-AL does not follow a break, because LB14 applied.
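The LB9/LB10 behaviour described above can be sketched in a few lines of C++. This is a simplified model under stated assumptions, not the ICU implementation: `Item` and `resolveCombiningMarks` are hypothetical names, classes are plain strings, and only the two rules in question are modelled.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Item {
    std::string lb;  // Line_Break class, e.g. "OP", "SP", "CM", "QU"
    bool eastAsian;  // true when East_Asian_Width is F, W, or H
};

// Returns the class each item is treated as after LB9 and LB10.
// An empty string marks a CM/ZWJ absorbed into the preceding base (LB9).
std::vector<std::string> resolveCombiningMarks(const std::vector<Item>& text) {
    std::vector<std::string> resolved;
    bool haveBase = false;  // is there a base that marks may attach to?
    for (const Item& item : text) {
        const bool isMark = item.lb == "CM" || item.lb == "ZWJ";
        if (isMark && haveBase) {
            resolved.push_back("");    // LB9: treat X (CM | ZWJ)* as X
        } else if (isMark) {
            resolved.push_back("AL");  // LB10: a remaining CM or ZWJ acts as AL
            haveBase = true;           // later marks attach to this AL
        } else {
            resolved.push_back(item.lb);
            // LB9 excludes BK, CR, LF, NL, SP, and ZW as bases.
            haveBase = !(item.lb == "BK" || item.lb == "CR" || item.lb == "LF" ||
                         item.lb == "NL" || item.lb == "SP" || item.lb == "ZW");
        }
    }
    return resolved;
}
```

For the sequence OP, SP, CM, CM, QU, the first CM resolves to AL and the second is absorbed. Note that `eastAsian` is never touched: the mark acts as AL for Line_Break purposes but keeps its East_Asian_Width, which is why LB19a should not apply in the string above.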
The solution should be to copy the existing rules that end with $CM+ $AL_FOLLOW, namely
$OP $CM* $SP+ $CM+ $AL_FOLLOW?;
($OP $CM* $SP+ | [$OP $QU $GL] $CM*) ([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;
^([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;
$LB8NonBreaks [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;
$CAN_CM $CM* [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;
^$CM+ [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;
- once with `$CM+ $AL_FOLLOW?` replaced by `[$CM & [\p{ea=F}\p{ea=W}\p{ea=H}]] / [\p{Pi} & $QU] $CM* [ [\p{ea=F}\p{ea=W}\p{ea=H}] - $CM]`,
- once with that replaced by `[$CM & [\p{ea=F}\p{ea=W}\p{ea=H}]] $CM* $CMX / [\p{Pi} & $QU] $CM* [ [\p{ea=F}\p{ea=W}\p{ea=H}] - $CM]`.
This test case is sufficiently treacherous that it should be added both to rbbitst.txt and to the UCD’s own LineBreakTest.txt.
Notice: the branch changed across the force-push!
- icu4c/source/data/brkitr/rules/line.txt is different
- icu4c/source/test/testdata/break_rules/line.txt is now changed in the branch
- icu4c/source/test/testdata/rbbitst.txt is different
~ Your Friendly Jira-GitHub PR Checker Bot
Notice: the branch changed across the force-push!
- icu4c/source/data/brkitr/rules/line.txt is different
~ Your Friendly Jira-GitHub PR Checker Bot
@markusicu Status report: 70089cd68383daeb611017393708b54a907f17d0 is green (except for clang warnings which I am fixing in the next commit), so if this is blocking too many things you could run with it. It is however wrong, as the old monkeys demonstrate if they run for long enough. It is wrong in a way I understand and have documented in line.txt, and I think I know how to fix that, though it will involve writing some truly disgusting regular expressions.
Also note that so far this PR does not upgrade any of the tailored copies of the line breaking algorithm (which should receive the same changes as the default). I don’t want to do that before I get the changes to the default right.
> @markusicu Status report: 70089cd is green (except for clang warnings which I am fixing in the next commit),
Great, thanks! :tada:
> so if this is blocking too many things you could run with it. It is however wrong, as the old monkeys demonstrate if they run for long enough.
Given the US holiday and your and Elango's travel schedules, I suggest that we keep this PR open for now. If you have more time to work on it, you can make progress right here. It would be nice if it was still "green" next week. At that point I (and maybe Andy) could look it over for plausibility and code changes, and merge. And then I might try to rebase Elango's InCB PR -- or I might just wait for his return. Separately I could start fixing ICU UTS46 code for 16 once this PR is in.
Added Andy as a reviewer for the segmentation changes. (incomplete, see comments above and separate email)
With https://github.com/unicode-org/icu/pull/3028/commits/c96eb89656e806c008ada83cef7b380b84ad6608, the old monkeys would quickly have caught the issue in the rules at https://github.com/unicode-org/icu/pull/3028/commits/6ae4c111843daf671bb4777ed3d6f69374df43d4, e.g.
C:\Users\robin\Projects\Unicode\icu\icu4c\source\test\intltest\rbbitst.cpp:4559 Break expected but not found at index 165. Parameters to reproduce: @"type=line seed=2687737441 loop=1"
149 : | | \U0001f599 ID LB 9 - adjust for combining sequences. SIDEWAYS WHITE RIGHT POINTING INDEX
151 : . . \ufe15 EX&eastAsian LB 13 Don't break before closings. PRESENTATION FORM FOR VERTICAL EXCLAMATION MARK
152 : | | \u2e3a B2 LB 9 - adjust for combining sequences. TWO-EM DASH
153 : . . \u200d ZWJ ZERO WIDTH JOINER
154 : . . \U0001343f CL LB 8a ZWJ x EGYPTIAN HIEROGLYPH END WALLED ENCLOSURE
156 : . . \u2060 WJ LB 11 Do not break before or after WORD JOINER and related characters. WORD JOINER
157 : . . \ufffc CB LB 11 Do not break before or after WORD JOINER and related characters. OBJECT REPLACEMENT CHARACTER
158 : . . \u00ab QU&Pi LB 19a [^\p{ea=F}\p{ea=W}\p{ea=H}] × QU LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
159 : . . \u25cc AL&DOTTEDC. LB 19 [QU-\p{Pf}] × DOTTED CIRCLE
160 : . . \U0001e2ef CM WANCHO TONE KOINI
162 : | | \ufe5d OP&eastAsian LB 9 - adjust for combining sequences. SMALL LEFT TORTOISE SHELL BRACKET
163 : . . \u0020 SP LB 7 Don't break before spaces or zero-width space. SPACE
164 : . . \u302d CM&eastAsian LB 14 Don't break after OP SP* IDEOGRAPHIC ENTERING TONE MARK
--> 165 : | . \u2e09 QU&Pi LB 9 - adjust for combining sequences. LEFT TRANSPOSITION BRACKET
166 : . . \ufe6a PO&eastAsian LB 19 [QU-\p{Pf}] × SMALL PERCENT SIGN
167 : . . \u25cc AL&DOTTEDC. LB 24 no break between prefix and letters or ideographs DOTTED CIRCLE
168 : . . \u17db PR LB 24 no break between prefix and letters or ideographs KHMER CURRENCY SYMBOL RIEL
169 : . . \uda97 SG LB 24 no break between prefix and letters or ideographs <lead surrogate-DA97>
170 : . . \u0bf9 PR LB 24 no break between prefix and letters or ideographs TAMIL RUPEE SIGN
171 : . . \u000a LF LB 6 Don't break before hard line breaks <control-000A>
172 : | | \U0001f8fd XX&ExtPicCn LB 9 - adjust for combining sequences. <unassigned-1F8FD>
174 : . . \ufe69 PR&eastAsian LB 24 no break between prefix and letters or ideographs SMALL DOLLAR SIGN
175 : . . \u25cc AL&DOTTEDC. LB 24 no break between prefix and letters or ideographs DOTTED CIRCLE
176 : | | \uff62 OP&eastAsian LB 9 - adjust for combining sequences. HALFWIDTH LEFT CORNER BRACKET
177 : . . \U0001f3fc EM LB 14 Don't break after OP SP* EMOJI MODIFIER FITZPATRICK TYPE-3
It also catches this issue:
C:\Users\robin\Projects\Unicode\icu\icu4c\source\test\intltest\rbbitst.cpp:4559 Break expected but not found at index 35. Parameters to reproduce: @"type=line seed=2505005185 loop=1"
26 : | | \U000193b8 XX LB 9 - adjust for combining sequences. <unassigned-193B8>
28 : . . \u00bf OP LB 30 No break in letters, numbers, or ordinary symbols, opening/closing punctuation. INVERTED QUESTION MARK
29 : . . \u2029 BK LB 6 Don't break before hard line breaks PARAGRAPH SEPARATOR
30 : | | \U0001f1f8 RI LB 9 - adjust for combining sequences. REGIONAL INDICATOR SYMBOL LETTER S
32 : | | \u27c5 OP LB 9 - adjust for combining sequences. LEFT S-SHAPED BAG DELIMITER
33 : . . \u0020 SP LB 7 Don't break before spaces or zero-width space. SPACE
34 : . . \u3000 BA&eastAsian LB 14 Don't break after OP SP* IDEOGRAPHIC SPACE
--> 35 : | . \u2e09 QU&Pi LB 9 - adjust for combining sequences. LEFT TRANSPOSITION BRACKET
36 : . . \U0001f195 AI&eastAsian LB 19 [QU-\p{Pf}] × SQUARED NEW
38 : . . \u0085 NL LB 6 Don't break before hard line breaks <control-0085>
39 : | | \u005d CP LB 9 - adjust for combining sequences. RIGHT SQUARE BRACKET
40 : | | \u111c JL LB 9 - adjust for combining sequences. HANGUL CHOSEONG MIEUM-PIEUP
41 : | | \u1b44 VI LB 9 - adjust for combining sequences. BALINESE ADEG ADEG
~~I will try to move forward with making the first kind of issue the expected behaviour (that requires a small change to UAX14, namely adding to https://www.unicode.org/reports/tr14/tr14-52.html#LB10 "and had ea=Na"), and to fix the other issue.~~ [Never mind, let’s go ahead with the current UAX, see below.]
Notice: the branch changed across the force-push!
- icu4c/source/test/intltest/rbbitst.cpp is different
~ Your Friendly Jira-GitHub PR Checker Bot
Status report: I went back to my original approach of using look-ahead break rules.
Thanks to a comment from @aheninger,
> Look ahead rules with ambiguous preceding context can lead to weird behavior.
I was able to figure out the issue with the batch of rules I had added in 9782d0d60661b8970848dd6a9c271fe2651da8e4, and to fix the BA situation as well.
This has passed a million random strings with the old monkeys as updated by c96eb89656e806c008ada83cef7b380b84ad6608 (which had immediately found the bugs that were lurking in 6ae4c111843daf671bb4777ed3d6f69374df43d4; the issue here was that ea=W is very rare in \p{lb=CM} (4.5‰) and \p{lb=BA} (3.7‰), so the generator needs to compensate for that in order to put characters with these combinations of properties in interesting contexts).
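As a rough illustration of what "compensating" can mean for a random test generator, the following C++ sketch compares proportional sampling with a boosted weight for the rare bucket. It is an assumption-laden toy, not the code in rbbitst.cpp: `countRarePicks` is a hypothetical helper, bucket 0 stands for the rare combination (say, lb=CM with ea=W), and the weights are chosen only for demonstration.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Draws `draws` bucket indices according to `weights` and counts how often
// bucket 0 (the rare property combination in this toy model) is picked.
std::size_t countRarePicks(const std::vector<double>& weights,
                           std::size_t draws, unsigned seed) {
    assert(!weights.empty());
    std::mt19937 engine(seed);
    std::discrete_distribution<std::size_t> pick(weights.begin(), weights.end());
    std::size_t rare = 0;
    for (std::size_t i = 0; i < draws; ++i) {
        if (pick(engine) == 0) {
            ++rare;
        }
    }
    return rare;
}
```

With weights `{0.0045, 0.9955}` the rare bucket shows up only a few dozen times in 10000 draws; with `{0.5, 0.5}` it shows up about half the time, so strings that place such characters in interesting contexts become far more likely.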
I will probably do some git history cleanups (drop 70089cd68383daeb611017393708b54a907f17d0, squash d26a3214a16d039ba24459c3c9200988e643d60e into 9782d0d60661b8970848dd6a9c271fe2651da8e4, remove the stray change from 4df566cdd3c965d8210119abb1182d0176b453de, etc.), fix up the comments in the various line.txt files (many still describe rules as being non-UAX #14 behaviour, but everything has been upstreamed now), and then move on to the tailorings.
~~There should be no need for a last-minute change to UAX #14, and the rules currently in this branch should now be a correct implementation of UAX #14 version 16.0β.~~
EDIT: Well, not quite yet:
C:\Users\robin\Projects\Unicode\icu\icu4c\source\test\intltest\rbbitst.cpp:4560 Break expected but not found at index 19. Parameters to reproduce: @"type=line seed=1263010113 loop=1"
10 : | | \U0001f18e AI&eastAsian LB 9 - adjust for combining sequences. NEGATIVE SQUARED AB
12 : . . \u20d2 CM COMBINING LONG VERTICAL LINE OVERLAY
13 : | | \u1115 JL LB 9 - adjust for combining sequences. HANGUL CHOSEONG NIEUN-TIKEUT
14 : | | \u2329 OP&eastAsian LB 9 - adjust for combining sequences. LEFT-POINTING ANGLE BRACKET
15 : . . \u3099 CM&eastAsian COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
16 : . . \u0020 SP LB 7 Don't break before spaces or zero-width space. SPACE
17 : . . \u200d ZWJ LB 14 Don't break after OP SP* ZERO WIDTH JOINER
18 : . . \u3000 BA&eastAsian LB 8a ZWJ x IDEOGRAPHIC SPACE
--> 19 : | . \u2e20 QU&Pi LB 9 - adjust for combining sequences. LEFT VERTICAL BAR WITH QUILL
20 : . . \uff68 CJ LB 19 [QU-\p{Pf}] × HALFWIDTH KATAKANA LETTER SMALL I
21 : . . \u200d ZWJ ZERO WIDTH JOINER
22 : . . \U0001f8e6 XX&ExtPictCn LB 8a ZWJ x <unassigned-1F8E6>
24 : . . \u309c NS&eastAsian LB 21 KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
25 : | | \u1bf3 VF LB 9 - adjust for combining sequences. BATAK PANONGONAN
26 : | | \u60a2 ID&eastAsian LB 9 - adjust for combining sequences. CJK UNIFIED IDEOGRAPH-60A2
27 : | | \u1138 JL LB 9 - adjust for combining sequences. HANGUL CHOSEONG SIOS-KHIEUKH
Anyway, more tomorrow.
Aside from the fact that I still have edge cases to deal with, if we want to rely on the old monkeys, perhaps we should feed them more bits; it certainly looks like the LCG has cycled here (over the course of a couple million test strings).
C:\Users\robin\Projects\Unicode\icu\icu4c\source\test\intltest\rbbitst.cpp:4560 Break expected but not found at index 19. Parameters to reproduce: @"type=line seed=1263010113 loop=1"
10 : | | \U0001f18e AI&eastAsian LB 9 - adjust for combining sequences. NEGATIVE SQUARED AB
12 : . . \u20d2 CM COMBINING LONG VERTICAL LINE OVERLAY
13 : | | \u1115 JL LB 9 - adjust for combining sequences. HANGUL CHOSEONG NIEUN-TIKEUT
14 : | | \u2329 OP&eastAsian LB 9 - adjust for combining sequences. LEFT-POINTING ANGLE BRACKET
15 : . . \u3099 CM&eastAsian COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
16 : . . \u0020 SP LB 7 Don't break before spaces or zero-width space. SPACE
17 : . . \u200d ZWJ LB 14 Don't break after OP SP* ZERO WIDTH JOINER
18 : . . \u3000 BA&eastAsian LB 8a ZWJ x IDEOGRAPHIC SPACE
--> 19 : | . \u2e20 QU&Pi LB 9 - adjust for combining sequences. LEFT VERTICAL BAR WITH QUILL
20 : . . \uff68 CJ LB 19 [QU-\p{Pf}] × HALFWIDTH KATAKANA LETTER SMALL I
21 : . . \u200d ZWJ ZERO WIDTH JOINER
22 : . . \U0001f8e6 XX&ExtPictCn LB 8a ZWJ x <unassigned-1F8E6>
24 : . . \u309c NS&eastAsian LB 21 KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
25 : | | \u1bf3 VF LB 9 - adjust for combining sequences. BATAK PANONGONAN
26 : | | \u60a2 ID&eastAsian LB 9 - adjust for combining sequences. CJK UNIFIED IDEOGRAPH-60A2
27 : | | \u1138 JL LB 9 - adjust for combining sequences. HANGUL CHOSEONG SIOS-KHIEUKH
C:\Users\robin\Projects\Unicode\icu\icu4c\source\test\intltest\rbbitst.cpp:4560 Break expected but not found at index 434. Parameters to reproduce: @"type=line seed=1479585929 loop=1"
425 : | | \U0001f18e AI&eastAsian LB 9 - adjust for combining sequences. NEGATIVE SQUARED AB
427 : . . \u20d2 CM COMBINING LONG VERTICAL LINE OVERLAY
428 : | | \u1115 JL LB 9 - adjust for combining sequences. HANGUL CHOSEONG NIEUN-TIKEUT
429 : | | \u2329 OP&eastAsian LB 9 - adjust for combining sequences. LEFT-POINTING ANGLE BRACKET
430 : . . \u3099 CM&eastAsian COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
431 : . . \u0020 SP LB 7 Don't break before spaces or zero-width space. SPACE
432 : . . \u200d ZWJ LB 14 Don't break after OP SP* ZERO WIDTH JOINER
433 : . . \u3000 BA&eastAsian LB 8a ZWJ x IDEOGRAPHIC SPACE
--> 434 : | . \u2e20 QU&Pi LB 9 - adjust for combining sequences. LEFT VERTICAL BAR WITH QUILL
435 : . . \uff68 CJ LB 19 [QU-\p{Pf}] × HALFWIDTH KATAKANA LETTER SMALL I
436 : . . \u200d ZWJ ZERO WIDTH JOINER
437 : . . \U0001f8e6 XX&ExtPictCn LB 8a ZWJ x <unassigned-1F8E6>
439 : . . \u309c NS&eastAsian LB 21 KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
440 : | | \u1bf3 VF LB 9 - adjust for combining sequences. BATAK PANONGONAN
441 : | | \u60a2 ID&eastAsian LB 9 - adjust for combining sequences. CJK UNIFIED IDEOGRAPH-60A2
442 : | | \u1138 JL LB 9 - adjust for combining sequences. HANGUL CHOSEONG SIOS-KHIEUKH
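To illustrate why a small-state generator eventually repeats its test strings, here is a toy C++ sketch that measures the period of a linear congruential generator with a reduced modulus. The constants and the name `lcgPeriod` are chosen for illustration only and are not claimed to match the generator used by the old monkeys.

```cpp
#include <cassert>
#include <cstdint>

// Steps a toy LCG, x -> (a * x + c) mod 2^bits, until the state returns
// to where it started, and reports how many steps that took.
std::uint64_t lcgPeriod(unsigned bits, std::uint64_t seed) {
    assert(bits < 64);
    const std::uint64_t mask = (std::uint64_t{1} << bits) - 1;
    const std::uint64_t a = 1103515245;  // illustrative multiplier, a % 4 == 1
    const std::uint64_t c = 12345;       // illustrative odd increment
    std::uint64_t state = seed & mask;
    const std::uint64_t start = state;
    std::uint64_t steps = 0;
    do {
        state = (a * state + c) & mask;
        ++steps;
    } while (state != start);
    return steps;
}
```

Because the multiplier is congruent to 1 modulo 4 and the increment is odd, this toy generator has full period, so `lcgPeriod(16, 1)` is 65536. More generally, a generator with n bits of state cannot visit more than 2^n states before cycling, which is why feeding the monkeys more bits helps for long runs.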
Notice: the branch changed across the force-push!
- icu4c/source/test/intltest/rbbitst.cpp is different
~ Your Friendly Jira-GitHub PR Checker Bot
Notice: the branch changed across the force-push!
- icu4c/source/test/intltest/rbbitst.cpp is different
- icu4c/source/test/intltest/rbbitst.h is now changed in the branch
~ Your Friendly Jira-GitHub PR Checker Bot
Alright, now using ranlux48 (I had initially gone with mt19937_64, but it turns out it is not great statistically, which I do not really care about, and it has a giant state, which is a problem since we want to print the state). The cycle length should be something like 2^576, so running the tests overnight might now actually accomplish more than running them for half an hour.
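For reference, printing and restoring the state of a standard engine can be done with the stream operators that `<random>` provides. The sketch below is a minimal illustration of that round trip; `saveState`, `restoreState`, and `replayMatches` are hypothetical helper names, and the textual format is whatever the standard library emits, which is not necessarily the `engineState=[...]` format used in the test output.

```cpp
#include <cassert>
#include <random>
#include <sstream>
#include <string>

// Serializes the full engine state as whitespace-separated integers.
std::string saveState(const std::ranlux48& engine) {
    std::ostringstream out;
    out << engine;
    return out.str();
}

// Rebuilds an engine from text previously produced by saveState.
std::ranlux48 restoreState(const std::string& text) {
    std::ranlux48 engine;
    std::istringstream in(text);
    in >> engine;
    assert(!in.fail());
    return engine;
}

// Checks that a restored engine continues with the same output sequence.
bool replayMatches(unsigned long long seed) {
    std::ranlux48 engine(seed);
    engine.discard(1000);  // advance to some arbitrary point in the sequence
    std::ranlux48 replay = restoreState(saveState(engine));
    for (int i = 0; i < 100; ++i) {
        if (engine() != replay()) {
            return false;
        }
    }
    return true;
}
```

A failing run can then be reproduced by logging the saved state just before generating the test string and restoring it later, instead of relying on a single integer seed.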
Some example failures with the current rules (the pattern is obvious, fix coming).
C:\Users\robin\Projects\Unicode\icu\icu4c\source\test\intltest\rbbitst.cpp:4577 Break expected but not found at index 616. Parameters to reproduce: @"type=line engineState=[129582174696517 236031677744484 92106136591357 53087490761213 168397393323393 214824034318205 223771839526961 68791514640069 87858497332517 265571939623894 241053698228835 278905147843064 1 11] loop=1"
600 : | | \ub23c H2 LB 9 - adjust for combining sequences. HANGUL SYLLABLE NWE
601 : . . \u000d CR LB 6 Don't break before hard line breaks <control-000D>
602 : | | \uff01 EX&eastAsian LB 9 - adjust for combining sequences. FULLWIDTH EXCLAMATION MARK
603 : | | \U00011f25 AK LB 9 - adjust for combining sequences. KAWI LETTER NA
605 : . . \u200d ZWJ ZERO WIDTH JOINER
606 : . . \u200d ZWJ ZERO WIDTH JOINER
607 : . . \u2039 QU&Pi LB 8a ZWJ x SINGLE LEFT-POINTING ANGLE QUOTATION MARK
608 : . . \u1b67 ID LB 19 [QU-\p{Pf}] × BALINESE MUSICAL SYMBOL DAENG
609 : . . \ufeff WJ LB 11 Do not break before or after WORD JOINER and related characters. ZERO WIDTH NO-BREAK SPACE
610 : . . \U0001f17e AI LB 11 Do not break before or after WORD JOINER and related characters. NEGATIVE SQUARED LATIN CAPITAL LETTER O
612 : . . \u005b OP LB 30 No break in letters, numbers, or ordinary symbols, opening/closing punctuation. LEFT SQUARE BRACKET
613 : . . \u0020 SP LB 7 Don't break before spaces or zero-width space. SPACE
614 : . . \u200d ZWJ LB 14 Don't break after OP SP* ZERO WIDTH JOINER
615 : . . \u3000 BA&eastAsian LB 8a ZWJ x IDEOGRAPHIC SPACE
--> 616 : | . \u201b QU&Pi LB 9 - adjust for combining sequences. SINGLE HIGH-REVERSED-9 QUOTATION MARK
617 : . . \ubb73 H3 LB 19 [QU-\p{Pf}] × HANGUL SYLLABLE MWED
618 : | | \u2e1c QU&Pi LB 9 - adjust for combining sequences. LEFT LOW PARAPHRASE BRACKET
619 : . . \uff04 PR&eastAsian LB 19 [QU-\p{Pf}] × FULLWIDTH DOLLAR SIGN
620 : . . \U0001f575 EB LB 23a SLEUTH OR SPY
622 : | | \ud930 SG LB 9 - adjust for combining sequences. <lead surrogate-D930>
623 : . . \U00016fe4 GL&eastAsian LB 12a [^SP BA HY] x GL KHITAN SMALL SCRIPT FILLER
C:\Users\robin\Projects\Unicode\icu\icu4c\source\test\intltest\rbbitst.cpp:4577 Break expected but not found at index 369. Parameters to reproduce: @"type=line engineState=[77166811820670 237509173405882 241096858294069 143026002418558 83114972300585 115931590404619 117762907704810 175550103338949 222514071576148 25341366417220 127238580822311 22064649469508 0 3] loop=1"
353 : | | \U00016118 AS LB 9 - adjust for combining sequences. GURUNG KHEMA LETTER BHA
355 : | | \u1117 JL LB 9 - adjust for combining sequences. HANGUL CHOSEONG TIKEUT-KIYEOK
356 : . . \U00011c9b CM MARCHEN SUBJOINED LETTER THA
358 : . . \u2025 IN LB 22 TWO DOT LEADER
359 : | | \u1bd4 AS LB 9 - adjust for combining sequences. BATAK LETTER MA
360 : . . \U0001f678 QU LB 19 × [QU-\p{Pi}] SANS-SERIF HEAVY LOW DOUBLE COMMA QUOTATION MARK ORNAMENT
362 : . . \U0001f1f5 RI LB 19 [QU-\p{Pf}] × REGIONAL INDICATOR SYMBOL LETTER P
364 : . . \u2039 QU&Pi LB 19a [^\p{ea=F}\p{ea=W}\p{ea=H}] × QU SINGLE LEFT-POINTING ANGLE QUOTATION MARK
365 : . . \u29da OP LB 19 [QU-\p{Pf}] × LEFT DOUBLE WIGGLY FENCE
366 : . . \u0020 SP LB 7 Don't break before spaces or zero-width space. SPACE
367 : . . \u200d ZWJ LB 14 Don't break after OP SP* ZERO WIDTH JOINER
368 : . . \u3000 BA&eastAsian LB 8a ZWJ x IDEOGRAPHIC SPACE
--> 369 : | . \u201c QU&Pi LB 9 - adjust for combining sequences. LEFT DOUBLE QUOTATION MARK
370 : . . \U0001f3ff EM LB 19 [QU-\p{Pf}] × EMOJI MODIFIER FITZPATRICK TYPE-6
372 : | | \U0001f191 AI&eastAsian LB 9 - adjust for combining sequences. SQUARED CL
374 : . . \u2010 BA&HYPHEN LB 21 HYPHEN
375 : | | \U0001193f AP LB 9 - adjust for combining sequences. DIVES AKURU PREFIXED NASAL SIGN
377 : . . \u2010 BA&HYPHEN LB 21 HYPHEN
C:\Users\robin\Projects\Unicode\icu\icu4c\source\test\intltest\rbbitst.cpp:4577 Break expected but not found at index 188. Parameters to reproduce: @"type=line engineState=[39250322652124 26602965664501 44462335292463 257996093124684 217060184538580 59496091358185 239738342751554 187683887998792 27099803220938 235694106807173 257010418855911 52161473516717 0 4] loop=1"
181 : | | \U000ac124 XX LB 9 - adjust for combining sequences. <unassigned-AC124>
183 : | | \ud7f2 JT LB 9 - adjust for combining sequences. HANGUL JONGSEONG SIOS-HIEUH
184 : | | \u298f OP LB 9 - adjust for combining sequences. LEFT SQUARE BRACKET WITH TICK IN BOTTOM CORNER
185 : . . \u0020 SP LB 7 Don't break before spaces or zero-width space. SPACE
186 : . . \u3099 CM&eastAsian LB 14 Don't break after OP SP* COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
187 : . . \u3000 BA&eastAsian LB 21 IDEOGRAPHIC SPACE
--> 188 : | . \u2039 QU&Pi LB 9 - adjust for combining sequences. SINGLE LEFT-POINTING ANGLE QUOTATION MARK
189 : . . \uffe6 PR&eastAsian LB 19 [QU-\p{Pf}] × FULLWIDTH WON SIGN
190 : . . \u11c9 JT LB 27 Treat a Korean Syllable Block the same as ID. HANGUL JONGSEONG NIEUN-THIEUTH
191 : | | \U00018bd9 AL&eastAsian LB 9 - adjust for combining sequences. KHITAN SMALL SCRIPT CHARACTER-18BD9
193 : | | \U00011004 AP LB 9 - adjust for combining sequences. BRAHMI SIGN UPADHMANIYA
195 : | | \u1bf2 VF LB 9 - adjust for combining sequences. BATAK PANGOLAT
C:\Users\robin\Projects\Unicode\icu\icu4c\source\test\intltest\rbbitst.cpp:4577 Break expected but not found at index 386. Parameters to reproduce: @"type=line engineState=[172765020618646 156508599385088 72517161029338 279818113522329 96389523728778 67370634251451 32220910484813 18641403109399 231866011256099 3429284783750 94723958454661 116055257862891 1 5] loop=1"
378 : | | \U0001106d AS LB 9 - adjust for combining sequences. BRAHMI DIGIT SEVEN
380 : | | \U0001f590 EB LB 9 - adjust for combining sequences. RAISED HAND WITH FINGERS SPLAYED
382 : | | \ufe39 OP&eastAsian LB 9 - adjust for combining sequences. PRESENTATION FORM FOR VERTICAL LEFT TORTOISE SHELL BRACKET
383 : . . \u0020 SP LB 7 Don't break before spaces or zero-width space. SPACE
384 : . . \u200d ZWJ LB 14 Don't break after OP SP* ZERO WIDTH JOINER
385 : . . \u3000 BA&eastAsian LB 8a ZWJ x IDEOGRAPHIC SPACE
--> 386 : | . \u2e20 QU&Pi LB 9 - adjust for combining sequences. LEFT VERTICAL BAR WITH QUILL
387 : . . \ufe44 CL&eastAsian LB 13 Don't break before closings. PRESENTATION FORM FOR VERTICAL RIGHT WHITE CORNER BRACKET
388 : | | \U0001ff6d ID&ExtPictCn LB 9 - adjust for combining sequences. <unassigned-1FF6D>
390 : | | \U0001f140 AI LB 9 - adjust for combining sequences. SQUARED LATIN CAPITAL LETTER Q
392 : . . \U00011c71 EX LB 13 Don't break before closings. MARCHEN MARK SHAD
Notice: the branch changed across the force-push!
- icu4c/source/test/intltest/rbbitst.cpp is different
~ Your Friendly Jira-GitHub PR Checker Bot
> I was able to figure out the issue with the batch of rules I had added in 9782d0d60661b8970848dd6a9c271fe2651da8e4, and to fix the BA situation as well.
Correction: I was able to get genbrk to not fail some assertion that appears to be related to ambiguous preceding context. However, in the process, the rules became incorrect.
The coexistence of
($OP $CM* $SP+ | [$OP $QU $GL] $CM*) ([\p{Pi} & $QU] $CM* $SP*)+ $SP [$CM & [\p{ea=F}\p{ea=W}\p{ea=H}]] ($CM* $CMX)? / [\p{Pi} & $QU] $CM* [ [\p{ea=F}\p{ea=W}\p{ea=H}] - $CM];
^([\p{Pi} & $QU] $CM* $SP*)+ $SP [$CM & [\p{ea=F}\p{ea=W}\p{ea=H}]] ($CM* $CMX)? / [\p{Pi} & $QU] $CM* [ [\p{ea=F}\p{ea=W}\p{ea=H}] - $CM];
caused an assertion failure in genbrk. Changing the first one to
($OP $CM* $SP+ | [$OP [$QU-\p{Pi}] $GL] $CM*) ([\p{Pi} & $QU] $CM* $SP*)+ $SP [$CM & [\p{ea=F}\p{ea=W}\p{ea=H}]] ($CM* $CMX)? / [\p{Pi} & $QU] $CM* [ [\p{ea=F}\p{ea=W}\p{ea=H}] - $CM];
fixed the genbrk failure, but it causes the algorithm to be incorrect in cases such as the following:
C:\Users\robin\Projects\Unicode\icu\icu4c\source\test\intltest\rbbitst.cpp:4576 Break expected but not found at index 343. Parameters to reproduce: @"type=line engineState=[2554567808342 84362580899683 40783698935486 126079489133127 158986175474216 225539149301120 12699258960171 171041251938665 247037372527467 81213390993136 124057450842805 82209758453811 1 7] loop=1"
334 : | | \u2e3b B2 LB 9 - adjust for combining sequences. THREE-EM DASH
335 : | | \u1c47 NU LB 9 - adjust for combining sequences. LEPCHA DIGIT SEVEN
336 : | | \U00011aa0 BB LB 9 - adjust for combining sequences. SOYOMBO HEAD MARK WITH MOON AND SUN
338 : . . \u2e0c QU&Pi LB 19a [^\p{ea=F}\p{ea=W}\p{ea=H}] × QU LEFT RAISED OMISSION BRACKET
339 : . . \u2e20 QU&Pi LB 19 [QU-\p{Pf}] × LEFT VERTICAL BAR WITH QUILL
340 : . . \u0020 SP LB 7 Don't break before spaces or zero-width space. SPACE
341 : . . \U00016ff0 CM&eastAsian LB 15a (OP | QU | GL) [\p{Pi}&QU] SP* x VIETNAMESE ALTERNATE READING MARK CA
--> 343 : | . \u2e04 QU&Pi LB 9 - adjust for combining sequences. LEFT DOTTED SUBSTITUTION BRACKET
344 : . . \ua964 JL LB 19 [QU-\p{Pf}] × HANGUL CHOSEONG RIEUL-KIYEOK
345 : | | \U000113d1 AP LB 9 - adjust for combining sequences. TULU-TIGALARI REPHA
347 : . . \U00010af6 IN LB 22 MANICHAEAN PUNCTUATION LINE FILLER
349 : . . \u27eb CL LB 13 Don't break before closings. MATHEMATICAL RIGHT DOUBLE ANGLE BRACKET
350 : . . \u2010 BA&HYPHEN LB 21 HYPHEN
351 : | | \u1bf3 VF LB 9 - adjust for combining sequences. BATAK PANONGONAN
352 : | | \U0001f1f3 RI LB 9 - adjust for combining sequences. REGIONAL INDICATOR SYMBOL LETTER N