ICU-22707 Unicode 16 beta jun04

Open markusicu opened this issue 1 year ago • 14 comments

  • new short aliases ID_Status, ID_Type
  • Unicode 16 beta data as of 2024-jun-04, including
    • https://github.com/unicode-org/cldr/pull/3783
Checklist
  • [x] Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22707
  • [x] Required: The PR title must be prefixed with a JIRA Issue number.
  • [x] Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
  • [x] Required: Each commit message must be prefixed with a JIRA Issue number.
  • [x] Issue accepted (done by Technical Committee after discussion)
  • [x] Tests included, if applicable
  • [x] API docs and/or User Guide docs changed or added, if applicable

ALLOW_MANY_COMMITS=true

markusicu avatar Jun 05 '24 15:06 markusicu

@eggrobin I have the latest Unicode 16 data here. Locally, tests pass except for intltest rbbi and intltest idna. I will probably disable the failing idna (UTS46) tests for a while. Can you please update the segmentation code & data as needed?

@echeran FYI

markusicu avatar Jun 05 '24 22:06 markusicu

> Locally, tests pass except for intltest rbbi and intltest idna. I will probably disable the failing idna (UTS46) tests for a while.

Done. Locally, only intltest rbbi fails now.

markusicu avatar Jun 05 '24 22:06 markusicu

> Can you please update the segmentation code & data as needed?

In this branch, or in a separate PR? (As discussed, I will want to do that with several commits, both to separate the proposals and because I want to keep a record of the steps of the LB25 derivation.)

eggrobin avatar Jun 05 '24 22:06 eggrobin

> > Can you please update the segmentation code & data as needed?
>
> In this branch, or in a separate PR?

This pull request here is set up to allow multiple commits, and when it's done I will rebase-and-merge them, not squash them.

I assume that it would be easiest for you to add commits here directly for segmentation. Otherwise we would need to disable the failing rbbi tests as well before merging this PR into main. It feels like the risk from disabling tests is higher for rbbi than it is for idna.

markusicu avatar Jun 05 '24 23:06 markusicu

Sounds reasonable, I will add commits into this one then.

eggrobin avatar Jun 05 '24 23:06 eggrobin

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

Notice: the branch changed across the force-push!

  • line.txt is no longer changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Oh, this is fun: createRuleBasedBreakIterator: ICU Error "U_BRK_RULE_EMPTY_SET" at line 292, column 5

This is the set [$IS & [\p{ea=F}\p{ea=W}\p{ea=H}]] which got emptied by UTC-179-C30:

[179-C30] Consensus: Change the Line_Break assignment of U+FE10 ︐ PRESENTATION FORM FOR VERTICAL COMMA to Close_Punctuation (CL), and that of U+FE13 ︓ PRESENTATION FORM FOR VERTICAL COLON and U+FE14 ︔ PRESENTATION FORM FOR VERTICAL SEMICOLON to Nonstarter (NS), to match their FULLWIDTH counterparts U+FF0C, U+FF1A, and U+FF1B. For Unicode Version 16.0. See document L2/24-064 item 5.7.

The set previously contained exactly these three characters: https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%5Cp%7BU15.1%3Alb%3DIS%7D+%26+%5B%5Cp%7BU15.1%3Aea%3DF%7D%5Cp%7BU15.1%3Aea%3DW%7D%5Cp%7BU15.1%3Aea%3DH%7D%5D%5D&g=&i=.

That item in the PAG report reads:

> I only spotted that because of extremely obscure interactions between line breaking rules in the optimized ICU implementation.

So I will now have to remove those extremely obscure lines from the rules, a welcome change from my usual routine of adding extremely obscure lines.
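To make the emptied set concrete, here is a toy C++ sketch of the intersection. It is not ICU's `UnicodeSet` or the rule compiler: the property tables are hand-written samples covering only a few code points, and the names `sampleLineBreak` and `wideIS` are hypothetical, introduced purely for illustration.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hand-written sample of Line_Break values, NOT the full UCD.
// unicode16 == true applies the reassignments from UTC decision 179-C30.
std::map<char32_t, std::string> sampleLineBreak(bool unicode16) {
    std::map<char32_t, std::string> lb;
    lb[U','] = "IS";
    lb[U'.'] = "IS";
    lb[U':'] = "IS";
    lb[U';'] = "IS";
    lb[U'\uFE10'] = unicode16 ? "CL" : "IS";  // vertical comma
    lb[U'\uFE13'] = unicode16 ? "NS" : "IS";  // vertical colon
    lb[U'\uFE14'] = unicode16 ? "NS" : "IS";  // vertical semicolon
    return lb;
}

// Toy equivalent of [$IS & [\p{ea=F}\p{ea=W}\p{ea=H}]] over the sample.
std::set<char32_t> wideIS(bool unicode16) {
    // Sample of code points with East_Asian_Width F, W, or H.
    const std::set<char32_t> eastAsian = {U'\uFE10', U'\uFE13', U'\uFE14', U'\uFF0C'};
    const std::map<char32_t, std::string> lb = sampleLineBreak(unicode16);
    std::set<char32_t> result;
    for (const auto& entry : lb) {
        if (entry.second == "IS" && eastAsian.count(entry.first) != 0) {
            result.insert(entry.first);
        }
    }
    return result;
}
```

With the 15.1-style sample, `wideIS(false)` holds the three presentation forms; with the 16.0 reassignments, `wideIS(true)` is empty, which is the situation that the rule compiler reports as `U_BRK_RULE_EMPTY_SET`.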

eggrobin avatar Jun 21 '24 13:06 eggrobin

Hi @eggrobin, thanks for making progress here! It sounds like this is still WIP, and I see that a number of the CI checks are unhappy. Are you going to consolidate the commits into fewer/chunkier ones?

markusicu avatar Jun 21 '24 17:06 markusicu

> It sounds like this is still WIP, and I see that a number of the CI checks are unhappy.

Yes; I have brought in all the work that was already done, but as expected I need to appease the new monkeys. (And some clang warnings, etc.)

> Are you going to consolidate the commits into fewer/chunkier ones?

Mostly, no: things have already been consolidated (compare https://github.com/eggrobin/icu/compare/unicode-org%3Aicu%3Amain...uax14-integration). What remains is split by UTC decision, and, e.g., the work on UTC-179-C35 is in turn split into the steps documented in the background section of item 5.15 of the report, plus the post UTC correction; I want to retain these steps in the history of line.txt and friends.

I expect that I will coalesce whatever additional work remains to be done into one or two commits though.

eggrobin avatar Jun 21 '24 17:06 eggrobin

Hi @eggrobin FYI @echeran now has two pending PRs that add support for new properties, which want to go in after this PR here...

markusicu avatar Jun 28 '24 23:06 markusicu

Yes, I somehow got distracted from ICU4[CJ] matters last week and dropped this ball. I intend to get back to this on Monday, please poke me with a sharp stick if I don’t.

eggrobin avatar Jun 28 '24 23:06 eggrobin

Exciting Development: While testing the new monkeys, I came across a string which exposes a bug in my rules for LB19a. Somehow the old monkeys never came up with such a string over days of testing.

This seems completely tractable in ICU, and should not require a change on the UAX14 side, so this is not an all-hands-on-deck emergency. But it is still uncomfortably exciting.

The string in question is ︷ \U00016FF1\u302B⸠ᅛᆅ, where \U00016FF1\u302B are East_Asian_Width=Wide combining marks. That \U00016FF1, lb=CM and ea=W, being after a space, gets treated as lb=AL, but remains ea=W, so LB19a should not apply.
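The point about the mark keeping its width can be sketched in C++ as follows. This is a simplified model under stated assumptions, not ICU code: `Props`, `resolveOrphanMark`, and `lb19aLeftContextApplies` are hypothetical names, and the LB19a condition is reduced to the left-context form `[^\p{ea=F}\p{ea=W}\p{ea=H}] × QU` that appears in the monkey output elsewhere in this thread.

```cpp
#include <cassert>
#include <string>

struct Props {
    std::string lb;  // Line_Break class
    std::string ea;  // East_Asian_Width value
};

bool isEastAsian(const Props& p) {
    return p.ea == "F" || p.ea == "W" || p.ea == "H";
}

// LB10-style resolution: a CM or ZWJ with no base to attach to is treated
// as AL. Only the line break class changes; the width is left alone.
Props resolveOrphanMark(const Props& p) {
    Props resolved = p;
    if (p.lb == "CM" || p.lb == "ZWJ") {
        resolved.lb = "AL";
    }
    return resolved;
}

// Simplified left-context test for LB19a: the rule only prevents the break
// before QU when the preceding character is not East Asian.
bool lb19aLeftContextApplies(const Props& before) {
    return !isEastAsian(before);
}
```

For a mark with `lb=CM` and `ea=W`, such as the `\U00016FF1` described above, the resolved class is AL but `lb19aLeftContextApplies` is still false.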

In ICU, LB19a was implemented in a slightly strange way: LB19 was unchanged, and the complement of LB19a is given break rules (this is to avoid having to add a profusion of rules for overlapping context spanning more than two code points). For a lb=CM following a break, the lb=CM-as-AL and ea=W case is handled by the rule

^[$CM & [\p{ea=F}\p{ea=W}\p{ea=H}]]                    / [\p{Pi} & $QU] $CM* [ [\p{ea=F}\p{ea=W}\p{ea=H}] - $CM];
^[$CM & [\p{ea=F}\p{ea=W}\p{ea=H}]] $CM* $CMX          / [\p{Pi} & $QU] $CM* [ [\p{ea=F}\p{ea=W}\p{ea=H}] - $CM];

But in this case, the lb=CM-as-AL does not follow a break, because LB14 applied.

The solution should be to copy the existing rules that end with $CM+ $AL_FOLLOW, namely

$OP $CM* $SP+ $CM+ $AL_FOLLOW?;
($OP $CM* $SP+ | [$OP $QU $GL] $CM*) ([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;
^([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;
$LB8NonBreaks [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;
$CAN_CM $CM*  [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;
^$CM+  [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;
  1. once with $CM+ $AL_FOLLOW? replaced by [$CM & [\p{ea=F}\p{ea=W}\p{ea=H}]] / [\p{Pi} & $QU] $CM* [ [\p{ea=F}\p{ea=W}\p{ea=H}] - $CM],
  2. once with that replaced by [$CM & [\p{ea=F}\p{ea=W}\p{ea=H}]] $CM* $CMX / [\p{Pi} & $QU] $CM* [ [\p{ea=F}\p{ea=W}\p{ea=H}] - $CM].

This test case is sufficiently treacherous that it should be added both to rbbitst.txt and to the UCD’s own LineBreakTest.txt.
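The copy-and-replace step described above is purely textual, so it can be sketched as a small C++ helper. This is only an illustration of the substitution, not how line.txt is actually maintained; `kExistingRules` and `deriveRules` are hypothetical names, and the rule strings are copied from the list above.

```cpp
#include <cassert>
#include <string>
#include <vector>

// The six existing rules that end with `$CM+ $AL_FOLLOW?`.
const std::vector<std::string> kExistingRules = {
    R"($OP $CM* $SP+ $CM+ $AL_FOLLOW?;)",
    R"(($OP $CM* $SP+ | [$OP $QU $GL] $CM*) ([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;)",
    R"(^([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;)",
    R"($LB8NonBreaks [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;)",
    R"($CAN_CM $CM*  [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;)",
    R"(^$CM+  [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;)",
};

// For each rule, emit two copies with the trailing `$CM+ $AL_FOLLOW?`
// replaced by the two look-ahead variants (items 1 and 2 above).
std::vector<std::string> deriveRules(const std::vector<std::string>& rules) {
    const std::string tail = "$CM+ $AL_FOLLOW?";
    const std::string wide = R"([\p{ea=F}\p{ea=W}\p{ea=H}])";
    const std::string lookahead = R"( / [\p{Pi} & $QU] $CM* [ )" + wide + " - $CM]";
    const std::string variant1 = "[$CM & " + wide + "]" + lookahead;
    const std::string variant2 = "[$CM & " + wide + "] $CM* $CMX" + lookahead;

    std::vector<std::string> derived;
    for (const std::string& rule : rules) {
        const std::string::size_type pos = rule.rfind(tail);
        if (pos == std::string::npos) continue;  // not one of the targeted rules
        std::string first = rule;
        first.replace(pos, tail.size(), variant1);
        std::string second = rule;
        second.replace(pos, tail.size(), variant2);
        derived.push_back(first);
        derived.push_back(second);
    }
    return derived;
}
```

Calling `deriveRules(kExistingRules)` yields twelve candidate rules, two per original, which would still need to be reviewed by hand before going into the rule file.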

eggrobin avatar Jul 01 '24 14:07 eggrobin

Notice: the branch changed across the force-push!

  • icu4c/source/data/brkitr/rules/line.txt is different
  • icu4c/source/test/testdata/break_rules/line.txt is now changed in the branch
  • icu4c/source/test/testdata/rbbitst.txt is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Notice: the branch changed across the force-push!

  • icu4c/source/data/brkitr/rules/line.txt is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@markusicu Status report: 70089cd68383daeb611017393708b54a907f17d0 is green (except for clang warnings which I am fixing in the next commit), so if this is blocking too many things you could run with it. It is however wrong, as the old monkeys demonstrate if they run for long enough. It is wrong in a way I understand and have documented in line.txt, and I think I know how to fix that, though it will involve writing some truly disgusting regular expressions.

Also note that so far this PR does not upgrade any of the tailored copies of the line breaking algorithm (which should receive the same changes as the default). I don’t want to do that before I get the changes to the default right.

eggrobin avatar Jul 02 '24 23:07 eggrobin

> @markusicu Status report: 70089cd is green (except for clang warnings which I am fixing in the next commit),

Great, thanks! :tada:

> so if this is blocking too many things you could run with it. It is however wrong, as the old monkeys demonstrate if they run for long enough.

Given the US holiday and your and Elango's travel schedules, I suggest that we keep this PR open for now. If you have more time to work on it, you can make progress right here. It would be nice if it was still "green" next week. At that point I (and maybe Andy) could look it over for plausibility and code changes, and merge. And then I might try to rebase Elango's InCB PR -- or I might just wait for his return. Separately I could start fixing ICU UTS46 code for 16 once this PR is in.

markusicu avatar Jul 02 '24 23:07 markusicu

Added Andy as a reviewer for the segmentation changes. (incomplete, see comments above and separate email)

markusicu avatar Jul 02 '24 23:07 markusicu

With https://github.com/unicode-org/icu/pull/3028/commits/c96eb89656e806c008ada83cef7b380b84ad6608, the old monkeys would quickly have caught the issue in the rules at https://github.com/unicode-org/icu/pull/3028/commits/6ae4c111843daf671bb4777ed3d6f69374df43d4, e.g.

         C:\Users\robin\Projects\Unicode\icu\icu4c\source\test\intltest\rbbitst.cpp:4559 Break expected but not found at index 165. Parameters to reproduce: @"type=line seed=2687737441 loop=1"
              149 :  |  |  \U0001f599  ID            LB 9 - adjust for combining sequences.    SIDEWAYS WHITE RIGHT POINTING INDEX
              151 :  .  .      \ufe15  EX&eastAsian  LB 13  Don't break before closings.       PRESENTATION FORM FOR VERTICAL EXCLAMATION MARK
              152 :  |  |      \u2e3a  B2            LB 9 - adjust for combining sequences.    TWO-EM DASH
              153 :  .  .      \u200d  ZWJ                                                     ZERO WIDTH JOINER
              154 :  .  .  \U0001343f  CL            LB 8a ZWJ x                               EGYPTIAN HIEROGLYPH END WALLED ENCLOSURE
              156 :  .  .      \u2060  WJ            LB 11  Do not break before or after WORD JOINER and related characters.  WORD JOINER

              157 :  .  .      \ufffc  CB            LB 11  Do not break before or after WORD JOINER and related characters.  OBJECT REPLACEMENT CHARACTER
              158 :  .  .      \u00ab  QU&Pi         LB 19a [^\p{ea=F}\p{ea=W}\p{ea=H}] × QU  LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
              159 :  .  .      \u25cc  AL&DOTTEDC.   LB 19 [QU-\p{Pf}] ×                      DOTTED CIRCLE
              160 :  .  .  \U0001e2ef  CM                                                      WANCHO TONE KOINI
              162 :  |  |      \ufe5d  OP&eastAsian  LB 9 - adjust for combining sequences.    SMALL LEFT TORTOISE SHELL BRACKET
              163 :  .  .      \u0020  SP            LB 7  Don't break before spaces or zero-width space.  SPACE

              164 :  .  .      \u302d  CM&eastAsian  LB 14 Don't break after OP SP*            IDEOGRAPHIC ENTERING TONE MARK
          --> 165 :  |  .      \u2e09  QU&Pi         LB 9 - adjust for combining sequences.    LEFT TRANSPOSITION BRACKET
              166 :  .  .      \ufe6a  PO&eastAsian  LB 19 [QU-\p{Pf}] ×                      SMALL PERCENT SIGN
              167 :  .  .      \u25cc  AL&DOTTEDC.   LB 24 no break between prefix and letters or ideographs  DOTTED CIRCLE

              168 :  .  .      \u17db  PR            LB 24 no break between prefix and letters or ideographs  KHMER CURRENCY SYMBOL RIEL

              169 :  .  .      \uda97  SG            LB 24 no break between prefix and letters or ideographs  <lead surrogate-DA97>

              170 :  .  .      \u0bf9  PR            LB 24 no break between prefix and letters or ideographs  TAMIL RUPEE SIGN

              171 :  .  .      \u000a  LF            LB 6  Don't break before hard line breaks  <control-000A>
              172 :  |  |  \U0001f8fd  XX&ExtPicCn   LB 9 - adjust for combining sequences.    <unassigned-1F8FD>
              174 :  .  .      \ufe69  PR&eastAsian  LB 24 no break between prefix and letters or ideographs  SMALL DOLLAR SIGN

              175 :  .  .      \u25cc  AL&DOTTEDC.   LB 24 no break between prefix and letters or ideographs  DOTTED CIRCLE

              176 :  |  |      \uff62  OP&eastAsian  LB 9 - adjust for combining sequences.    HALFWIDTH LEFT CORNER BRACKET
              177 :  .  .  \U0001f3fc  EM            LB 14 Don't break after OP SP*            EMOJI MODIFIER FITZPATRICK TYPE-3

It also catches this issue:

         C:\Users\robin\Projects\Unicode\icu\icu4c\source\test\intltest\rbbitst.cpp:4559 Break expected but not found at index 35. Parameters to reproduce: @"type=line seed=2505005185 loop=1"
               26 :  |  |  \U000193b8  XX            LB 9 - adjust for combining sequences.    <unassigned-193B8>
               28 :  .  .      \u00bf  OP            LB 30 No break in letters, numbers, or ordinary symbols, opening/closing punctuation.  INVERTED QUESTION MARK
               29 :  .  .      \u2029  BK            LB 6  Don't break before hard line breaks  PARAGRAPH SEPARATOR
               30 :  |  |  \U0001f1f8  RI            LB 9 - adjust for combining sequences.    REGIONAL INDICATOR SYMBOL LETTER S
               32 :  |  |      \u27c5  OP            LB 9 - adjust for combining sequences.    LEFT S-SHAPED BAG DELIMITER
               33 :  .  .      \u0020  SP            LB 7  Don't break before spaces or zero-width space.  SPACE

               34 :  .  .      \u3000  BA&eastAsian  LB 14 Don't break after OP SP*            IDEOGRAPHIC SPACE
          -->  35 :  |  .      \u2e09  QU&Pi         LB 9 - adjust for combining sequences.    LEFT TRANSPOSITION BRACKET
               36 :  .  .  \U0001f195  AI&eastAsian  LB 19 [QU-\p{Pf}] ×                      SQUARED NEW
               38 :  .  .      \u0085  NL            LB 6  Don't break before hard line breaks  <control-0085>
               39 :  |  |      \u005d  CP            LB 9 - adjust for combining sequences.    RIGHT SQUARE BRACKET
               40 :  |  |      \u111c  JL            LB 9 - adjust for combining sequences.    HANGUL CHOSEONG MIEUM-PIEUP
               41 :  |  |      \u1b44  VI            LB 9 - adjust for combining sequences.    BALINESE ADEG ADEG

~~I will try to move forward with making the first kind of issue the expected behaviour (that requires a small change to UAX14, namely adding to https://www.unicode.org/reports/tr14/tr14-52.html#LB10 "and had ea=Na"), and to fix the other issue.~~ [Never mind, let’s go ahead with the current UAX; see below.]

eggrobin avatar Jul 03 '24 18:07 eggrobin

Notice: the branch changed across the force-push!

  • icu4c/source/test/intltest/rbbitst.cpp is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Status report: I went back to my original approach of using look-ahead break rules.

Thanks to a comment from @aheninger,

> Look ahead rules with ambiguous preceding context can lead to weird behavior.

I was able to figure out the issue with the batch of rules I had added in 9782d0d60661b8970848dd6a9c271fe2651da8e4, and to fix the BA situation as well.

This has passed a million random strings with the old monkeys as updated by c96eb89656e806c008ada83cef7b380b84ad6608 (which had immediately found the bugs that were lurking in 6ae4c111843daf671bb4777ed3d6f69374df43d4; the issue here was that ea=W is very rare in \p{lb=CM} (4,5‰) and \p{lb=BA} (3,7‰), so the generator needs to compensate for that in order to put characters with these combinations of properties in interesting contexts).
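The compensation for rare property combinations can be sketched as follows. This is a minimal C++ illustration under stated assumptions, not the code in rbbitst.cpp: `pickBoosted` and `rareFraction` are hypothetical names, the 995:5 split only approximates the per-mille figures quoted above, and the 50% boost is an arbitrary choice.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Picks from `rare` with probability `boost`, otherwise from `common`,
// instead of sampling uniformly from the union of both pools.
// Precondition: `common` is not empty.
template <typename Engine>
char32_t pickBoosted(const std::vector<char32_t>& common,
                     const std::vector<char32_t>& rare,
                     double boost, Engine& rng) {
    std::bernoulli_distribution chooseRare(boost);
    const std::vector<char32_t>& pool =
        (!rare.empty() && chooseRare(rng)) ? rare : common;
    std::uniform_int_distribution<std::size_t> index(0, pool.size() - 1);
    return pool[index(rng)];
}

// Fraction of draws that land in the rare pool for a given boost.
double rareFraction(double boost, int draws) {
    // Stand-ins: U+0301 for lb=CM without ea=W, U+302D for lb=CM with ea=W.
    const std::vector<char32_t> common(995, U'\u0301');
    const std::vector<char32_t> rare(5, U'\u302D');
    std::mt19937 rng(12345);
    int hits = 0;
    for (int i = 0; i < draws; ++i) {
        if (pickBoosted(common, rare, boost, rng) == U'\u302D') ++hits;
    }
    return static_cast<double>(hits) / draws;
}
```

With `boost` near 0.005 the rare characters show up about as often as under uniform sampling; with `boost` at 0.5 roughly half the picks are rare, so they land in interesting contexts far more often.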

I will probably do some git history cleanups (drop 70089cd68383daeb611017393708b54a907f17d0, squash d26a3214a16d039ba24459c3c9200988e643d60e into 9782d0d60661b8970848dd6a9c271fe2651da8e4, remove the stray change from 4df566cdd3c965d8210119abb1182d0176b453de, etc.), fix up the comments in the various line.txt files (many still describe rules as being non-UAX #‌14 behaviour, but everything has been upstreamed now), and then move on to the tailorings.

~~There should be no need for a last-minute change to UAX #‌14, and the rules currently in this branch should now be a correct implementation of UAX #‌14 version 16.0β.~~

EDIT: Well, not quite yet:

         C:\Users\robin\Projects\Unicode\icu\icu4c\source\test\intltest\rbbitst.cpp:4560 Break expected but not found at index 19. Parameters to reproduce: @"type=line seed=1263010113 loop=1"
               10 :  |  |  \U0001f18e  AI&eastAsian  LB 9 - adjust for combining sequences.    NEGATIVE SQUARED AB
               12 :  .  .      \u20d2  CM                                                      COMBINING LONG VERTICAL LINE OVERLAY
               13 :  |  |      \u1115  JL            LB 9 - adjust for combining sequences.    HANGUL CHOSEONG NIEUN-TIKEUT
               14 :  |  |      \u2329  OP&eastAsian  LB 9 - adjust for combining sequences.    LEFT-POINTING ANGLE BRACKET
               15 :  .  .      \u3099  CM&eastAsian                                            COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
               16 :  .  .      \u0020  SP            LB 7  Don't break before spaces or zero-width space.  SPACE
               17 :  .  .      \u200d  ZWJ           LB 14 Don't break after OP SP*            ZERO WIDTH JOINER
               18 :  .  .      \u3000  BA&eastAsian  LB 8a ZWJ x                               IDEOGRAPHIC SPACE
          -->  19 :  |  .      \u2e20  QU&Pi         LB 9 - adjust for combining sequences.    LEFT VERTICAL BAR WITH QUILL
               20 :  .  .      \uff68  CJ            LB 19 [QU-\p{Pf}] ×                      HALFWIDTH KATAKANA LETTER SMALL I
               21 :  .  .      \u200d  ZWJ                                                     ZERO WIDTH JOINER
               22 :  .  .  \U0001f8e6  XX&ExtPictCn  LB 8a ZWJ x                               <unassigned-1F8E6>
               24 :  .  .      \u309c  NS&eastAsian  LB 21                                     KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
               25 :  |  |      \u1bf3  VF            LB 9 - adjust for combining sequences.    BATAK PANONGONAN
               26 :  |  |      \u60a2  ID&eastAsian  LB 9 - adjust for combining sequences.    CJK UNIFIED IDEOGRAPH-60A2
               27 :  |  |      \u1138  JL            LB 9 - adjust for combining sequences.    HANGUL CHOSEONG SIOS-KHIEUKH

Anyway, more tomorrow.

eggrobin avatar Jul 04 '24 00:07 eggrobin

Aside from the fact that I still have edge cases to deal with, if we want to rely on the old monkeys, perhaps we should feed them more bits; it certainly looks like the LCG has cycled here (over the course of a couple million test strings).

         C:\Users\robin\Projects\Unicode\icu\icu4c\source\test\intltest\rbbitst.cpp:4560 Break expected but not found at index 19. Parameters to reproduce: @"type=line seed=1263010113 loop=1"
               10 :  |  |  \U0001f18e  AI&eastAsian  LB 9 - adjust for combining sequences.    NEGATIVE SQUARED AB
               12 :  .  .      \u20d2  CM                                                      COMBINING LONG VERTICAL LINE OVERLAY
               13 :  |  |      \u1115  JL            LB 9 - adjust for combining sequences.    HANGUL CHOSEONG NIEUN-TIKEUT
               14 :  |  |      \u2329  OP&eastAsian  LB 9 - adjust for combining sequences.    LEFT-POINTING ANGLE BRACKET
               15 :  .  .      \u3099  CM&eastAsian                                            COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
               16 :  .  .      \u0020  SP            LB 7  Don't break before spaces or zero-width space.  SPACE
               17 :  .  .      \u200d  ZWJ           LB 14 Don't break after OP SP*            ZERO WIDTH JOINER
               18 :  .  .      \u3000  BA&eastAsian  LB 8a ZWJ x                               IDEOGRAPHIC SPACE
          -->  19 :  |  .      \u2e20  QU&Pi         LB 9 - adjust for combining sequences.    LEFT VERTICAL BAR WITH QUILL
               20 :  .  .      \uff68  CJ            LB 19 [QU-\p{Pf}] ×                      HALFWIDTH KATAKANA LETTER SMALL I
               21 :  .  .      \u200d  ZWJ                                                     ZERO WIDTH JOINER
               22 :  .  .  \U0001f8e6  XX&ExtPictCn  LB 8a ZWJ x                               <unassigned-1F8E6>
               24 :  .  .      \u309c  NS&eastAsian  LB 21                                     KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
               25 :  |  |      \u1bf3  VF            LB 9 - adjust for combining sequences.    BATAK PANONGONAN
               26 :  |  |      \u60a2  ID&eastAsian  LB 9 - adjust for combining sequences.    CJK UNIFIED IDEOGRAPH-60A2
               27 :  |  |      \u1138  JL            LB 9 - adjust for combining sequences.    HANGUL CHOSEONG SIOS-KHIEUKH
         C:\Users\robin\Projects\Unicode\icu\icu4c\source\test\intltest\rbbitst.cpp:4560 Break expected but not found at index 434. Parameters to reproduce: @"type=line seed=1479585929 loop=1"
              425 :  |  |  \U0001f18e  AI&eastAsian  LB 9 - adjust for combining sequences.    NEGATIVE SQUARED AB
              427 :  .  .      \u20d2  CM                                                      COMBINING LONG VERTICAL LINE OVERLAY
              428 :  |  |      \u1115  JL            LB 9 - adjust for combining sequences.    HANGUL CHOSEONG NIEUN-TIKEUT
              429 :  |  |      \u2329  OP&eastAsian  LB 9 - adjust for combining sequences.    LEFT-POINTING ANGLE BRACKET
              430 :  .  .      \u3099  CM&eastAsian                                            COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
              431 :  .  .      \u0020  SP            LB 7  Don't break before spaces or zero-width space.  SPACE
              432 :  .  .      \u200d  ZWJ           LB 14 Don't break after OP SP*            ZERO WIDTH JOINER
              433 :  .  .      \u3000  BA&eastAsian  LB 8a ZWJ x                               IDEOGRAPHIC SPACE
          --> 434 :  |  .      \u2e20  QU&Pi         LB 9 - adjust for combining sequences.    LEFT VERTICAL BAR WITH QUILL
              435 :  .  .      \uff68  CJ            LB 19 [QU-\p{Pf}] ×                      HALFWIDTH KATAKANA LETTER SMALL I
              436 :  .  .      \u200d  ZWJ                                                     ZERO WIDTH JOINER
              437 :  .  .  \U0001f8e6  XX&ExtPictCn  LB 8a ZWJ x                               <unassigned-1F8E6>
              439 :  .  .      \u309c  NS&eastAsian  LB 21                                     KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
              440 :  |  |      \u1bf3  VF            LB 9 - adjust for combining sequences.    BATAK PANONGONAN
              441 :  |  |      \u60a2  ID&eastAsian  LB 9 - adjust for combining sequences.    CJK UNIFIED IDEOGRAPH-60A2
              442 :  |  |      \u1138  JL            LB 9 - adjust for combining sequences.    HANGUL CHOSEONG SIOS-KHIEUKH
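Why a linear congruential generator can come back to the same string is easy to see on a toy example. The C++ sketch below uses a 16-bit LCG with constants chosen for illustration only; it makes no claim about the generator or constants actually used in rbbitst.cpp, and `toyLcgNext` and `cycleLength` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Toy LCG: x -> (a * x + c) mod 2^16. Illustrative constants only.
std::uint32_t toyLcgNext(std::uint32_t x) {
    return (25173u * x + 13849u) & 0xFFFFu;
}

// Number of steps until the state returns to the (masked) seed.
std::uint32_t cycleLength(std::uint32_t seed) {
    const std::uint32_t start = seed & 0xFFFFu;
    std::uint32_t x = toyLcgNext(start);
    std::uint32_t steps = 1;
    while (x != start) {
        x = toyLcgNext(x);
        ++steps;
    }
    return steps;
}
```

The whole state is the current value, so once a state repeats, every later output repeats too. A generator with a small state cannot produce more distinct test strings than it has states, however long the monkeys run.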

eggrobin avatar Jul 04 '24 01:07 eggrobin

Notice: the branch changed across the force-push!

  • icu4c/source/test/intltest/rbbitst.cpp is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Notice: the branch changed across the force-push!

  • icu4c/source/test/intltest/rbbitst.cpp is different
  • icu4c/source/test/intltest/rbbitst.h is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Alright, now using ranlux48 (I had initially gone with mt19937_64, but it turns out it is not great statistically, which I do not really care about, and it has a giant state, which is a problem since we want to print the state). The cycle length should be something like 2^576, so running the tests overnight might now actually do more than running them for half an hour.
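Printing and restoring the engine state can be sketched with the standard stream operators on `std::ranlux48`. This is a minimal C++ illustration, not the code in rbbitst.cpp; `saveState` and `restoreState` are hypothetical helper names, and the exact textual layout of the state is left to the standard library implementation.

```cpp
#include <cassert>
#include <random>
#include <sstream>
#include <string>

// Serializes the full engine state as whitespace-separated integers.
std::string saveState(const std::ranlux48& engine) {
    std::ostringstream out;
    out << engine;
    return out.str();
}

// Rebuilds an engine from text produced by saveState.
std::ranlux48 restoreState(const std::string& state) {
    std::ranlux48 engine;
    std::istringstream in(state);
    in >> engine;
    return engine;
}
```

A failing test can log `saveState(engine)` before generating its string, and a later run can call `restoreState` on that text to replay exactly the same sequence.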

Some example failures with the current rules (the pattern is obvious, fix coming).

         C:\Users\robin\Projects\Unicode\icu\icu4c\source\test\intltest\rbbitst.cpp:4577 Break expected but not found at index 616. Parameters to reproduce: @"type=line engineState=[129582174696517 236031677744484 92106136591357 53087490761213 168397393323393 214824034318205 223771839526961 68791514640069 87858497332517 265571939623894 241053698228835 278905147843064 1 11] loop=1"
              600 :  |  |      \ub23c  H2            LB 9 - adjust for combining sequences.    HANGUL SYLLABLE NWE
              601 :  .  .      \u000d  CR            LB 6  Don't break before hard line breaks  <control-000D>
              602 :  |  |      \uff01  EX&eastAsian  LB 9 - adjust for combining sequences.    FULLWIDTH EXCLAMATION MARK
              603 :  |  |  \U00011f25  AK            LB 9 - adjust for combining sequences.    KAWI LETTER NA
              605 :  .  .      \u200d  ZWJ                                                     ZERO WIDTH JOINER
              606 :  .  .      \u200d  ZWJ                                                     ZERO WIDTH JOINER
              607 :  .  .      \u2039  QU&Pi         LB 8a ZWJ x                               SINGLE LEFT-POINTING ANGLE QUOTATION MARK
              608 :  .  .      \u1b67  ID            LB 19 [QU-\p{Pf}] ×                      BALINESE MUSICAL SYMBOL DAENG
              609 :  .  .      \ufeff  WJ            LB 11  Do not break before or after WORD JOINER and related characters.  ZERO WIDTH NO-BREAK SPACE
              610 :  .  .  \U0001f17e  AI            LB 11  Do not break before or after WORD JOINER and related characters.  NEGATIVE SQUARED LATIN CAPITAL LETTER O
              612 :  .  .      \u005b  OP            LB 30 No break in letters, numbers, or ordinary symbols, opening/closing punctuation.  LEFT SQUARE BRACKET
              613 :  .  .      \u0020  SP            LB 7  Don't break before spaces or zero-width space.  SPACE

              614 :  .  .      \u200d  ZWJ           LB 14 Don't break after OP SP*            ZERO WIDTH JOINER
              615 :  .  .      \u3000  BA&eastAsian  LB 8a ZWJ x                               IDEOGRAPHIC SPACE
          --> 616 :  |  .      \u201b  QU&Pi         LB 9 - adjust for combining sequences.    SINGLE HIGH-REVERSED-9 QUOTATION MARK
              617 :  .  .      \ubb73  H3            LB 19 [QU-\p{Pf}] ×                      HANGUL SYLLABLE MWED
              618 :  |  |      \u2e1c  QU&Pi         LB 9 - adjust for combining sequences.    LEFT LOW PARAPHRASE BRACKET
              619 :  .  .      \uff04  PR&eastAsian  LB 19 [QU-\p{Pf}] ×                      FULLWIDTH DOLLAR SIGN
              620 :  .  .  \U0001f575  EB            LB 23a                                    SLEUTH OR SPY
              622 :  |  |      \ud930  SG            LB 9 - adjust for combining sequences.    <lead surrogate-D930>
              623 :  .  .  \U00016fe4  GL&eastAsian  LB 12a  [^SP BA HY] x GL                  KHITAN SMALL SCRIPT FILLER
         C:\Users\robin\Projects\Unicode\icu\icu4c\source\test\intltest\rbbitst.cpp:4577 Break expected but not found at index 369. Parameters to reproduce: @"type=line engineState=[77166811820670 237509173405882 241096858294069 143026002418558 83114972300585 115931590404619 117762907704810 175550103338949 222514071576148 25341366417220 127238580822311 22064649469508 0 3] loop=1"
              353 :  |  |  \U00016118  AS            LB 9 - adjust for combining sequences.    GURUNG KHEMA LETTER BHA
              355 :  |  |      \u1117  JL            LB 9 - adjust for combining sequences.    HANGUL CHOSEONG TIKEUT-KIYEOK
              356 :  .  .  \U00011c9b  CM                                                      MARCHEN SUBJOINED LETTER THA
              358 :  .  .      \u2025  IN            LB 22                                     TWO DOT LEADER
              359 :  |  |      \u1bd4  AS            LB 9 - adjust for combining sequences.    BATAK LETTER MA
              360 :  .  .  \U0001f678  QU            LB 19 × [QU-\p{Pi}]                      SANS-SERIF HEAVY LOW DOUBLE COMMA QUOTATION MARK ORNAMENT
              362 :  .  .  \U0001f1f5  RI            LB 19 [QU-\p{Pf}] ×                      REGIONAL INDICATOR SYMBOL LETTER P
              364 :  .  .      \u2039  QU&Pi         LB 19a [^\p{ea=F}\p{ea=W}\p{ea=H}] × QU  SINGLE LEFT-POINTING ANGLE QUOTATION MARK
              365 :  .  .      \u29da  OP            LB 19 [QU-\p{Pf}] ×                      LEFT DOUBLE WIGGLY FENCE
              366 :  .  .      \u0020  SP            LB 7  Don't break before spaces or zero-width space.  SPACE

              367 :  .  .      \u200d  ZWJ           LB 14 Don't break after OP SP*            ZERO WIDTH JOINER
              368 :  .  .      \u3000  BA&eastAsian  LB 8a ZWJ x                               IDEOGRAPHIC SPACE
          --> 369 :  |  .      \u201c  QU&Pi         LB 9 - adjust for combining sequences.    LEFT DOUBLE QUOTATION MARK
              370 :  .  .  \U0001f3ff  EM            LB 19 [QU-\p{Pf}] ×                      EMOJI MODIFIER FITZPATRICK TYPE-6
              372 :  |  |  \U0001f191  AI&eastAsian  LB 9 - adjust for combining sequences.    SQUARED CL
              374 :  .  .      \u2010  BA&HYPHEN     LB 21                                     HYPHEN
              375 :  |  |  \U0001193f  AP            LB 9 - adjust for combining sequences.    DIVES AKURU PREFIXED NASAL SIGN
              377 :  .  .      \u2010  BA&HYPHEN     LB 21                                     HYPHEN
         C:\Users\robin\Projects\Unicode\icu\icu4c\source\test\intltest\rbbitst.cpp:4577 Break expected but not found at index 188. Parameters to reproduce: @"type=line engineState=[39250322652124 26602965664501 44462335292463 257996093124684 217060184538580 59496091358185 239738342751554 187683887998792 27099803220938 235694106807173 257010418855911 52161473516717 0 4] loop=1"
              181 :  |  |  \U000ac124  XX            LB 9 - adjust for combining sequences.    <unassigned-AC124>
              183 :  |  |      \ud7f2  JT            LB 9 - adjust for combining sequences.    HANGUL JONGSEONG SIOS-HIEUH
              184 :  |  |      \u298f  OP            LB 9 - adjust for combining sequences.    LEFT SQUARE BRACKET WITH TICK IN BOTTOM CORNER
              185 :  .  .      \u0020  SP            LB 7  Don't break before spaces or zero-width space.  SPACE

              186 :  .  .      \u3099  CM&eastAsian  LB 14 Don't break after OP SP*            COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
              187 :  .  .      \u3000  BA&eastAsian  LB 21                                     IDEOGRAPHIC SPACE
          --> 188 :  |  .      \u2039  QU&Pi         LB 9 - adjust for combining sequences.    SINGLE LEFT-POINTING ANGLE QUOTATION MARK
              189 :  .  .      \uffe6  PR&eastAsian  LB 19 [QU-\p{Pf}] ×                      FULLWIDTH WON SIGN
              190 :  .  .      \u11c9  JT            LB 27 Treat a Korean Syllable Block the same as ID.  HANGUL JONGSEONG NIEUN-THIEUTH

              191 :  |  |  \U00018bd9  AL&eastAsian  LB 9 - adjust for combining sequences.    KHITAN SMALL SCRIPT CHARACTER-18BD9
              193 :  |  |  \U00011004  AP            LB 9 - adjust for combining sequences.    BRAHMI SIGN UPADHMANIYA
              195 :  |  |      \u1bf2  VF            LB 9 - adjust for combining sequences.    BATAK PANGOLAT
         C:\Users\robin\Projects\Unicode\icu\icu4c\source\test\intltest\rbbitst.cpp:4577 Break expected but not found at index 386. Parameters to reproduce: @"type=line engineState=[172765020618646 156508599385088 72517161029338 279818113522329 96389523728778 67370634251451 32220910484813 18641403109399 231866011256099 3429284783750 94723958454661 116055257862891 1 5] loop=1"
              378 :  |  |  \U0001106d  AS            LB 9 - adjust for combining sequences.    BRAHMI DIGIT SEVEN
              380 :  |  |  \U0001f590  EB            LB 9 - adjust for combining sequences.    RAISED HAND WITH FINGERS SPLAYED
              382 :  |  |      \ufe39  OP&eastAsian  LB 9 - adjust for combining sequences.    PRESENTATION FORM FOR VERTICAL LEFT TORTOISE SHELL BRACKET
              383 :  .  .      \u0020  SP            LB 7  Don't break before spaces or zero-width space.  SPACE

              384 :  .  .      \u200d  ZWJ           LB 14 Don't break after OP SP*            ZERO WIDTH JOINER
              385 :  .  .      \u3000  BA&eastAsian  LB 8a ZWJ x                               IDEOGRAPHIC SPACE
          --> 386 :  |  .      \u2e20  QU&Pi         LB 9 - adjust for combining sequences.    LEFT VERTICAL BAR WITH QUILL
              387 :  .  .      \ufe44  CL&eastAsian  LB 13  Don't break before closings.       PRESENTATION FORM FOR VERTICAL RIGHT WHITE CORNER BRACKET
              388 :  |  |  \U0001ff6d  ID&ExtPictCn  LB 9 - adjust for combining sequences.    <unassigned-1FF6D>
              390 :  |  |  \U0001f140  AI            LB 9 - adjust for combining sequences.    SQUARED LATIN CAPITAL LETTER Q
              392 :  .  .  \U00011c71  EX            LB 13  Don't break before closings.       MARCHEN MARK SHAD
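
For reading dumps like the ones above, a small parser can pick out the rows where the two break columns disagree. This is only a sketch: it assumes the first `|`/`.` column is the expected break and the second is what the break iterator produced (which matches the "Break expected but not found" message on the `-->` rows), and the names `parse_rows` and `mismatches` are hypothetical helpers, not part of ICU.

```python
import re

# One row of the monkey test dump: optional "-->" marker, index, two break
# columns, the escaped code point, the class column, then rule text and name.
ROW = re.compile(
    r"^\s*(?P<marker>-->)?\s*(?P<index>\d+) :"
    r"\s+(?P<expected>[|.])\s+(?P<actual>[|.])"
    r"\s+(?P<char>\\[uU][0-9a-fA-F]+)\s+(?P<cls>\S+)\s+(?P<rest>.*)$"
)


def parse_rows(log):
    """Return one dict per dump row; lines that are not rows are skipped."""
    rows = []
    for line in log.splitlines():
        m = ROW.match(line)
        if m is None:
            continue
        rows.append({
            "index": int(m["index"]),
            "expected_break": m["expected"] == "|",  # assumed: first column
            "actual_break": m["actual"] == "|",      # assumed: second column
            "char": m["char"],
            "classes": m["cls"].split("&"),          # e.g. ["QU", "Pi"]
            "flagged": m["marker"] is not None,
        })
    return rows


def mismatches(rows):
    """Rows where the expected and actual break columns differ."""
    return [r for r in rows if r["expected_break"] != r["actual_break"]]


# Three rows copied from the dump above.
SAMPLE = r"""
    385 :  .  .      \u3000  BA&eastAsian  LB 8a ZWJ x                               IDEOGRAPHIC SPACE
--> 386 :  |  .      \u2e20  QU&Pi         LB 9 - adjust for combining sequences.    LEFT VERTICAL BAR WITH QUILL
    387 :  .  .      \ufe44  CL&eastAsian  LB 13  Don't break before closings.       PRESENTATION FORM FOR VERTICAL RIGHT WHITE CORNER BRACKET
"""

for row in mismatches(parse_rows(SAMPLE)):
    print(row["index"], row["char"], row["classes"])
```

In all three failures above, the mismatching row is a `QU&Pi` character that follows an east Asian `BA` (U+3000).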

eggrobin avatar Jul 04 '24 19:07 eggrobin

Notice: the branch changed across the force-push!

  • icu4c/source/test/intltest/rbbitst.cpp is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

I was able to figure out the issue with the batch of rules I had added in 9782d0d60661b8970848dd6a9c271fe2651da8e4, and to fix the BA situation as well.

Correction: I was able to get genbrk to stop failing an assertion that appears to be related to ambiguous preceding context. However, in the process, the rules became incorrect.

The coexistence of

($OP $CM* $SP+ | [$OP $QU $GL] $CM*) ([\p{Pi} & $QU] $CM* $SP*)+ $SP [$CM & [\p{ea=F}\p{ea=W}\p{ea=H}]] ($CM* $CMX)? / [\p{Pi} & $QU] $CM* [ [\p{ea=F}\p{ea=W}\p{ea=H}] - $CM];
^([\p{Pi} & $QU] $CM* $SP*)+ $SP [$CM & [\p{ea=F}\p{ea=W}\p{ea=H}]] ($CM* $CMX)?                                     / [\p{Pi} & $QU] $CM* [ [\p{ea=F}\p{ea=W}\p{ea=H}] - $CM];

caused an assertion failure in genbrk. Changing the first one to

($OP $CM* $SP+ | [$OP [$QU-\p{Pi}] $GL] $CM*) ([\p{Pi} & $QU] $CM* $SP*)+ $SP [$CM & [\p{ea=F}\p{ea=W}\p{ea=H}]] ($CM* $CMX)? / [\p{Pi} & $QU] $CM* [ [\p{ea=F}\p{ea=W}\p{ea=H}] - $CM];

fixed the genbrk failure, but it makes the algorithm incorrect in cases such as the following:

         C:\Users\robin\Projects\Unicode\icu\icu4c\source\test\intltest\rbbitst.cpp:4576 Break expected but not found at index 343. Parameters to reproduce: @"type=line engineState=[2554567808342 84362580899683 40783698935486 126079489133127 158986175474216 225539149301120 12699258960171 171041251938665 247037372527467 81213390993136 124057450842805 82209758453811 1 7] loop=1"
              334 :  |  |      \u2e3b  B2            LB 9 - adjust for combining sequences.    THREE-EM DASH
              335 :  |  |      \u1c47  NU            LB 9 - adjust for combining sequences.    LEPCHA DIGIT SEVEN
              336 :  |  |  \U00011aa0  BB            LB 9 - adjust for combining sequences.    SOYOMBO HEAD MARK WITH MOON AND SUN
              338 :  .  .      \u2e0c  QU&Pi         LB 19a [^\p{ea=F}\p{ea=W}\p{ea=H}] × QU  LEFT RAISED OMISSION BRACKET
              339 :  .  .      \u2e20  QU&Pi         LB 19 [QU-\p{Pf}] ×                      LEFT VERTICAL BAR WITH QUILL
              340 :  .  .      \u0020  SP            LB 7  Don't break before spaces or zero-width space.  SPACE

              341 :  .  .  \U00016ff0  CM&eastAsian  LB 15a (OP | QU | GL) [\p{Pi}&QU] SP* x   VIETNAMESE ALTERNATE READING MARK CA
          --> 343 :  |  .      \u2e04  QU&Pi         LB 9 - adjust for combining sequences.    LEFT DOTTED SUBSTITUTION BRACKET
              344 :  .  .      \ua964  JL            LB 19 [QU-\p{Pf}] ×                      HANGUL CHOSEONG RIEUL-KIYEOK
              345 :  |  |  \U000113d1  AP            LB 9 - adjust for combining sequences.    TULU-TIGALARI REPHA
              347 :  .  .  \U00010af6  IN            LB 22                                     MANICHAEAN PUNCTUATION LINE FILLER
              349 :  .  .      \u27eb  CL            LB 13  Don't break before closings.       MATHEMATICAL RIGHT DOUBLE ANGLE BRACKET
              350 :  .  .      \u2010  BA&HYPHEN     LB 21                                     HYPHEN
              351 :  |  |      \u1bf3  VF            LB 9 - adjust for combining sequences.    BATAK PANONGONAN
              352 :  |  |  \U0001f1f3  RI            LB 9 - adjust for combining sequences.    REGIONAL INDICATOR SYMBOL LETTER N
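
The only difference between the two versions of the first rule is the leading set: `[$OP $QU $GL]` became `[$OP [$QU-\p{Pi}] $GL]`. The sketch below shows which characters that change removes from the set. It is not the RBBI rule compiler and does not reproduce the full rule match; the tiny `LINE_BREAK_CLASS` table is a hypothetical stand-in for the real property data, filled with classes from the dumps above plus U+201D (QU, `Pf`) and U+00A0 (GL).

```python
import unicodedata

# Hypothetical mini table of line break classes; the real rules get these
# from the Unicode data, not from a hand-written dict.
LINE_BREAK_CLASS = {
    "\u2e0c": "QU",  # LEFT RAISED OMISSION BRACKET (Pi), index 338 above
    "\u2e20": "QU",  # LEFT VERTICAL BAR WITH QUILL (Pi), index 339 above
    "\u2e04": "QU",  # LEFT DOTTED SUBSTITUTION BRACKET (Pi), index 343 above
    "\u201d": "QU",  # RIGHT DOUBLE QUOTATION MARK (Pf)
    "\u298f": "OP",  # LEFT SQUARE BRACKET WITH TICK IN BOTTOM CORNER
    "\u00a0": "GL",  # NO-BREAK SPACE
    "\u0020": "SP",  # SPACE
}


def in_original_set(ch):
    # Models [$OP $QU $GL]
    return LINE_BREAK_CLASS.get(ch) in {"OP", "QU", "GL"}


def in_changed_set(ch):
    # Models [$OP [$QU-\p{Pi}] $GL]: QU is kept only when it is not
    # General_Category=Initial_Punctuation.
    cls = LINE_BREAK_CLASS.get(ch)
    if cls in {"OP", "GL"}:
        return True
    return cls == "QU" and unicodedata.category(ch) != "Pi"


# Characters that the change removes from the leading set.
dropped = sorted(
    ch for ch in LINE_BREAK_CLASS
    if in_original_set(ch) and not in_changed_set(ch)
)
for ch in dropped:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

Under this model, the `QU&Pi` character at index 338 (U+2E0C) is accepted by the original leading set but not by the changed one, so the changed rule can no longer start its match there.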

eggrobin avatar Jul 04 '24 20:07 eggrobin