icu ICU-23038 Unicode 17 beta

Unicode 17 beta WIP

Best reviewed one commit at a time.

Checklist

[x] Required: Issue filed: ICU-23038
[x] Required: The PR title must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
[x] Required: Each commit message must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
[x] Issue accepted (done by Technical Committee after discussion)
[x] Tests included, if applicable
[x] API docs and/or User Guide docs changed or added, if applicable

ALLOW_MANY_COMMITS=meow

May 21 '25 22:05 markusicu

@eggrobin ICU4C properties data is in, so you should be able to start work on C++ RBBI.

On my machine, I currently see RBBI and collation test failures:

| ***     FAILING TEST SUMMARY FOR:              intltest  
         TestUnicodeFiles
         TestExtended
         TestMonkey
      RBBITest
   rbbi

| ***     FAILING TEST SUMMARY FOR:              cintltst  
/tstxtbd/cbiapts/TestBreakIteratorTailoring
/tscoll/capitst/TestProperty

I will work on collation and on Java data next.

May 27 '25 23:05 markusicu

@eggrobin I went through the rest of the update instructions. C++ tests still fail with rbbi, which is probably expected. (Different set of failures from before.) I haven't run Java tests yet locally.

         TestExtended
      RBBITest
         testMonkey
      RBBIMonkeyTest
   rbbi

May 28 '25 23:05 markusicu

After committing 0669e86c429c218db3e81eb077780c5704de427e, I ran the monkeys and forgot about them until earlier today, so this has been tested with 120 million strings. Good enough.

Jun 02 '25 21:06 eggrobin

@markusicu Looks like we are good on the C++ RBBI side. I will do what I can on the Java side, but as usual you will need to update the .brk files.

Jun 16 '25 15:06 eggrobin

@markusicu, your controls.

Jun 16 '25 20:06 eggrobin

@markusicu I am assuming that these Java test failures have nothing to do with RBBI:

Error:  Failures: 
Error:    PersonNameConsistencyTest.TestPersonNames:107->AbstractTestLog.errln:50 Failure in km.txt: Found 20 errors.
Error:    UCharacterTest.TestGetNumericValue:3090->AbstractTestLog.errln:50 UCharacter.getNumericValue(i) returned a different value from the expected result. Got -2Expected1000000
Error:    UnicodeSetTest.TestToPattern:256->expectToPattern:2606->AbstractTestLog.errln:50 FAIL: toPattern() => "[\u00BD\u0B73\u0D74\u0F2A\u2CFD\uA831\U00010141\U00010175\U00010176\U00010E7B\U00012226]", expected "[\u00BD\u0B73\u0D74\u0F2A\u2CFD\uA831\U00010141\U00010175\U00010176\U00010E7B]"

Jun 16 '25 21:06 eggrobin

I am assuming that these Java test failures have nothing to do with RBBI

I assume the same.

@richgillam could you please take a look at

Error: PersonNameConsistencyTest.TestPersonNames:107->AbstractTestLog.errln:50 Failure in km.txt: Found 20 errors.

?

Jun 16 '25 21:06 markusicu

@richgillam could you please take a look at

Error: PersonNameConsistencyTest.TestPersonNames:107->AbstractTestLog.errln:50 Failure in km.txt: Found 20 errors.

?

@markusicu @eggrobin Where can I see the logs? Have you been getting that error the whole time, or did it start somewhere in the middle of getting this PR through?

That test compares what ICU is producing with what the CLDR PersonNameFormatter produces and is based on a test file that we copy over from the CLDR side. This might be an indication that there was a data change on the CLDR side that somehow didn't make it over to ICU, or it might mean some data change has exposed an algorithm difference we didn't previously know about. @macchiati do you know of anything changing on the CLDR side that would affect this?

Jun 18 '25 21:06 richgillam

Actually, when we've seen this kind of thing before, the most common explanation is that the test file itself got updated on the CLDR side and somehow failed to get properly copied into the ICU project. It's been a long time since I've looked at this, but I think we keep those files under source control in ICU. I could be wrong, but I think the test files get copied over as part of the CLDR-to-ICU conversion process for all the resource data. So I'm going to guess that test file got changed on the CLDR side and we haven't run a CLDR-to-ICU integration since then (or that it went wrong somehow).
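One way to check the stale-copy hypothesis is to compare the CLDR test files with their ICU copies byte for byte. The following is only a sketch: the function name and the two directory arguments are hypothetical, and the real locations of the person name test files in the CLDR and ICU trees are not given in this thread.

```python
import filecmp
from pathlib import Path


def stale_copies(cldr_dir, icu_dir):
    """Return names of .txt test files whose ICU copy is missing or differs.

    Both directory arguments are hypothetical placeholders for wherever
    the CLDR originals and the ICU copies actually live.
    """
    stale = []
    for src in sorted(Path(cldr_dir).glob("*.txt")):
        dst = Path(icu_dir) / src.name
        # shallow=False forces a content comparison instead of trusting
        # file size and modification time.
        if not dst.is_file() or not filecmp.cmp(src, dst, shallow=False):
            stale.append(src.name)
    return stale
```

An empty result would suggest that the copies are current and that the failure has another cause.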

Jun 18 '25 21:06 richgillam

@richgillam could you please take a look at

Error: PersonNameConsistencyTest.TestPersonNames:107->AbstractTestLog.errln:50 Failure in km.txt: Found 20 errors.

?

@markusicu @eggrobin Where can I see the logs?

Follow the link from a failing CI check, for example: https://github.com/unicode-org/icu/actions/runs/15686520462/job/44191107858?pr=3505

Have you been getting that error the whole time, or did it start somewhere in the middle of getting this PR through?

I only noticed it when I looked at the CI checks here.

That test compares what ICU is producing with what the CLDR PersonNameFormatter produces and is based on a test file that we copy over from the CLDR side. This might be an indication that there was a data change on the CLDR side that somehow didn't make it over to ICU, or it might mean some data change has exposed an algorithm difference we didn't previously know about. @macchiati do you know of anything changing on the CLDR side that would affect this?

I am not integrating CLDR changes here, except for script metadata, collation and translit.

Jul 07 '25 23:07 markusicu

@richgillam Locally I turned on VERBOSE_OUTPUT in PersonNameConsistencyTest and got these lines for Khmer:

    Expected 'ស្តូបើ, ហ្. ហេ. មី.', got 'ស្តូបើ, ហ្សា. ហេ. មី.' at line 606
    Expected 'ស្តូបើ ហ្. ហេ. មី.', got 'ស្តូបើ ហ្សា. ហេ. មី.' at line 610
    Expected 'ហ្. ហេ. មី. ស្តូបើ', got 'ហ្សា. ហេ. មី. ស្តូបើ' at line 620
    Expected 'ហ្សាហ្សីលៀ ស្.', got 'ហ្សាហ្សីលៀ ស្តូ.' at line 634
    Expected 'ស្តូបើ ហ្.', got 'ស្តូបើ ហ្សា.' at line 638
    Expected 'ស្ហ្ហេ', got 'ស្តូហ្សាហេ' at line 660
    Expected 'ហ្ហេស្', got 'ហ្សាហេស្តូ' at line 664
    Expected 'ស្ហ្', got 'ស្តូហ្សា' at line 668
    Expected 'ហ្ស្', got 'ហ្សាស្តូ' at line 672
    Expected 'ស្', got 'ស្តូ' at line 676
    Expected 'ស្', got 'ស្តូ' at line 677
    Expected 'ស្', got 'ស្តូ' at line 678
    Expected 'ស្', got 'ស្តូ' at line 679
    Expected 'ហ្', got 'ហ្សា' at line 683
    Expected 'ហ្', got 'ហ្សា' at line 684
    Expected 'ហ្', got 'ហ្សា' at line 685
    Expected 'ហ្', got 'ហ្សា' at line 686
    Expected 'នីឡេ វ. ប្.', got 'នីឡេ វ. ប្រ៊ូ.' at line 769
    Expected 'ប្អាឆេ', got 'ប្រ៊ូអាឆេ' at line 773
    Expected 'អាឆេប្', got 'អាឆេប្រ៊ូ' at line 777

Reformatting the first few:

    Expected 'ស្តូបើ, ហ្. ហេ. មី.',
         got 'ស្តូបើ, ហ្សា. ហេ. មី.' at line 606
    Expected 'ស្តូបើ ហ្. ហេ. មី.', 
         got 'ស្តូបើ ហ្សា. ហេ. មី.' at line 610
    Expected 'ហ្. ហេ. មី. ស្តូបើ', 
         got 'ហ្សា. ហេ. មី. ស្តូបើ' at line 620
    Expected 'ហ្សាហ្សីលៀ ស្.', 
         got 'ហ្សាហ្សីលៀ ស្តូ.' at line 634
    Expected 'ស្តូបើ ហ្.', 
         got 'ស្តូបើ ហ្សា.' at line 638

Jul 08 '25 00:07 markusicu

It looks like the code is abbreviating the given name differently, and sometimes the surname.

given=ហ្សាហ្សីលៀ
expected=ហ្. (HA COENG)
got=ហ្សា. (HA COENG SA AA)

Khmer characters are getting Indic_Conjunct_Break values in Unicode 17 (they were all None in 16): (HA COENG SA AA) --> (Consonant Linker Consonant None), and the AA has GCB=SpacingMark. So I think GB9c now prevents the break between COENG and SA, GB9a still attaches the AA, and the whole sequence is one grapheme cluster, longer than before.
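The effect can be reproduced with a small model. This is a sketch, not ICU's implementation: it handles only GB9, GB9a and GB9c, and it covers only the four code points above. The Indic_Conjunct_Break values and the SpacingMark value for AA are taken from this comment. GCB=Extend for COENG and GCB=Other for the two consonants are assumptions, chosen because they are consistent with the old expected output. That the initial is the first grapheme cluster plus a period is likewise inferred from the test output.

```python
HA, COENG, SA, AA = "\u17A0", "\u17D2", "\u179F", "\u17B6"

# Grapheme_Cluster_Break values, assumed identical in both versions.
GCB = {HA: "Other", COENG: "Extend", SA: "Other", AA: "SpacingMark"}

# Indic_Conjunct_Break values as described in the comment.
INCB_16 = {HA: "None", COENG: "None", SA: "None", AA: "None"}
INCB_17 = {HA: "Consonant", COENG: "Linker", SA: "Consonant", AA: "None"}


def no_break_before(text, i, gcb, incb):
    """True if one of the modelled rules forbids a break before text[i]."""
    cur = text[i]
    if gcb[cur] in ("Extend", "ZWJ"):   # GB9
        return True
    if gcb[cur] == "SpacingMark":       # GB9a
        return True
    if incb[cur] == "Consonant":        # GB9c
        # Walk back over Extend/Linker; need at least one Linker and
        # then a Consonant immediately before that run.
        j, seen_linker = i - 1, False
        while j >= 0 and incb[text[j]] in ("Extend", "Linker"):
            seen_linker = seen_linker or incb[text[j]] == "Linker"
            j -= 1
        return seen_linker and j >= 0 and incb[text[j]] == "Consonant"
    return False


def clusters(text, gcb, incb):
    """Split text into grapheme clusters using only the rules above."""
    out = []
    for i, ch in enumerate(text):
        if i > 0 and no_break_before(text, i, gcb, incb):
            out[-1] += ch
        else:
            out.append(ch)
    return out


def initial(name, gcb, incb):
    """First grapheme cluster plus a period (assumed abbreviation rule)."""
    return clusters(name, gcb, incb)[0] + "."
```

With `INCB_16`, `initial(HA + COENG + SA + AA, GCB, INCB_16)` yields HA COENG plus a period, matching the old expectation. With `INCB_17` it yields all four code points plus a period, matching what the test now gets.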

What I think we need to do is:

file a CLDR ticket to adjust the person name test data for Khmer, for changes in grapheme cluster breaks
make this a logKnownIssue pointing to the CLDR ticket
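The second step can be modelled as follows. This is a Python sketch of the known-issue pattern only, not ICU4J code: `known_issue`, `KNOWN_ISSUES` and `check_locale` are hypothetical names, and the real test would call the ICU test framework's logKnownIssue instead. The ticket number is the one filed later in this thread.

```python
KNOWN_ISSUES = {}  # hypothetical registry: ticket -> list of logged comments


def known_issue(ticket, comment):
    """Record the ticket and tell the caller to skip the failing check."""
    KNOWN_ISSUES.setdefault(ticket, []).append(comment)
    return True


def check_locale(locale, error_count):
    """Return 'pass', 'known issue' or 'fail' for one locale's test file."""
    if error_count == 0:
        return "pass"
    # Only the Khmer failures are attributed to the tracked CLDR ticket.
    if locale == "km" and known_issue(
        "CLDR-18815", "Khmer initials changed with Unicode 17 grapheme clusters"
    ):
        return "known issue"
    return "fail"
```

Failures in any other locale are still reported, so the exemption stays narrow and is easy to remove once the CLDR test data is updated.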

@richgillam ok?

Jul 08 '25 00:07 markusicu

CLDR-18815 Khmer person name tests: adjust for Unicode 17 GCB changes

Jul 08 '25 00:07 markusicu

It looks like the code is abbreviating the given name differently, and sometimes the surname.

...

@richgillam ok?

Sorry I've been out of touch today; lot going on. I like your diagnosis and think both it and your recommended solution make sense. It wouldn't have occurred to me.

Jul 08 '25 01:07 richgillam

expected=ហ្. (HA COENG)

This was definitely a broken expectation, since the coeng is an invisible stacker rather than a virama; it should never occur without a following consonant, and its +-like glyph when it is alone is a Unicode invention.

Jul 08 '25 05:07 eggrobin

Hi @eggrobin / @richgillam / @echeran:

I need to fix at least some of the commit messages, to pass the required check.
I am thinking of partially rebasing and squashing, along the lines of the "TODO(egg)" commit message suggestions, and probably a few more, such as combining multiple test fixes into one.
Or I could just squash the whole thing into one single commit.
WDYT?

Jul 08 '25 05:07 markusicu

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

Jul 08 '25 21:07 jira-pull-request-webhook[bot]

FYI, we need a step in the ICU integration to ensure that we always rebuild the test data along with the data generation (if we don't already have that).

Jul 08 '25 21:07 macchiati