icu ICU-22941 Revert ICU-22112, untailoring root word break

This brings the colon back into MidLetter (with no tailoring on top of the UCD), instead of its inclusion in MidLetter being an fi & sv tailoring.

Checklist

[x] Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22941
[x] Required: The PR title must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
[x] Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
[x] Required: Each commit message must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
[x] Issue accepted (done by Technical Committee after discussion)
[x] Tests included, if applicable
[ ] API docs and/or User Guide docs changed or added, if applicable

Oct 21 '24 13:10 eggrobin

@markusicu, I am running into the same problem as in https://github.com/unicode-org/icu/pull/3028#issuecomment-2233615599: the documentation only tells me how to regenerate the old icudata.jar, not the ICU4J .brk files.

Can you perform the same thaumaturgy you did in 169023a489bd6bac215b081565f8c6a791e58f2a? (At some point it would be good to update the documentation, too…)

Oct 21 '24 14:10 eggrobin

@markusicu post-UTW poke

Oct 25 '24 17:10 eggrobin

@markusicu in fcd04fc67b63d23e9be34803d734df18ceb44868 I tried copying over the .brk files that get generated when I rebuild on my machine; these don’t seem to work:

java.lang.AssertionError
	at com.ibm.icu.impl.ICUBinary.readHeader(ICUBinary.java:574)
	at com.ibm.icu.impl.RBBIDataWrapper.get(RBBIDataWrapper.java:295)

But genbrk.cpp does not seem to have any options for output format, so I don’t understand how I can be generating brk files that work for ICU4C but not for ICU4J.

Nov 04 '24 21:11 eggrobin

I just fetched your branch and ran the rbbi tests in Eclipse. 66 tests, 2 failed. So 64 tests worked :-)

The code fails to load "brkitr/word_fi_sv.brk". ICUBinary.getData() tries to find it two ways but ends up returning null because it's not there, and it doesn't throw an exception because the caller didn't ask for it. This is a bug --> ICU-22960

I see that you updated that .brk file, but I don't see why ICU can't find it :-(

Nov 05 '24 03:11 markusicu

Oh, wait, you are deleting that file...

Nov 05 '24 03:11 markusicu

I refreshed all of the ICU4J data on my Linux box. It still fails for me in Eclipse because it still tries to load the word_fi_sv file. I don't see where it still has that registered. Pushing my files to your branch in the hope that my Eclipse is just wedged...

Nov 05 '24 03:11 markusicu

If this works, then I suspect that updating the res_index.res file did the trick. The failing BreakIteratorTest.TestT5615() loops over all locales; if there is anything anywhere leftover that refers to the old file then we have a problem.

Nov 05 '24 03:11 markusicu

I got it to work locally. You deleted the ICU4C brkitr/fi.txt & sv.txt files, but the repo still had the ICU4J .res versions. So when asked for Finnish word breaks, it found & loaded fi.res which referred to the deleted word_fi_sv.res file.

Hopefully this is it.

Nov 05 '24 04:11 markusicu

Bingo! 🎉 Please squash all but the first commit before you get ready to merge.

Nov 05 '24 05:11 markusicu

Notice: the branch changed across the force-push!

icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/icudata/brkitr/res_index.res is no longer changed in the branch
icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/icudata/brkitr/word_POSIX.brk is different
icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/icudata/brkitr/word.brk is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Nov 05 '24 13:11 jira-pull-request-webhook[bot]

Thanks a lot. I squashed it all, dropping ca87360595265dc60c9fa11917cbfdb1872e76a7 (but keeping 081efefddb54f072e21f1ecf85fe1c15cd2aa464) to see if the brk files I generated were fine after all—it looks like it works, so I shouldn’t need to ask you to turn the crank next time.

Nov 05 '24 15:11 eggrobin

@markusicu As discussed over virtual tea, you might still want to flip the bytes.

Nov 05 '24 17:11 eggrobin

I regenerated the data and pushed the updated files. When the tests pass, please squash again.

Explanation for others: We want to keep the Java data in big-endian format, so that different people generating the data don't flip-flop on no-op data changes, and wonder why what they are doing affects BreakIterator data files.

Nov 05 '24 20:11 markusicu

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

Nov 05 '24 21:11 jira-pull-request-webhook[bot]

please squash again

Squished.

Nov 05 '24 21:11 eggrobin