icu icon indicating copy to clipboard operation
icu copied to clipboard

ICU-22941 Revert ICU-22112, untailoring root word break

Open eggrobin opened this issue 1 year ago • 1 comments

This brings the colon back into MidLetter (with no tailoring on top of the UCD), instead of its inclusion in MidLetter being an fi & sv tailoring.

Checklist

  • [x] Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22941
  • [x] Required: The PR title must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
  • [x] Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
  • [x] Required: Each commit message must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
  • [x] Issue accepted (done by Technical Committee after discussion)
  • [x] Tests included, if applicable
  • [ ] API docs and/or User Guide docs changed or added, if applicable

eggrobin avatar Oct 21 '24 13:10 eggrobin

@markusicu, I am running into the same problem as in https://github.com/unicode-org/icu/pull/3028#issuecomment-2233615599: the documentation only tells me how to regenerate the old icudata.jar, not the ICU4J .brk files.

Can you perform the same thaumaturgy you did in 169023a489bd6bac215b081565f8c6a791e58f2a? (At some point it would be good to update the documentation, too…)

eggrobin avatar Oct 21 '24 14:10 eggrobin

@markusicu post-UTW poke

eggrobin avatar Oct 25 '24 17:10 eggrobin

@markusicu in fcd04fc67b63d23e9be34803d734df18ceb44868 I tried copying over the .brk files that get generated when I rebuild on my machine; these don’t seem to work:

java.lang.AssertionError
	at com.ibm.icu.impl.ICUBinary.readHeader(ICUBinary.java:574)
	at com.ibm.icu.impl.RBBIDataWrapper.get(RBBIDataWrapper.java:295)

But genbrk.cpp does not seem to have any options for output format, so I don’t understand how I can be generating brk files that work for ICU4C but not for ICU4J.

eggrobin avatar Nov 04 '24 21:11 eggrobin

I just fetched your branch and ran the rbbi tests in Eclipse. 66 tests, 2 failed. So 64 tests worked :-)

The code fails to load "brkitr/word_fi_sv.brk". ICUBinary.getData() tries to find it two ways but ends up returning null because it's not there, and it doesn't throw an exception because the caller didn't ask for it. This is a bug --> ICU-22960

I see that you updated that .brk file, but I don't see why ICU can't find it :-(

markusicu avatar Nov 05 '24 03:11 markusicu

Oh, wait, you are deleting that file...

markusicu avatar Nov 05 '24 03:11 markusicu

I refreshed all of the ICU4J data on my Linux box. It still fails for me in Eclipse because it still tries to load the word_fi_sv file. I don't see where it still has that registered. Pushing my files to your branch in the hope that my Eclipse is just wedged...

markusicu avatar Nov 05 '24 03:11 markusicu

If this works, then I suspect that updating the res_index.res file did the trick. The failing BreakIteratorTest.TestT5615() loops over all locales; if there is anything anywhere leftover that refers to the old file then we have a problem.

markusicu avatar Nov 05 '24 03:11 markusicu

I got it to work locally. You deleted the ICU4C brkitr/fi.txt & sv.txt files, but the repo still had the ICU4J .res versions. So when asked for Finnish word breaks, it found & loaded fi.res which referred to the deleted word_fi_sv.res file.

Hopefully this is it.

markusicu avatar Nov 05 '24 04:11 markusicu

Bingo! 🎉 Please squash all but the first commit before you get ready to merge.

markusicu avatar Nov 05 '24 05:11 markusicu

Notice: the branch changed across the force-push!

  • icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/icudata/brkitr/res_index.res is no longer changed in the branch
  • icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/icudata/brkitr/word_POSIX.brk is different
  • icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/icudata/brkitr/word.brk is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Thanks a lot. I squashed it all, dropping ca87360595265dc60c9fa11917cbfdb1872e76a7 (but keeping 081efefddb54f072e21f1ecf85fe1c15cd2aa464) to see if the brk files I generated were fine after all—it looks like it works, so I shouldn’t need to ask you to turn the crank next time.

eggrobin avatar Nov 05 '24 15:11 eggrobin

@markusicu As discussed over virtual tea, you might still want to flip the bytes.

eggrobin avatar Nov 05 '24 17:11 eggrobin

I regenerated the data and pushed the updated files. When the tests pass, please squash again.

Explanation for others: We want to keep the Java data in big-endian format, so that different people generating the data don't flip-flop on no-op data changes, and wonder why what they are doing affects BreakIterator data files.

markusicu avatar Nov 05 '24 20:11 markusicu

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

please squash again

Squished.

eggrobin avatar Nov 05 '24 21:11 eggrobin