icu
icu copied to clipboard
ICU-22511 collator hang in comparison
Avoid infinity loop by terminating and returning error if we loop the same for too many times.
Checklist
- [X] Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22511
- [X] Required: The PR title must be prefixed with a JIRA Issue number.
- [X] Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
- [X] Required: Each commit message must be prefixed with a JIRA Issue number.
- [x] Issue accepted (done by Technical Committee after discussion)
- [X] Tests included, if applicable
- [ ] API docs and/or User Guide docs changed or added, if applicable
to test java run mvn verify -Dtest="com.ibm.icu.dev.test.collator.CollationTest" -Dsurefire.failIfNoSpecifiedTests=false -ff
(Googlers- make sure you did the /etc/group trick on your machine to make the build faster first)
Notice: the branch changed across the force-push!
- icu4c/source/i18n/collationcompare.cpp is now changed in the branch
- icu4c/source/test/intltest/collationtest.cpp is different
- icu4j/main/collate/src/main/java/com/ibm/icu/impl/coll/CollationCompare.java is now changed in the branch
- icu4j/main/collate/src/test/java/com/ibm/icu/dev/test/collator/CollationTest.java is different
~ Your Friendly Jira-GitHub PR Checker Bot
Hooray! The files in the branch are the same across the force-push. 😃
~ Your Friendly Jira-GitHub PR Checker Bot
@markusicu friendly ping. this bug was filed Nov 2, 2023 by fuzzer and and the PR is ready for review for a week.
I think both are out of office for Thanksgiving week.
I took a quick look at the code, and I am concerned. It appears to return as equal whenever three times in a row, a primary is equal. But then we would get the clearly incorrect:
aaah = aaagh
Each pass through the main loop should be resetting either the right or left CEs, or both. Because getting a ce advances an internal pointer, that should always terminate. So something fishy is going on.
I suggest that you augment your printout to provide more readable information, since lines like the following are too hard to follow: ce from left 0xa4000500 ce from left 0x2a00000005009c00 ...
I suggest when you print, break down the left and right ce values in to primary, secondary, tertiary, and put tabs between the fields so that this can be loaded into a spreadsheet. Eg something like: ... Lp 2a0000 Ls 0500 Lt 0042 ...
I can't remember, but you make also be able to fetch a code point offset from the ce; if so it would help to include that in the printout.
sorry, I assume you look at the Java code first. My java change is indeed wrong. I should throw there not return equal.
Notice: the branch changed across the force-push!
- icu4j/main/collate/src/main/java/com/ibm/icu/impl/coll/CollationCompare.java is different
- icu4j/main/collate/src/test/java/com/ibm/icu/dev/test/collator/CollationTest.java is different
~ Your Friendly Jira-GitHub PR Checker Bot
Thanks. I'm still a bit worried about the change, because of the aaah case, so I'd appreciate a printout for the test case along the lines I suggested.
I looked at this and didn't understand what it was doing or why. I think I'll just defer to Mark here.
@markusicu ping
ping @markusicu . this is more urgent than other PR because it is now able to hang v8 with very simple script (see the bug for details)