icu icon indicating copy to clipboard operation
icu copied to clipboard

ICU-22511 collator hang in comparison

Open FrankYFTang opened this issue 2 years ago • 16 comments

Avoid infinity loop by terminating and returning error if we loop the same for too many times.

Checklist
  • [X] Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22511
  • [X] Required: The PR title must be prefixed with a JIRA Issue number.
  • [X] Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
  • [X] Required: Each commit message must be prefixed with a JIRA Issue number.
  • [x] Issue accepted (done by Technical Committee after discussion)
  • [X] Tests included, if applicable
  • [ ] API docs and/or User Guide docs changed or added, if applicable

FrankYFTang avatar Oct 17 '23 23:10 FrankYFTang

to test java run mvn verify -Dtest="com.ibm.icu.dev.test.collator.CollationTest" -Dsurefire.failIfNoSpecifiedTests=false -ff

(Googlers- make sure you did the /etc/group trick on your machine to make the build faster first)

FrankYFTang avatar Oct 18 '23 18:10 FrankYFTang

Notice: the branch changed across the force-push!

  • icu4c/source/i18n/collationcompare.cpp is now changed in the branch
  • icu4c/source/test/intltest/collationtest.cpp is different
  • icu4j/main/collate/src/main/java/com/ibm/icu/impl/coll/CollationCompare.java is now changed in the branch
  • icu4j/main/collate/src/test/java/com/ibm/icu/dev/test/collator/CollationTest.java is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@markusicu friendly ping. this bug was filed Nov 2, 2023 by fuzzer and and the PR is ready for review for a week.

FrankYFTang avatar Nov 20 '23 18:11 FrankYFTang

I think both are out of office for Thanksgiving week.

macchiati avatar Nov 20 '23 23:11 macchiati

I took a quick look at the code, and I am concerned. It appears to return as equal whenever three times in a row, a primary is equal. But then we would get the clearly incorrect:

aaah = aaagh

Each pass through the main loop should be resetting either the right or left CEs, or both. Because getting a ce advances an internal pointer, that should always terminate. So something fishy is going on.

I suggest that you augment your printout to provide more readable information, since lines like the following are too hard to follow: ce from left 0xa4000500 ce from left 0x2a00000005009c00 ...

I suggest when you print, break down the left and right ce values in to primary, secondary, tertiary, and put tabs between the fields so that this can be loaded into a spreadsheet. Eg something like: ... Lp 2a0000 Ls 0500 Lt 0042 ...

I can't remember, but you make also be able to fetch a code point offset from the ce; if so it would help to include that in the printout.

macchiati avatar Nov 21 '23 00:11 macchiati

sorry, I assume you look at the Java code first. My java change is indeed wrong. I should throw there not return equal.

FrankYFTang avatar Nov 21 '23 00:11 FrankYFTang

Notice: the branch changed across the force-push!

  • icu4j/main/collate/src/main/java/com/ibm/icu/impl/coll/CollationCompare.java is different
  • icu4j/main/collate/src/test/java/com/ibm/icu/dev/test/collator/CollationTest.java is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Thanks. I'm still a bit worried about the change, because of the aaah case, so I'd appreciate a printout for the test case along the lines I suggested.

macchiati avatar Nov 21 '23 01:11 macchiati

I looked at this and didn't understand what it was doing or why. I think I'll just defer to Mark here.

richgillam avatar Nov 28 '23 23:11 richgillam

@markusicu ping

FrankYFTang avatar Dec 08 '23 22:12 FrankYFTang

ping @markusicu . this is more urgent than other PR because it is now able to hang v8 with very simple script (see the bug for details)

FrankYFTang avatar Dec 18 '23 18:12 FrankYFTang

Somehow the bug is fixed between https://github.com/unicode-org/icu/compare/788b89321454f3eeb20782032c8444aabba42417...699fb1dbc4cfbae6f78ff0b28570f44a20a7b149

per fuzzer. Obsolete this PR

FrankYFTang avatar May 28 '24 17:05 FrankYFTang

reopen ok, the fuzzer is verify the fixed just because we increase the number of supported locales and change the testing locale to a different one. This mean later it will find another data to cause the same parameter to call into ICU and reproduce the problem later. The problem is not really fixed. If we try the tests in this PR it is still broken.

FrankYFTang avatar May 28 '24 17:05 FrankYFTang

Notice: the branch changed across the force-push!

  • icu4c/source/test/fuzzer/collator_compare_fuzzer.cpp is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

The fix is not good enough.

FrankYFTang avatar Jun 26 '24 18:06 FrankYFTang