icu icon indicating copy to clipboard operation
icu copied to clipboard

ICU-22661 Limit the size of variants in Locale

Open FrankYFTang opened this issue 1 year ago • 13 comments

Limit the size of acceptable variants in Locale to avoid long running time of resource fallback look up when trying all different combinations of many variants.

I first try to limit to the max size of 10 variants but some test break, so I change it to 20 now.

Also ensure the Locale::getVariant will not return invalid address while the locale is bogus.

Checklist
  • [X] Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22661
  • [X] Required: The PR title must be prefixed with a JIRA Issue number.
  • [X] Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
  • [X] Required: Each commit message must be prefixed with a JIRA Issue number.
  • [x] Issue accepted (done by Technical Committee after discussion)
  • [X] Tests included, if applicable
  • [ ] API docs and/or User Guide docs changed or added, if applicable

FrankYFTang avatar Feb 08 '24 22:02 FrankYFTang

Another way to attack this, that would be much faster, is not try resource fallback for every one of the variants in a locale.

Instead, do a lazy build of all the variant combinations that actually occur in the resource bundles, eg

0: {posix} 1: {posix valencia} 2: {foo}

When doing resource fallback, cycle through these (in the example, 0..2) and only try the ones that are contained within the set of variants. That will typically be none for oddball variants, and (very) occasionally one or two of the combinations.

macchiati avatar Feb 08 '24 23:02 macchiati

Another way to attack this, that would be much faster, is not try resource fallback for every one of the variants in a locale.

This seems like a very good idea to me.

roubert avatar Feb 09 '24 14:02 roubert

Another way to attack this, that would be much faster, is not try resource fallback for every one of the variants in a locale.

Instead, do a lazy build of all the variant combinations that actually occur in the resource bundles, eg

That sounds good, but runs into old complications with enumerating the set of available locales / bundles.

Enumerating is relatively easy when they are in a single file or place, but we let users provide data in one or more .dat files and/or bunches of single files in the file system (with different OS APIs) and/or linked in with one or another type of TOC... and tangled up with service registration caches. In C++ and Java.

This would be really good to figure out, but shouldn't block other progress.

markusicu avatar Feb 09 '24 17:02 markusicu

I first try to limit to the max size of 10 variants but some test break, so I change it to 20 now.

If we think that 10 is already plenty, then it should be ok to reduce the number of variants in unit tests.

markusicu avatar Feb 09 '24 17:02 markusicu

This would be really good to figure out, but shouldn't block other progress.

OK, if putting limits on the length and number of variants is the best path forward, let's figure out how to best do that.

First of all, I'd like to have a unit test for this that fails if language tags with too long or too many variants are accepted, so that we'll see immediately if some future change accidentally breaks the checking for length and number.

Then I find Locale::initBaseName() to be a rather odd place for this logic, far too late in the processing. It should at least be moved up the chain to Locale::init() where variantBegin is set. But I wonder if it wouldn't be even better to move the logic all the way up to _getVariant() in uloc.cpp so that the same limits are enforced for all of ICU4C and not just for code that uses class Locale.

roubert avatar Feb 12 '24 17:02 roubert

Notice: the branch changed across the force-push!

  • icu4c/source/common/locid.cpp is no longer changed in the branch
  • icu4c/source/common/uloc.cpp is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

First of all, I'd like to have a unit test for this that fails if language tags with too long or too many variants are accepted, so that we'll see immediately if some future change accidentally breaks the checking for length and number.

see testLongVariants

FrankYFTang avatar Feb 14 '24 01:02 FrankYFTang

PTAL

FrankYFTang avatar Feb 14 '24 01:02 FrankYFTang

First of all, I'd like to have a unit test for this that fails if language tags with too long or too many variants are accepted, so that we'll see immediately if some future change accidentally breaks the checking for length and number.

see testLongVariants

The problem with that test is that it doesn't fail if a language tags with overlong variants gets accepted, it just runs for a long time and then passes anyway. That's not helpful to verify that the logic for rejecting overlong variants works. I'd like to have a test that fails if the detection of overlong variants fails.

Such a test could be very simple:

  1. Verify that a language tag with the maximum length of variants is accepted.
  2. Verify that a language tag with variants one character longer than the maximum is rejected.

roubert avatar Feb 14 '24 13:02 roubert

First of all, I'd like to have a unit test for this that fails if language tags with too long or too many variants are accepted, so that we'll see immediately if some future change accidentally breaks the checking for length and number.

see testLongVariants

The problem with that test is that it doesn't fail if a language tags with overlong variants gets accepted, it just runs for a long time and then passes anyway. That's not helpful to verify that the logic for rejecting overlong variants works. I'd like to have a test that fails if the detection of overlong variants fails.

Such a test could be very simple:

  1. Verify that a language tag with the maximum length of variants is accepted.
  2. Verify that a language tag with variants one character longer than the maximum is rejected.

changed . Added those tests. Also add Java test which will show stack overflow if not including the fix. PTAL

FrankYFTang avatar Feb 15 '24 22:02 FrankYFTang

While I think the updated test is much better as it now will fail if the rejection of overlong variants fails, I still don't understand why you don't want to have a simple test that just tests this directly. If the rejection of overlong variants fails, your updated test runs for an entire minute on my workstation before it eventually fails. Isn't our CI already slow enough as it is?

Please take a look at the unit tests that I just added, which test the C API and C++ API for locale ID parsing directly and verify that parsing of variants of the maximum allowed length succeeds and that parsing of variants one character longer than that fails. The code is perfectly trivial and runs in a fraction of a second.

What advantage is it that you see in instead going through creating DateTimePatternGenerator objects and having loops that build very large strings? I don't understand it.

roubert avatar Feb 16 '24 18:02 roubert

What advantage is it that you see in instead going through creating DateTimePatternGenerator objects and having loops that build very large strings? I don't understand it.

that is what the test case in https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=65761 trigger the problem. I would like to see this fix indeed fix that particular case

FrankYFTang avatar Feb 21 '24 22:02 FrankYFTang

Please take a look at the unit tests that I just added, which test the C API and C++ API for locale ID parsing directly and verify that parsing of variants of the maximum allowed length succeeds and that parsing of variants one character longer than that fails.

I think what you added are good, and we should include them. But that test does not verify the hang in https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=65761 no longer hang.

FrankYFTang avatar Feb 21 '24 23:02 FrankYFTang

But that test does not verify the hang in https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=65761 no longer hang.

I still don't get it. The reason that DateTimePatternGenerator::createInstance() runs for a very long time (it doesn't actually hang, if not terminated by a timeout it'd eventually finish) is not because of anything in DateTimePatternGenerator itself but in the underlying resource loading, which is used also in many other places around ICU4C.

If you want to test that ures_getFunctionalEquivalent(), the resource loading function that currently runs for a very long time when given a locale ID with many variants, no longer will try loading resources for too many variants, then I think the natural thing would be a unit test for that particular function, not for any randomly selected user facing function that eventually calls it somewhere within its implementation.

Please take a look at the unit test that I just added, which does just that. The code is perfectly trivial and runs in a fraction of a second.

What is it that you think that such a unit test fails to cover?

As for the loops that you propose to use to build very large strings, what is the advantage of having those? The testLongVariants() test case currently in this PR will triple the total running time of the entire intltest test suite if the code that's supposed to abort when there are too many variants fails to do that, and that really doesn't seem like a reasonable increase in running time to me.

roubert avatar Mar 04 '24 17:03 roubert

I still don't get it. The reason that DateTimePatternGenerator::createInstance() runs for a very long time (it doesn't actually hang, if not terminated by a timeout it'd eventually finish) is not because of anything in DateTimePatternGenerator itself but in the underlying resource loading, which is used also in many other places around ICU4C.

The only way we first figure out the reason is that run for a very long time is AFTER we debug the test I added, right? So... how could we ever know what "The reason that ... is becaue" ? I want to added test to ensure in the future, if anything else change, it will not introduce any other bug to cause such regression for any other reason. How can we ensure such particular breakge will stay fixed withoug any regression in the next 20 years without adding such a test? It is true that the current fix does fix the problem, but there are no other test to ensure in the future no body will introudce new bugs to cause the same bug unless I add such a test, right? In the future, there could be other reason to cause the same bug by the same test cases if we do not add the regression test, right?

FrankYFTang avatar Mar 04 '24 18:03 FrankYFTang

As for the loops that you propose to use to build very large strings, what is the advantage of having those?

The advantage is it mimic the EXACT failure condition what the fuzzer report in https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=65761 without assuming the possible error could only come from the fix we found.

FrankYFTang avatar Mar 04 '24 18:03 FrankYFTang

The only way we first figure out the reason is that run for a very long time is AFTER we debug the test I added, right? So... how could we ever know what "The reason that ... is becaue" ?

We know the reason because we were alerted to a problem, debugged it, and found the root cause, as always.

How can we ensure such particular breakge will stay fixed withoug any regression in the next 20 years without adding such a test?

We'll find it in the future in exactly the same way as you found it this time, the same fuzzer test that found this problem this time will find any new such problems in the future.

There's a lot of different ICU4C functionality that ends up calling ures_getFunctionalEquivalent() so in your 20-year perspective we could easily imagine all of DateTimePatternGenerator and its tests being deleted, so if testing were to be done through DateTimePatternGenerator that would then leave the rest of the callers without a test. That wouldn't be good and that's why we have unit tests to ensure that each piece works as intended and then fuzzer tests to test every possible input at every possible entry point.

There's nothing special about DateTimePatternGenerator here, it's nothing but a coincidence that this was the entry point through which you found out that there was a problem with ures_getFunctionalEquivalent().

It is true that the current fix does fix the problem, but there are no other test to ensure in the future no body will introudce new bugs to cause the same bug unless I add such a test, right?

That's not right, the fuzzer test that made you aware of this problem this time can be expected to continue to work and to find new such problems if they appear in the future.

In the future, there could be other reason to cause the same bug by the same test cases if we do not add the regression test, right?

That's not right, there is no reason to believe that the fuzzer test will suddenly cease to work.

The advantage is it mimic the EXACT failure condition what the fuzzer report in https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=65761 without assuming the possible error could only come from the fix we found.

OK, then I think I understand what it is that you're trying to achieve, you're trying to replicate some of the fuzzer test functionality in a unit test.

That doesn't seem like a good idea to me. First of all, we already have a working fuzzer test (which, after all, is how you found that there was a problem here), there's no need to replicate or replace that. It would seem excessive to me to have duplicate fuzzer tests for this entry point and I can't see any value in moving this one fuzzer test away from the other fuzzer tests into the unit tests instead. And, in order for your testLongVariants() to really work as a fuzzer test you'd need to introduce some kind of timeout, for as it is now it'll just run for a long time and then eventually pass in case ures_getFunctionalEquivalent() doesn't abort as we want it to.

I think it'd be better to keep unit tests as unit tests and fuzzer tests as fuzzer tests.

roubert avatar Mar 05 '24 12:03 roubert

I think it'd be better to keep unit tests as unit tests and fuzzer tests as fuzzer tests.

ok, I will then remove my origional unit test. that make sense.

FrankYFTang avatar Mar 05 '24 17:03 FrankYFTang

Notice: the branch changed across the force-push!

  • icu4c/source/common/uloc.cpp is different
  • icu4c/source/common/unicode/locid.h is different
  • icu4c/source/test/cintltst/cloctst.c is different
  • icu4c/source/test/intltest/dtptngts.cpp is no longer changed in the branch
  • icu4c/source/test/intltest/dtptngts.h is no longer changed in the branch
  • icu4j/main/collate/src/test/java/com/ibm/icu/dev/test/util/ICUResourceBundleCollationTest.java is now changed in the branch
  • icu4j/main/core/src/test/java/com/ibm/icu/dev/test/format/DateTimeGeneratorTest.java is no longer changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@roubert PTAL. I removed my origional test that you have issue with and port the test you proposed to Java

FrankYFTang avatar Mar 05 '24 21:03 FrankYFTang

and port the test you proposed to Java

You'd want to add a Java unit test also for ULocale, wouldn't you?

roubert avatar Mar 07 '24 11:03 roubert

PTAL

FrankYFTang avatar Mar 07 '24 19:03 FrankYFTang

@roubert PTAL

FrankYFTang avatar Mar 11 '24 19:03 FrankYFTang

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

Notice: the branch changed across the force-push!

  • icu4c/source/common/uloc.cpp is different
  • icu4j/main/core/src/main/java/com/ibm/icu/impl/LocaleIDParser.java is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot