CLDR-11155 Test for ST pages with too many rows
CLDR-11155
- [ ] This PR completes the ticket.
ALLOW_MANY_COMMITS=true
I took the code I made to find out the page sizes, and made it into a test. Once the emoji pages are handled, the number of failures will go down. (We can also raise the error threshold or make some exceptions if we want.)
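For readers following along, here is a minimal sketch of the kind of check described above: count the entries that land on each page and report every page over a threshold. It is not the code in TestPathHeader.java. The class name `PageSizeCheck`, the method `findOversizedPages`, and the sample page "Example Page" are hypothetical, and the limit of 600 is just the value mentioned later in this thread.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the real test; names and data here are illustrative only.
class PageSizeCheck {
    /** Assumed threshold; adjust to whatever limit the test should enforce. */
    static final int MAX_ENTRIES = 600;

    /**
     * Takes the page name of each path in a locale, counts entries per page,
     * and returns one message for each page whose count exceeds the limit.
     */
    static List<String> findOversizedPages(String locale, List<String> pageOfEachPath, int limit) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String page : pageOfEachPath) {
            counts.merge(page, 1, Integer::sum);
        }
        List<String> errors = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > limit) {
                errors.add(locale + " " + e.getKey() + " has too many entries: " + e.getValue());
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        // Made-up input: one oversized page and one small page.
        List<String> pages = new ArrayList<>();
        for (int i = 0; i < 740; i++) pages.add("Units Volume");
        for (int i = 0; i < 120; i++) pages.add("Example Page");
        findOversizedPages("ar", pages, MAX_ENTRIES).forEach(System.out::println);
    }
}
```

Raising the error threshold, as suggested above, would only mean passing a larger `limit`; exceptions for specific pages could be handled by skipping their keys before the comparison.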
https://github.com/unicode-org/cldr/pull/3508 is merged, so I'll restart the tests on this one
The test still fails, including:
build: tools/cldr-code/src/test/java/org/unicode/cldr/unittest/TestPathHeader.java#L1707
(TestPathHeader.java:1707) Error: am Characters Symbols3 has too many entries: 684
build: tools/cldr-code/src/test/java/org/unicode/cldr/unittest/TestPathHeader.java#L1707
(TestPathHeader.java:1707) Error: ar Date & Time Fields has too many entries: 705
build: tools/cldr-code/src/test/java/org/unicode/cldr/unittest/TestPathHeader.java#L1707
(TestPathHeader.java:1707) Error: ar Units Volume has too many entries: 740
@macchiati should I keep dividing pages until the test passes with the max being 600? Anyway I was about to start on Volume/Volume2
I think we identified the pages that we wanted to address. At this point we probably want to set the limit a bit higher so we just knock down the bad cases.
For volume, I recommend that we split by the system. Volume Metric and Volume Other.
Some notes:
- From any unit you can get its system, and check it.
- Make a static final UnicodeSystems METRIC = Sets.of(UnitSystem.metric, UnitSystem.metric_adjacent);
- Extract the unit from the path. This is the long unit ID, which has an extra prefix (historic).
- Get a UnitConverter from the SupplementalDataInfo.
- Use getShortId to convert to the short form.
- Use getSourceToSystems to get a map, then use the short ID with it to get the system set.
- If the set of systems intersects with the constant METRIC, then put it in the first batch. Collections.disjoint can be used to test intersection.
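The notes above could be sketched roughly as follows. This is a self-contained illustration, not the CLDR implementation: the `UnitSystem` enum here is a cut-down stand-in, a plain `Map` stands in for whatever getSourceToSystems returns, `shortId` is a simplified substitute for getShortId that assumes the extra prefix is everything up to the first hyphen, and the sample map contents are made up for the example.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; every type and helper here is a hypothetical stand-in.
class VolumePageSplit {
    // Cut-down stand-in for the real unit-system enum.
    enum UnitSystem { metric, metric_adjacent, ussystem, uksystem }

    static final Set<UnitSystem> METRIC =
            Collections.unmodifiableSet(EnumSet.of(UnitSystem.metric, UnitSystem.metric_adjacent));

    // Captures the long unit ID from a path element such as /unit[@type="volume-liter"].
    static final Pattern UNIT_TYPE = Pattern.compile("/unit\\[@type=\"([^\"]+)\"\\]");

    static String longUnitId(String path) {
        Matcher m = UNIT_TYPE.matcher(path);
        return m.find() ? m.group(1) : null;
    }

    // Simplified substitute for getShortId: drops everything up to the first hyphen.
    static String shortId(String longId) {
        int i = longId.indexOf('-');
        return i < 0 ? longId : longId.substring(i + 1);
    }

    // A unit goes to "Volume Metric" if its systems intersect METRIC, else "Volume Other".
    static String pageFor(String path, Map<String, Set<UnitSystem>> sourceToSystems) {
        String longId = longUnitId(path);
        if (longId == null) {
            return "Volume Other";
        }
        Set<UnitSystem> systems =
                sourceToSystems.getOrDefault(shortId(longId), Collections.emptySet());
        // disjoint() is true when the sets share nothing, so "not disjoint" means they intersect.
        return Collections.disjoint(systems, METRIC) ? "Volume Other" : "Volume Metric";
    }

    public static void main(String[] args) {
        // Made-up sample data, not taken from the supplemental data.
        Map<String, Set<UnitSystem>> sourceToSystems = new HashMap<>();
        sourceToSystems.put("liter", EnumSet.of(UnitSystem.metric));
        sourceToSystems.put("gallon", EnumSet.of(UnitSystem.ussystem));

        String prefix = "//ldml/units/unitLength[@type=\"long\"]/unit[@type=\"";
        System.out.println(pageFor(prefix + "volume-liter\"]/displayName", sourceToSystems));
        System.out.println(pageFor(prefix + "volume-gallon\"]/displayName", sourceToSystems));
    }
}
```

Units with no known system fall through to "Volume Other" in this sketch; whether that is the right default is a judgment call for the real change.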
> Volume Metric and Volume Other
OK! I was in the middle of another approach, trying to split about halfway down, starting Volume2 with "Gallon", but what you suggest sounds more meaningful.
@btangmu Please integrate this in, setting the error size limit to a value that works, and incorporating into your work. You can reset the error size limit down after further work.
@macchiati This was the basis for #3563 and #3573 -- now redundant; OK to close?