ICU-23264 Add validateAndGet function to MeasureUnit for efficient unit validation
Description
validateAndGetfunction checks if a specific unit is available for a given type and subtype, improving performance by avoiding the need to retrieve all available units.- Updated
parseMeasureUnitOptionto utilize the new function for better error handling when validating measure units.
Checklist
- [x] Required: Issue filed: ICU-23264
- [x] Required: The PR title must be prefixed with a JIRA Issue number. Example: "ICU-NNNNN Fix xyz"
- [x] Required: Each commit message must be prefixed with a JIRA Issue number. Example: "ICU-NNNNN Fix xyz"
- [ ] Issue accepted (done by Technical Committee after discussion)
- [ ] Tests included, if applicable
- [ ] API docs and/or User Guide docs changed or added, if applicable
- [ ] Approver: Feel free to merge on my behalf
Notice: the branch changed across the force-push!
- icu4c/source/i18n/measunit.cpp is different
- icu4c/source/i18n/number_skeletons.cpp is different
- icu4c/source/i18n/unicode/measunit.h is different
~ Your Friendly Jira-GitHub PR Checker Bot
Hooray! The files in the branch are the same across the force-push. 😃
~ Your Friendly Jira-GitHub PR Checker Bot
Hooray! The files in the branch are the same across the force-push. 😃
~ Your Friendly Jira-GitHub PR Checker Bot
Notice: the branch changed across the force-push!
- icu4c/source/i18n/measunit.cpp is different
- icu4c/source/i18n/number_skeletons.cpp is different
- icu4c/source/i18n/unicode/measunit.h is different
~ Your Friendly Jira-GitHub PR Checker Bot
Thanks for jumping on this so quickly!
- The changes lgtm.
thanks
- Please add a unit test that fails without these changes.
Sure, I will add one
- I assume that Java has a similar bug or opportunity?
No, Java does not have the same bug .
The Java implementation uses a fundamentally different approach:
- Java's MeasureUnit.getAvailable(type) returns a Set<MeasureUnit>, not a fixed array
- The skeleton parsing in NumberSkeletonImpl.java (line 1098) uses:
Set<MeasureUnit> units = MeasureUnit.getAvailable(type);
for (MeasureUnit unit : units) {
if (subType.equals(unit.getSubtype())) { ... }
}
- No CAPACITY constant or buffer overflow risks exist in Java
- The inefficiency is O(n) iteration over the Set.
I think Java version has opportuniy to benefit from a direct lookup method like: MeasureUnit.forTypeAndSubtype(type, subtype)
let me see what I can do there
There can be an even better optimization. The Types are completely superfluous for all processing except looking up the unit names in CLDR.
The most efficient implementation in Java would be to have subtypes be enums (eg UnitSubtype). All internal calculations would just use those enums. Using EnumSet for sets of elements, the lookup is O(1), even better than hash lookups. Same for EnumMaps.
Each UnitSubtype would have-a String type and a MeasureUnitImpl measureUnitImpl. A MeasureUnit would have exactly one field, a UnitSubtype. (The only reason for keeping the MessageUnit API is for backwards compatibility.)
The only downside to this is that anytime CLDR adds units, the UnitSubtype would need to be extended. It would be straightforward however, to generate the enums from the CLDR units.xml file when integrating.
CC @sffc
I will be OOO next two weeks, so I need to add an update
@markusicu , I have addressed all of your comments, and still need to adjust some parts of my implementation in the C++ version
@macchiati : Thanks a lot for your suggestions, I will consider them once I return back. FYI, in the java version, the search already is O(1).