Upgrade UnicodeProperty foundation
Replace the use of old APIs in UCD.java and related files by using the API in the org.unicode.props package instead.
Background
V1: The code in the org.unicode.text.UCD and UCA packages, and related packages, is very old. It was written when Java was very young, and not nearly as powerful as it later became. Performance was also a fraction of what modern computers offer, which necessitated a lot of hacks for speed. It is ugly to maintain and extend; e.g., adding a new property value requires changing some int constants in UCD_Types (the code predated enums) and some string arrays in UCD_Names.
UCD ucd = UCD.make("14.0.0");
byte intEnum = ucd.getBidiClass(codepoint);
String propValue1 = UCD.getBidiClassID_fromIndex(intEnum, UCD_Types.LONG);
String propValue1s = UCD.getBidiClassID_fromIndex(intEnum, UCD_Types.SHORT);
System.out.println("V1: = " + propValue1 + ", " + propValue1s);
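To illustrate the maintenance problem described for V1, here is a minimal sketch contrasting int constants plus parallel name arrays with an enum that carries its own names. All class and constant names here are hypothetical stand-ins, not the actual contents of UCD_Types or UCD_Names.

```java
// Hypothetical names throughout; this is not the real UCD_Types / UCD_Names code.
final class OldStyleSketch {
    // Old style: each value is an int constant...
    static final byte BIDI_L = 0;
    static final byte BIDI_R = 1;
    static final byte BIDI_AL = 2;

    // ...and its names live in separate arrays that must be kept in the same order by hand.
    static final String[] LONG_NAMES = {"Left_To_Right", "Right_To_Left", "Arabic_Letter"};
    static final String[] SHORT_NAMES = {"L", "R", "AL"};

    static String name(byte value, boolean longForm) {
        return (longForm ? LONG_NAMES : SHORT_NAMES)[value];
    }
}

// Enum style: each value carries its own names, so adding a value is a one-line change.
enum BidiClassSketch {
    Left_To_Right("L"),
    Right_To_Left("R"),
    Arabic_Letter("AL");

    private final String shortName;

    BidiClassSketch(String shortName) {
        this.shortName = shortName;
    }

    String getShortName() {
        return shortName;
    }
}
```

In the old style, adding a value means touching three places that can silently drift out of sync; in the enum style the compiler keeps the value and its names together.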
V2: A layer using UnicodeProperty was built on top of this, which made it somewhat easier to write code without knowing as much about the underlying implementation.
ToolUnicodePropertySource propertySource = ToolUnicodePropertySource.make("14.0.0");
UnicodeProperty property2 = propertySource.getProperty("BidiClass");
List<String> propNames2 = property2.getNameAliases();
String propValue2 = property2.getValue(codepoint);
List<String> propValueAliases = property2.getValueAliases(propValue2);
System.out.println("V2: " + propNames2.get(1) + " = " + propValueAliases.get(1) + ", " + propValueAliases.get(0));
V3: Later, the tooling in org.unicode.props was developed. It uses a much more general parsing mechanism (as data-driven as possible) and a more modern API (more type-safe, among other things). It has a few more advantages: it only parses (and caches) the files it needs, and the caches are flushed if any of the data files are more recent.
IndexUnicodeProperties unicodeProperties = IndexUnicodeProperties.make(Age_Values.V14_0); // can also use VersionInfo
UnicodeMap<Bidi_Class_Values> map = unicodeProperties.loadEnum(UcdProperty.Bidi_Class, UcdPropertyValues.Bidi_Class_Values.class);
Bidi_Class_Values propValue3e = map.get(codepoint); // get typesafe map
System.out.println("V3: " + UcdProperty.Bidi_Class + " = " + propValue3e + ", " + propValue3e.getShortName());
However, one can also use it as a UnicodeProperty:
UnicodeProperty property3 = unicodeProperties.getProperty(UcdProperty.Bidi_Class);
String propValue3 = property3.getValue(codepoint);
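The "parse only what is needed, and re-parse when the data file is newer" behavior mentioned for V3 can be sketched roughly as follows. This is an in-memory illustration under assumed names (LazyFileCacheSketch and its members are hypothetical); the real org.unicode.props caching may work quite differently.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the actual IndexUnicodeProperties cache.
final class LazyFileCacheSketch {
    private static final class Entry {
        final FileTime modified;   // file timestamp at the time of parsing
        final List<String> lines;  // stands in for parsed property data

        Entry(FileTime modified, List<String> lines) {
            this.modified = modified;
            this.lines = lines;
        }
    }

    private final Map<Path, Entry> cache = new HashMap<>();
    int parseCount = 0; // exposed only so the demo can observe re-parsing

    // Parses a file on first request, and again only if the file has become newer.
    List<String> get(Path file) {
        try {
            FileTime modified = Files.getLastModifiedTime(file);
            Entry entry = cache.get(file);
            if (entry == null || modified.compareTo(entry.modified) > 0) {
                parseCount++;
                entry = new Entry(modified, Files.readAllLines(file));
                cache.put(file, entry);
            }
            return entry.lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo: reads an unchanged file twice, then bumps its timestamp and reads again.
    // Returns the total number of parses performed.
    static int demo() {
        try {
            Path file = Files.createTempFile("props-sketch", ".txt");
            Files.write(file, Collections.singletonList("0041;L"));
            LazyFileCacheSketch cache = new LazyFileCacheSketch();
            cache.get(file);
            cache.get(file); // served from the cache, no second parse
            long now = Files.getLastModifiedTime(file).toMillis();
            Files.setLastModifiedTime(file, FileTime.fromMillis(now + 10_000));
            cache.get(file); // file is newer than the cached entry, so it is parsed again
            Files.delete(file);
            return cache.parseCount;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```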
To do this task
Hard to tell exactly, but I think the following would work.
Phase1 — get rid of old code
- Change ToolUnicodePropertySource to extend IndexUnicodeProperties
- Remove all the old classes that support the old ToolUnicodePropertySource (UCD, UCD_Names, ...)
- Refactor to remove ToolUnicodePropertySource
Now, this wouldn't change the APIs, so one wouldn't get the advantage of type-safety, etc. But the foundation would be much stronger. One could use those APIs whenever it was convenient.
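The first Phase 1 step (keeping the old entry point but putting it on the new foundation) might look something like the sketch below. Both classes are hypothetical stand-ins with toy data; they do not reflect the real signatures of ToolUnicodePropertySource or IndexUnicodeProperties.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for the newer, data-driven foundation. Hypothetical; not the real IndexUnicodeProperties.
class FoundationSketch {
    private final Map<String, Map<Integer, String>> data = new HashMap<>();

    FoundationSketch() {
        Map<Integer, String> bidi = new HashMap<>();
        bidi.put(0x0041, "Left_To_Right"); // toy entry for 'A'
        bidi.put(0x05D0, "Right_To_Left"); // toy entry for HEBREW LETTER ALEF
        data.put("Bidi_Class", bidi);
    }

    String getValue(String property, int codePoint) {
        Map<Integer, String> values = data.get(property);
        return values == null ? null : values.get(codePoint);
    }
}

// Stand-in for the old entry point. Callers keep using make(version),
// while every lookup is inherited from the new foundation.
class LegacySourceSketch extends FoundationSketch {
    static LegacySourceSketch make(String version) {
        // A real implementation would select data by version; this sketch ignores it.
        return new LegacySourceSketch();
    }
}
```

Once the old class is only a thin subclass like this, call sites can be migrated at leisure and the subclass removed in a later refactoring.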
How to update the "V3" code: https://github.com/unicode-org/unicodetools/blob/main/docs/newunicodeproperties.md
For how to use it, Mark filed issue #200 to add API docs to, e.g., https://github.com/unicode-org/unicodetools/blob/main/unicodetools/src/main/java/org/unicode/props/IndexUnicodeProperties.java
Consider annotating old, undesirable APIs as @deprecated.
Mark started a related doc: “Modernizing UnicodeTools”
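On the deprecation suggestion: in Java this usually means pairing the `@Deprecated` annotation with a `@deprecated` Javadoc tag that names the replacement. A minimal sketch, using hypothetical class and method names rather than actual unicodetools APIs:

```java
// Hypothetical names; shown only to illustrate the annotation plus Javadoc pairing.
enum ValueSketch { Left_To_Right, Right_To_Left }

final class LegacyNamesSketch {
    /**
     * Looks up a value name by its numeric index.
     *
     * @deprecated use {@link #longName(ValueSketch)}, which is type-safe.
     */
    @Deprecated
    static String longNameFromIndex(int index) {
        return ValueSketch.values()[index].name();
    }

    /** Type-safe replacement for the index-based lookup. */
    static String longName(ValueSketch value) {
        return value.name();
    }
}
```

Callers of the deprecated method still compile, but the compiler and IDEs flag each use, which makes remaining call sites easy to find.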
@macchiati would you consider interface UnicodeProperty deprecated with the v3 IndexUnicodeProperties? Your description sounds like that might just be for ease of transition of older code?
I would instead say that UnicodeProperty is what we want to retain, and the UnicodeMap should be an implementation detail (and that we should try to get rid of calls to IndexUnicodeProperties.loadMeow, since the UnicodeProperty is the thing that can do LM3 matching, know about aliases, deal with multivalued properties, return the actual Name rather than something with a #, etc.).
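For context on LM3: rule UAX44-LM3 in UAX #44 says that loose matching of symbolic values ignores case, whitespace, underscores, hyphens, and an initial "is" prefix. A simplified sketch of that rule follows; LooseMatchSketch is a hypothetical helper, not the UnicodeProperty implementation, and it does not handle values that legitimately begin with "is".

```java
// Simplified illustration of UAX44-LM3; hypothetical helper, not the unicodetools code.
final class LooseMatchSketch {
    // Reduces a symbolic name to its loose-matching key.
    static String normalize(String value) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (Character.isWhitespace(c) || c == '_' || c == '-') {
                continue; // ignored by loose matching
            }
            sb.append(Character.toLowerCase(c));
        }
        String key = sb.toString();
        // Drop an initial "is" prefix, unless nothing would be left.
        if (key.startsWith("is") && key.length() > 2) {
            key = key.substring(2);
        }
        return key;
    }

    static boolean looseEquals(String a, String b) {
        return normalize(a).equals(normalize(b));
    }
}
```

With this, `looseEquals("Line_Break", "linebreak")` and `looseEquals("isGreek", "Greek")` both hold, while distinct values such as "Greek" and "Latin" stay distinct.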
We can certainly discuss focusing on the UnicodeProperty API, but I think we would still need to make some additions (and deprecations) to that API. Off the top of my head:
- For codomains that are not simple strings (e.g., sets/lists/enums/...), it would be cleaner and safer to also have APIs that are typesafe. That can be done without exposing the load... APIs.
- The short/long/alternate names for properties and enums are awkward, and we could use the mechanism in props (perhaps with some alterations) to clean that up.
- We could leave around the older APIs for ease of migration, or do a refactoring to get rid of them.
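One possible shape for the typesafe addition suggested in the first bullet is a generic sub-interface that keeps the string-valued method and adds a typed one. Everything below is a hypothetical sketch with toy data; it is not the existing UnicodeProperty API.

```java
// Hypothetical interfaces; not the existing UnicodeProperty API.
interface StringPropertySketch {
    String getValue(int codePoint);
}

// Adds a typed accessor while staying usable wherever the string-valued interface is expected.
interface TypedPropertySketch<T> extends StringPropertySketch {
    T getTypedValue(int codePoint);

    @Override
    default String getValue(int codePoint) {
        T value = getTypedValue(codePoint);
        return value == null ? null : value.toString();
    }
}

enum ToyBidi { L, R }

final class TypedPropertyDemo {
    // Toy rule, not real UCD data: treat the Hebrew block as R and everything else as L.
    static final TypedPropertySketch<ToyBidi> BIDI =
            cp -> (cp >= 0x0590 && cp <= 0x05FF) ? ToyBidi.R : ToyBidi.L;
}
```

Callers that want type safety use `getTypedValue`, existing string-based callers keep using `getValue`, and nothing about how the data is loaded is exposed.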