unicodetools Update UAX42 to document which UCDXML fields correspond to UCD properties (UAX44) vs. which are “just data” corresponding to various UCD files

The implementation for 16.0 and earlier versions doesn't differentiate between UCD properties and additional data. One example of this is the <normalization-corrections> section. Should we differentiate between properties and data, and does it make sense to maintain sections like <normalization-corrections>?

Feb 19 '25 00:02 jowilco

FYI @eggrobin re types of Unicode data

Mar 18 '25 23:03 markusicu

Should we differentiate between properties and data

Extremely yes.

I am trying to sort out this mess in https://github.com/unicode-org/unicodetools/pull/1064, and in the process I found https://github.com/unicode-org/properties/issues/390.

does it make sense to maintain sections like <normalization-corrections>?

No opinion, you are the UCDXML user here; do what you think is useful?

Mar 19 '25 01:03 eggrobin

I just wanted to write down some thoughts about UCDXML and the property types.

TR44 defines four status values for properties: Normative, Informative, Contributory, and Provisional. However, I believe that there are technically 10 PropertyStatus values as defined in PropertyStatus.java. PropertyStatus.java does align UcdProperties to sets, but see the comment on line 134. UcdProperties doesn't track the PropertyStatus value, and I don't believe that information is in the UCD files either. Therefore, I don't believe that there is a single source for determining the PropertyStatus values for all UcdProperties.

In theory, I could work around this because I do have a class that stores additional metadata that I need for generating UCDXML: java/org/unicode/xml/UCDPropertyDetail.java. I could easily add PropertyStatus values, with the obvious caveat that I would then have to maintain these values. #1064 might be the solution.
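A minimal sketch of what such metadata could look like, written in Python for brevity rather than in the Java of UCDPropertyDetail.java. The class, field, and function names are hypothetical and do not mirror the actual class. The four status values are the ones TR44 lists, and the example properties and their statuses are the ones cited later in this thread.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    """The four property status values that TR44 describes."""
    NORMATIVE = "Normative"
    INFORMATIVE = "Informative"
    CONTRIBUTORY = "Contributory"
    PROVISIONAL = "Provisional"


@dataclass(frozen=True)
class PropertyDetail:
    """Hypothetical per-property metadata record."""
    name: str
    status: Status


# Hand-maintained table; this is the maintenance cost mentioned above.
DETAILS = [
    PropertyDetail("Age", Status.NORMATIVE),
    PropertyDetail("Bidi_Mirroring_Glyph", Status.INFORMATIVE),
    PropertyDetail("Jamo_Short_Name", Status.CONTRIBUTORY),
    PropertyDetail("kAlternateTotalStrokes", Status.PROVISIONAL),
]


def names_with_status(details, status):
    """Return the names of all properties that carry the given status."""
    return [d.name for d in details if d.status is status]


normative = names_with_status(DETAILS, Status.NORMATIVE)
```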

The next question is how we would deliver this as part of UCDXML. I would assume that the goal is to reduce the file size for the minimum set of attributes that define a code point. There are essentially two sets of deliverables for UCDXML:

The UCDXML files themselves. There are 6 files (GROUPED|FLAT * ALL|NOUNIHAN|UNIHAN).
UAX42 and the corresponding RNC (RelaxNG) schema.

I don't think that the schema would need to change: it should validate all possible properties, and therefore all properties regardless of their PropertyStatus value. I could add PropertyStatus to each attribute in UAX42, but I don't think that is the right location, since UAX42 just describes the structure of the UCDXML files. So the real question is what to do with the UCDXML files. We could add another dimension, going from the existing 6 files to 6 multiplied by potentially (ALL|NORMATIVE|OTHER) or some variant.
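To make the size of that matrix concrete, here is a small sketch that enumerates the combinations. The name pattern and the status subsets are assumptions for illustration only; nothing here is a proposal for actual file names.

```python
from itertools import product

LAYOUTS = ("grouped", "flat")
REPERTOIRES = ("all", "nounihan", "unihan")
# Hypothetical extra dimension discussed above.
STATUS_SUBSETS = ("all", "normative", "other")

# Existing deliverables: 2 layouts x 3 repertoires.
current = [f"ucd.{rep}.{layout}" for layout, rep in product(LAYOUTS, REPERTOIRES)]

# With a status dimension: 2 x 3 x 3 combinations.
extended = [
    f"ucd.{rep}.{layout}.{status}"
    for layout, rep, status in product(LAYOUTS, REPERTOIRES, STATUS_SUBSETS)
]

print(len(current), len(extended))
```

Every additional dimension multiplies the number of files that have to be generated, validated, and published.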

Note that none of this addresses the original question about <normalization-corrections>, which is a Normative property.

Apr 03 '25 15:04 jowilco

Note that none of this addresses the original question about <normalization-corrections>, which is a Normative property.

What makes it a Normative property? Or formally a UCD property at all?

The file is listed as Normative in UAX44, but UAX44 shows no properties defined by that file. Hence there are no names/aliases of associated properties. https://www.unicode.org/reports/tr44/#NormalizationCorrections.txt

The file contains historical information. It lets you find out how a few characters were normalized in earlier versions (very old ones), via a backwards delta. It is not relevant to anyone who just wants to implement Unicode Normalization for versions after 4.0.

Apr 03 '25 21:04 markusicu

@jowilco can you please make a list of files whose non-property data is in UCDXML but is only of historical interest (e.g., will never change again)?

For example:

NormalizationCorrections.txt (never changes after 4.1)
EmojiSources.txt (never changes after 6.0)

These seem like good candidates for removal.

Apr 15 '25 20:04 markusicu

@markusicu Let's start with a review of the elements in the UCDXML files:

<repertoire> - Contains Normative (e.g., Age), Informative (e.g., Bidi_Mirroring_Glyph), Contributory (e.g., Jamo_Short_Name) or Provisional (e.g., kAlternateTotalStrokes) attributes. It also includes the Deprecated and Stabilized attributes listed in the linked sections of TR44. For several Provisional properties that have been deleted or renamed (Special_Case_Condition, Indic_Matra_Category, kIRG_RSIndex, kGB7 (@ v17), kAlternateHanYu, kAlternateJEF, kJHJ, kRSMerged, kAlternateKangXi, kAlternateMorohashi, kWubi), UCDXML might contain these attributes corresponding to the version of Unicode. Note: Attributes are on the <group> and/or <char> depending on whether it is the flat or grouped UCDXML file.
<blocks> - Duplicative data with the blk attribute in the <repertoire> section; however, the values are different: blk = Latin_1_Sup, block = Latin-1 Supplement
<named-sequences> - I assume that this is still maintained.
<normalization-corrections> - no entries since 4.0
<standardized-variants> - I assume that this is still maintained. Includes emoji/text variants.
<cjk-radicals> - I assume that this is still maintained.
<emoji-sources> - Not versioned, but fixed at 6.0.
<do-not-emit> - I assume that this is still maintained.

So, I think that the candidates for removal are:

Deprecated and Stabilized properties
Normalization Corrections
Emoji Sources
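For the two whole-section candidates, the removal can be sketched as follows. The sample document is a hand-written stand-in, not an excerpt from a real UCDXML file, and the function name is hypothetical. Removing Deprecated and Stabilized properties would be a per-attribute filter instead and is not shown here.

```python
import xml.etree.ElementTree as ET

UCD_NS = "http://www.unicode.org/ns/2003/ucd/1.0"

# Illustrative stand-in for a UCDXML file, with empty sections.
SAMPLE = f"""<ucd xmlns="{UCD_NS}">
  <repertoire><char cp="0041" na="LATIN CAPITAL LETTER A"/></repertoire>
  <normalization-corrections/>
  <emoji-sources/>
  <named-sequences/>
</ucd>"""

REMOVED_SECTIONS = ("normalization-corrections", "emoji-sources")


def drop_sections(root, names):
    """Remove the named top-level sections from a parsed <ucd> element."""
    for name in names:
        for section in root.findall(f"{{{UCD_NS}}}{name}"):
            root.remove(section)
    return root


root = drop_sections(ET.fromstring(SAMPLE), REMOVED_SECTIONS)
# Local names of the sections that are left, in document order.
remaining = [child.tag.split("}")[1] for child in root]
```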

We could also consider having a minimal UCDXML file that doesn't include (some of) [Informative/Contributory/Provisional] properties.

Apr 18 '25 17:04 jowilco

<repertoire> - Contains Normative (e.g., Age), Informative (e.g., Bidi_Mirroring_Glyph), Contributory (e.g., Jamo_Short_Name) or Provisional (e.g., kAlternateTotalStrokes) attributes. ...

Right. Contributory and Provisional properties might be debatable, but given what people seem to want from UCDXML, we should keep them.

It also includes the Deprecated and Stabilized attributes listed in the linked sections of TR44.

Properties, not attributes. Very debatable.

<blocks> - Duplicative data with the blk attribute in the <repertoire> section; however, the values are different: blk = Latin_1_Sup, block = Latin-1 Supplement

These are long vs. short aliases, or equivalent strings according to loose matching rules. I seem to remember that Eric added block to provide the block boundaries.
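As a rough illustration of the loose matching part, the sketch below folds case and drops whitespace, underscores, and hyphens before comparing. It is a simplification in the spirit of the UAX44 loose matching rule for symbolic values and omits that rule's handling of an initial "is" prefix. The one-entry alias table is an illustrative stand-in for the data in PropertyValueAliases.txt, and the function names are hypothetical.

```python
def loose_key(value: str) -> str:
    """Fold case and drop whitespace, underscores, and hyphens."""
    return "".join(
        ch for ch in value.casefold() if not (ch.isspace() or ch in "_-")
    )


# Illustrative stand-in: short alias key -> long alias.
SHORT_TO_LONG = {loose_key("Latin_1_Sup"): "Latin_1_Supplement"}


def same_block(a: str, b: str) -> bool:
    """Compare two block names after alias expansion and loose matching."""
    def canonical(value):
        key = loose_key(value)
        return loose_key(SHORT_TO_LONG.get(key, value))
    return canonical(a) == canonical(b)


print(loose_key("Latin-1 Supplement") == loose_key("Latin_1_Supplement"))
print(same_block("Latin_1_Sup", "Latin-1 Supplement"))
```

Loose matching alone equates the Blocks.txt spelling with the long alias; the short alias still needs a table lookup.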

<named-sequences> - I assume that this is still maintained.

Yes. These basically document "characters" that are not given single code points.

<standardized-variants> - I assume that this is still maintained. Includes emoji/text variants.

Yes. Needed for selecting glyph variants, and for not making variation selectors be free-form.

<cjk-radicals> - I assume that this is still maintained.

yes

<do-not-emit> - I assume that this is still maintained.

yes, very much so

So, I think that the candidates for removal are:

Deprecated and Stabilized properties

Normalization Corrections

Emoji Sources

sgtm

We could also consider having a minimal UCDXML file that doesn't include (some of) [Informative/Contributory/Provisional] properties.

I think users of UCDXML should do their own filtering. Otherwise we could have a lot of work providing everyone's favorite subset.
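Such filtering is not much work on the consumer side. A minimal sketch, assuming a flat file where the properties sit on <char> elements; the sample element and the set of attributes to keep are chosen for illustration only. Grouped files also carry attributes on <group>, and the repertoire contains other element types, which this sketch ignores.

```python
import xml.etree.ElementTree as ET

UCD_NS = "http://www.unicode.org/ns/2003/ucd/1.0"

# Illustrative stand-in for a flat UCDXML file with a single code point.
SAMPLE = f"""<ucd xmlns="{UCD_NS}"><repertoire>
  <char cp="0041" age="1.1" na="LATIN CAPITAL LETTER A" gc="Lu"
        blk="ASCII" bmg="" JSN=""/>
</repertoire></ucd>"""

# The consumer's own choice of attributes to keep (an assumption).
KEEP = {"cp", "age", "na", "gc"}


def keep_attributes(root, keep):
    """Delete every attribute on <char> elements that is not in `keep`."""
    for element in root.iter(f"{{{UCD_NS}}}char"):
        for name in list(element.attrib):
            if name not in keep:
                del element.attrib[name]
    return root


root = keep_attributes(ET.fromstring(SAMPLE), KEEP)
char = next(root.iter(f"{{{UCD_NS}}}char"))
```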

Thanks!

Apr 21 '25 22:04 markusicu

@markusicu

Properties, not attributes. Very debatable.

Just FYI, I was using "attributes" as the properties are represented as attributes on a code point or group element.

Apr 21 '25 22:04 jowilco

From

https://github.com/unicode-org/unicode-reports/pull/192

“Partially addresses (this issue) by removing Deprecated properties, Normalization Corrections, and Emoji Sources.”

May 08 '25 18:05 markusicu