Propertiness

Open eggrobin opened this issue 11 months ago • 6 comments

A classification of properties derived either from presence in PropertyAliases or from a field that we are forced to fill in in ExtraPropertyAliases (in contrast to PropertyStatus.java, which is out of date).

In character.jsp, split the information into (UCD properties, non-UCD properties, UCD non-properties, non-UCD non-properties), with a further split for Unihan (out of UCD properties and after UCD non-properties). See it in staging:

  • https://unicode-jsps-staging-o2ookmn2oq-uc.a.run.app/UnicodeJsps/character.jsp?a=A7FE
  • https://unicode-jsps-staging-o2ookmn2oq-uc.a.run.app/UnicodeJsps/character.jsp?a=3400
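
The four-way split above can be sketched as follows, assuming each piece of information is tagged with two independent booleans: whether it comes from the UCD, and whether it is a property. The class, enum, and method names are hypothetical and are not taken from unicodetools.

```java
/** Hypothetical sketch; these names are illustrative, not the unicodetools API. */
final class PropertinessSketch {
    /** The four buckets, in the order they are listed for character.jsp. */
    enum Bucket {
        UCD_PROPERTY,
        NON_UCD_PROPERTY,
        UCD_NON_PROPERTY,
        NON_UCD_NON_PROPERTY
    }

    /** Classifies an item from two independent facts about it. */
    static Bucket classify(boolean inUcd, boolean isProperty) {
        if (isProperty) {
            return inUcd ? Bucket.UCD_PROPERTY : Bucket.NON_UCD_PROPERTY;
        }
        return inUcd ? Bucket.UCD_NON_PROPERTY : Bucket.NON_UCD_NON_PROPERTY;
    }

    public static void main(String[] args) {
        // Print the bucket for every combination of the two flags.
        for (boolean inUcd : new boolean[] {true, false}) {
            for (boolean isProperty : new boolean[] {true, false}) {
                System.out.println(
                        "inUcd=" + inUcd + ", isProperty=" + isProperty
                                + " -> " + classify(inUcd, isProperty));
            }
        }
    }
}
```

The further Unihan split is not modelled here; it would need an extra flag or a separate grouping step.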

eggrobin avatar Mar 18 '25 21:03 eggrobin

I'd suggest that the top be the properties on https://www.unicode.org/reports/tr18/#RL2.7, perhaps with those groupings.

Put all Contributory and Provisional into a separate bucket.

Not sure what the parens are for, as in (kEH_Core)

Some values don't have links, eg "Obsolete"

Identifier_Status Restricted Identifier_Type Obsolete

If you are going to have a bucket Non-UCD properties for U+A7FE, then add confusable, emoji, ...

Will look it over more tomorrow.

macchiati avatar Mar 19 '25 02:03 macchiati

> Not sure what the parens are for

They mark Provisional properties; see the heading "Normative, Informative, Contributory, and (Provisional) UCD properties".

> I'd suggest that the top be the properties on https://www.unicode.org/reports/tr18/#RL2.7, perhaps with those groupings.

Finer property status (splitting out Contributory etc.) and groupings would be nice, but we do not have a maintainable way of keeping track of it so far (there was an attempt with PropertyStatus.java, but as noted in the PR description, that did not work). Here I am instead doing what I can based on what we are forced to maintain, namely *PropertyAliases.txt.
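
As a rough illustration of deriving "is this a UCD property" from what is already maintained, the sketch below collects long property names from lines in the `short ; long` layout used by PropertyAliases.txt and treats presence there as the criterion. The class and method names are hypothetical, and this is not how unicodetools actually reads the file.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch; not the unicodetools implementation. */
final class AliasFileSketch {
    /** Collects the long names (second field) from PropertyAliases-style lines. */
    static Set<String> longNames(List<String> lines) {
        Set<String> names = new LinkedHashSet<>();
        for (String line : lines) {
            // Drop comments, which start at '#', then surrounding whitespace.
            int hash = line.indexOf('#');
            String content = (hash >= 0 ? line.substring(0, hash) : line).trim();
            if (content.isEmpty()) {
                continue;
            }
            // Fields are separated by ';': short alias, long name, further aliases.
            String[] fields = content.split(";");
            if (fields.length >= 2) {
                names.add(fields[1].trim());
            }
        }
        return names;
    }

    /** Assumption: a property counts as a UCD property if the alias file lists it. */
    static boolean isUcdProperty(String longName, Set<String> ucdLongNames) {
        return ucdLongNames.contains(longName);
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
                "# sample lines in the PropertyAliases.txt layout",
                "gc ; General_Category",
                "sc ; Script");
        Set<String> ucd = longNames(sample);
        System.out.println(isUcdProperty("Script", ucd));
        System.out.println(isUcdProperty("Identifier_Type", ucd));
    }
}
```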

> Some values don't have links, eg "Obsolete"

Yes, that is because it is multivalued, see https://github.com/unicode-org/unicodetools/pull/1018 item 2.

> If you are going to have a bucket Non-UCD properties for U+A7FE, then add confusable, emoji, ...

Confusable is there; it goes into Non-UCD non-properties (Other information). The Identifier_* stuff is what UTS39 actually describes as a property.

RGI_Emoji (but not RGI_Emoji_*_Sequence) should be there because it is described as a property in UTS51, but isn’t because it is hacked directly into the JSPs instead of being in IndexUnicodeProperties; I will add it later, see the TODOs in ExtraPropertyAliases.

eggrobin avatar Mar 19 '25 02:03 eggrobin

> Here I am instead doing what I can based on what we are forced to maintain, namely *PropertyAliases.txt.

Note that beyond the cosmetics of grouping character.jsp, we actually want to keep track of the « is this a UCD property » information, see #1049.

eggrobin avatar Mar 19 '25 02:03 eggrobin

Note: I tried splitting Provisional out from Normative+Informative, but having them in two blocks seemed counterproductive for Unihan and Unikemet (the only places where we have Informative properties); hence the parentheses approach.

eggrobin avatar Mar 19 '25 02:03 eggrobin

> RGI_Emoji (but not RGI_Emoji_*_Sequence) should be there because it is described as a property in UTS51

Ah, never mind: I see that UTS51 also describes the RGI_Emoji_*_Sequence zoo as properties. I’ll fix that.

eggrobin avatar Mar 19 '25 02:03 eggrobin

As noted in the TODOs, I’d like to move RGI_Emoji and IDNA2008_Category into IndexUnicodeProperties (rather than being patched into the JSPs), and to add RGI_Emoji_Qualification, all of these being NonUcdProperty.

But I will do that in a subsequent PR.

eggrobin avatar Mar 21 '25 16:03 eggrobin

@markusicu Friendly ping, since I think some of @jowilco’s work is blocked on this.

eggrobin avatar Apr 08 '25 12:04 eggrobin