Bridge the gap between `icu::properties::Script` and `icu::locale::subtags::Script`

Open robertbastian opened this issue 1 year ago • 12 comments

Currently, locale directionality (for example) is defined on the locid type, so clients that only have a properties script cannot use this API. Example: https://github.com/googlefonts/fontc/blob/9466281d982d49a2b7b19845e5f13d720c1682d4/fontbe/src/features/properties.rs#L65

robertbastian avatar Mar 19 '24 10:03 robertbastian

There are a few differences, the best known being that Hani is a property script, with Hans and Hant as the corresponding subtag scripts.

Probably worth adding a conversion anyway, though, with the caveats listed.
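
A minimal sketch of what a "conversion with caveats" could look like. The `PropertyScript` and `SubtagScript` types, both functions, and the three-row table are hypothetical stand-ins rather than ICU4X's actual API, and the numeric values are only illustrative. It assumes the lossy direction folds `Hans` and `Hant` into `Hani`.

```rust
/// Hypothetical stand-in for a script property value (numeric, open enum).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct PropertyScript(pub u16);

/// Hypothetical stand-in for a script subtag (four ASCII letters).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct SubtagScript(pub [u8; 4]);

/// Illustrative excerpt of a numeric-value <-> short-code table.
/// The numbers are placeholders for this sketch.
const TABLE: &[(u16, &[u8; 4])] = &[
    (2, b"Arab"),
    (17, b"Hani"),
    (25, b"Latn"),
];

/// Property value -> subtag. Returns None for values missing from the table.
pub fn property_to_subtag(p: PropertyScript) -> Option<SubtagScript> {
    TABLE
        .iter()
        .find(|entry| entry.0 == p.0)
        .map(|entry| SubtagScript(*entry.1))
}

/// Subtag -> property value. Caveat: Hans and Hant are folded into Hani,
/// so this direction is lossy. Returns None for subtags missing from the table.
pub fn subtag_to_property(s: SubtagScript) -> Option<PropertyScript> {
    let code: [u8; 4] = match &s.0 {
        b"Hans" | b"Hant" => *b"Hani",
        _ => s.0,
    };
    TABLE
        .iter()
        .find(|entry| *entry.1 == code)
        .map(|entry| PropertyScript(entry.0))
}
```

Because of the folding, a round trip starting from `Hans` would come back as `Hani`, which is one of the caveats a real conversion would need to document.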

sffc avatar Mar 19 '24 19:03 sffc

CC our SAH and PAG fellows, @Manishearth @eggrobin @markusicu

sffc avatar Mar 19 '24 19:03 sffc

I think "conversion with caveats" might be fine. Yes, they represent different things.

Manishearth avatar Mar 21 '24 07:03 Manishearth

Conversion is probably fine, but in the end they are just script codes, so it also makes sense to define the full set once and have Unicode APIs use a subset of the values.

The ones in the UCD are a subset of the full set.

And only the ones in the UCD have Unicode-defined long value names (identifiers).

markusicu avatar Mar 23 '24 00:03 markusicu

Our name lookups are already fallible, so they could just return None on non-UCD scripts.
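
To illustrate the idea, here is a minimal sketch of a fallible long-name lookup over a single set of script codes. `ScriptCode`, `long_name`, and the three-row table are hypothetical names for this example, not existing ICU4X items; the table is a tiny excerpt of the long names in PropertyValueAliases.txt.

```rust
/// Hypothetical four-letter script code covering the full ISO 15924 set.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ScriptCode(pub [u8; 4]);

/// Illustrative excerpt: codes that have a Unicode-defined long value name.
const UCD_LONG_NAMES: &[([u8; 4], &str)] = &[
    (*b"Arab", "Arabic"),
    (*b"Hani", "Han"),
    (*b"Latn", "Latin"),
];

impl ScriptCode {
    /// Returns the long value name, or None if the code is not in the table
    /// (for example, a script code that the UCD does not define).
    pub fn long_name(self) -> Option<&'static str> {
        UCD_LONG_NAMES
            .iter()
            .find(|entry| entry.0 == self.0)
            .map(|entry| entry.1)
    }
}
```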

robertbastian avatar Mar 25 '24 10:03 robertbastian

Is this table available in data, or do we need to create it from the spec/Wikipedia?

robertbastian avatar Mar 25 '24 11:03 robertbastian

See https://unicode.org/iso15924/iso15924.txt, linked from https://unicode.org/iso15924/codelists.html.

The PVA column is from https://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt.

eggrobin avatar Mar 25 '24 11:03 eggrobin

Also see https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry (look for Type: script),

which becomes this in CLDR: https://github.com/unicode-org/cldr/blob/main/common/validity/script.xml

Note that the CLDR list includes one or more private use script subtags:

  • https://www.unicode.org/reports/tr35/#unicode_script_subtag_validity
  • https://www.unicode.org/reports/tr35/#Private_Use_Codes

Qaag is current but yucky... Don't include Qaai, which has become an alias for Zinh.
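
One way to handle the alias when building such a list is to canonicalize codes before storing them. This is only a sketch; `canonical_script` is a hypothetical helper, and it covers just the one alias mentioned here.

```rust
/// Maps a deprecated alias to its canonical four-letter script code.
/// Only the Qaai -> Zinh alias is handled in this sketch; every other
/// code (including the private-use Qaag) is returned unchanged.
pub fn canonical_script(code: [u8; 4]) -> [u8; 4] {
    match &code {
        b"Qaai" => *b"Zinh",
        _ => code,
    }
}
```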

markusicu avatar Mar 25 '24 22:03 markusicu

  • @sffc - We already have a table for mapping the script numeric ID to the string ID. It's called the enum to TinyStr4 name mapper.
  • @robertbastian - Can we make the properties::Script type use the TinyStr4-underlying repr internally?
  • @sffc - That would require loading the table all the time, as well as an indirection to load things from the table. It seems unnecessary. Not all clients need or want the TinyStr4 representation.
  • @robertbastian - I don't think we should make clients convert between the two script types on their end. We should choose a single representation. The only value I see in exposing the u16s is interaction with ICU4C, which I'm not even sure is a use case.
  • @sffc - It's a rough corner of the API, but I think we should err on the side of modularity. We can make the conversion functions be nicer to use.
  • @echeran - LocaleDirectionality spans two crates, which is weird.
  • @sffc - Both of these "script" types are context-specific. Are these even things we want to promote to be the canonical representation of a script?
  • @zbraniecki - The icu_locid representation is definitely designed as a script subtag.
  • @robertbastian - Unless we include the whole IANA registry it's always going to be open. Also properties only ever return scripts, it's currently an open enum already.
  • @sffc - My position is still that what we have now is the most modular, efficient solution. We shouldn't deviate from that. We can make nice conversion functions, even From and Into gated on #[cfg(feature = "compiled_data")]. But I don't think we should force all users to use the slower, bigger code path. I don't think the motivation is compelling enough for that.
propnames/to/short/linear4/sc@1, und, 802B, 55c3455e15d1d2ae
  • @robertbastian - My original proposal was to use the tinystr representation in the data structs themselves, this way there is no conversion cost. It will make the CPT slightly bigger, but this could be offset by conversion code size even.
  • @sffc - Maybe that would work. It would change the value size from 2 bytes to 4, which overall is less data than the additional lookup table. However, we would lose the ability to return the ICU4C enumerated integers, which I believe is something we should support so we can be a drop-in replacement.
  • @robertbastian - I think that can be added modularly; ICU4C compatibility is not universally required.
  • @sffc - It would be easiest and most self-consistent to just keep all of the properties APIs returning integers.

No conclusion yet.
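
For readers following the trade-off discussed above, here is a rough sketch of the two representations. The table contents, `short_name`, and `value_sizes` are hypothetical and only illustrate the shape of a linear numeric-ID-to-short-name mapper; they are not the actual ICU4X data structures.

```rust
use std::mem::size_of;

/// Hypothetical linear name table: the index is the numeric property value
/// and the entry is the four-letter short name. Only three rows are shown,
/// and the index assignments are illustrative.
const SHORT_NAMES: [[u8; 4]; 3] = [*b"Zyyy", *b"Zinh", *b"Arab"];

/// Numeric value -> short name. This needs the table to be loaded and costs
/// one extra lookup, which is the indirection mentioned in the notes.
pub fn short_name(value: u16) -> Option<[u8; 4]> {
    SHORT_NAMES.get(usize::from(value)).copied()
}

/// Per-value storage cost in bytes of the numeric representation and of the
/// four-letter representation, in that order.
pub fn value_sizes() -> (usize, usize) {
    (size_of::<u16>(), size_of::<[u8; 4]>())
}
```

Storing the four-letter code directly would remove the lookup but double the size of each stored value, from 2 bytes to 4, which matches the size change mentioned in the notes.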

sffc avatar May 02 '24 18:05 sffc

Separately from the discussion of performance, I do think these are two different kinds of things and I would overall prefer us to have an explicit separation of types even if it may be annoying.

I don't think this is an ICU4C compat thing as much as it is a property thing. We implement the Unicode standard, which has specific property values for this property; even if our numbers were different, I'd still want us to use an open enum here rather than strings.

Manishearth avatar May 02 '24 19:05 Manishearth

I would argue that the concept of a "script" is different than either a Script Subtag or a Script Property.

A principled approach would be to introduce a new Script type with a private inner representation. The type can have conversions to and from both subtags::Script and properties::Script. This new type would also have a function to get the script directionality.

However, I'm not convinced at this time that we have a clear need for that type, nor do I have an idea where such a type would live. Therefore, I tend to think that we should keep the two types context-specific and just focus on the conversion between them.
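
For concreteness, a sketch of what such a third type might look like. Every name here (`Script`, `PropertyScript`, `SubtagScript`, `Direction`, the methods) is hypothetical, and the small table and directionality arms are illustrative placeholders, not real data.

```rust
/// Hypothetical stand-ins for the two existing, context-specific types.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct PropertyScript(pub u16);
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct SubtagScript(pub [u8; 4]);

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Direction {
    LeftToRight,
    RightToLeft,
}

/// The new type: its inner representation is private, so it could change later.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Script {
    code: [u8; 4],
}

/// Illustrative numeric-value <-> code rows; the numbers are placeholders.
const NUMERIC: &[(u16, [u8; 4])] = &[(2, *b"Arab"), (17, *b"Hani"), (25, *b"Latn")];

impl Script {
    pub fn from_subtag(s: SubtagScript) -> Self {
        Script { code: s.0 }
    }

    pub fn to_subtag(self) -> SubtagScript {
        SubtagScript(self.code)
    }

    /// Fallible: None when the numeric value is not in the table.
    pub fn from_property(p: PropertyScript) -> Option<Self> {
        NUMERIC.iter().find(|e| e.0 == p.0).map(|e| Script { code: e.1 })
    }

    /// Fallible: None when the code has no numeric value in the table.
    pub fn to_property(self) -> Option<PropertyScript> {
        NUMERIC.iter().find(|e| e.1 == self.code).map(|e| PropertyScript(e.0))
    }

    /// Script directionality; None for codes this sketch does not know about.
    pub fn direction(self) -> Option<Direction> {
        match &self.code {
            b"Arab" => Some(Direction::RightToLeft),
            b"Hani" | b"Latn" => Some(Direction::LeftToRight),
            _ => None,
        }
    }
}
```

With something like this, a client holding a property value could call `Script::from_property(..)` and then `direction()` without going through a locale type.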

sffc avatar May 03 '24 22:05 sffc

I generally agree. I'm not really sure I can see any way a third type would make sense, but I think having the split is somewhat valuable, provided it's easy to convert.

Manishearth avatar May 04 '24 00:05 Manishearth

@robertbastian to leave a comment suggesting what else needs to be done, and then add this to the meeting agenda.

sffc avatar Sep 17 '24 17:09 sffc