icu4x Fallback behavior for extension keywords and auxiliary keys


tl;dr: which of the following is the correct fallback order, assuming "short" is the fallback for "long" in the aux key?

Option 1
- ar-EG-u-nu-latn/long
- ar-EG/long
- ar-u-nu-latn/long
- ar/long
- und-u-nu-latn/long
- und/long
- ar-EG-u-nu-latn/short
- ar-EG/short
- ar-u-nu-latn/short
- ar/short
- und-u-nu-latn/short
- und/short
Option 2
- ar-EG-u-nu-latn/long
- ar-EG-u-nu-latn/short
- ar-EG/long
- ar-EG/short
- ar-u-nu-latn/long
- ar-u-nu-latn/short
- ar/long
- ar/short
- und-u-nu-latn/long
- und-u-nu-latn/short
- und/long
- und/short
Option 3
- ar-EG-u-nu-latn/long
- ar-u-nu-latn/long
- und-u-nu-latn/long
- ar-EG/long
- ar/long
- und/long
- ar-EG-u-nu-latn/short
- ar-u-nu-latn/short
- und-u-nu-latn/short
- ar-EG/short
- ar/short
- und/short

There are probably more orderings.
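
For comparison, the three orderings can be generated mechanically: each one is a different nesting of the same three axes (aux key, locale, extension keyword). The following is a minimal sketch only. `FallbackOrder` and `fallback_chain` are hypothetical names made up for illustration, not ICU4X API, and locales are handled as plain strings.

```rust
/// Which of the three orderings from this issue to generate.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum FallbackOrder {
    /// Aux key outermost, then locale, then extension keyword.
    Option1,
    /// Locale outermost, then extension keyword, then aux key.
    Option2,
    /// Aux key outermost, then extension keyword, then locale.
    Option3,
}

/// Builds the list of candidates for one request.
///
/// `locales` is the vertical chain (most specific first), `keyword` is the
/// extension suffix such as "-u-nu-latn", and `aux` is the aux key chain.
pub fn fallback_chain(
    order: FallbackOrder,
    locales: &[&str],
    keyword: &str,
    aux: &[&str],
) -> Vec<String> {
    // The keyword-bearing form is always tried before the bare form.
    let suffixes = [keyword, ""];
    let mut out = Vec::new();
    match order {
        FallbackOrder::Option1 => {
            for a in aux {
                for l in locales {
                    for s in &suffixes {
                        out.push(format!("{}{}/{}", l, s, a));
                    }
                }
            }
        }
        FallbackOrder::Option2 => {
            for l in locales {
                for s in &suffixes {
                    for a in aux {
                        out.push(format!("{}{}/{}", l, s, a));
                    }
                }
            }
        }
        FallbackOrder::Option3 => {
            for a in aux {
                for s in &suffixes {
                    for l in locales {
                        out.push(format!("{}{}/{}", l, s, a));
                    }
                }
            }
        }
    }
    out
}
```

Calling `fallback_chain(FallbackOrder::Option3, &["ar-EG", "ar", "und"], "-u-nu-latn", &["long", "short"])` reproduces the Option 3 list above. Other nestings of the three loops give the "more orderings".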

Aug 15 '23 04:08 sffc

My commentary:

- The aux key is the most important piece of the locale and therefore is the last thing that should undergo fallback.
- Extension keywords are important too, because they are an explicit preference.
- We can control what goes into the root locale, so we should only put things in there that are sensible as a last-resort fallback. For example, we should not have any "long" data in root, only "short" data.

I therefore think I prefer option 3.

Aug 15 '23 04:08 sffc

Discuss with:

@Manishearth
@sffc
@robertbastian

Oct 19 '23 17:10 sffc

@robertbastian - Aux key should go last because it is usecase defined. Transliterator and datetime will use this differently. It's not considered fallback; aux keys should be ignored in everything we call fallback in icu4x.
@Manishearth - A few observations. (1) For datetime, either a locale has none of the data or some of the data. Let's say we have ar and ar-EG, and we want ar long, and ar-EG only has short, I'd prefer to go to ar-EG short first. That's logic that can be written in datetime. The thing that's missing in response metadata is number of fallback steps performed. The algorithm for datetime symbols I'd like to run is: (a) look for the aux key you're actually looking for; (b) if fallbacking was performed, perform fallback yourself, with another request; (c) if you found a solution that uses fewer steps of fallback, use it. A different way would be to request, tell me which aux keys are available for a certain key and locale, returning the first locale that has any data. This is similar in behavior to what we do currently.
@sffc - What I think is a good outcome: no matter what option we choose here, we can make sure that the outcome of data resolution is correct by ensuring the datagen provider always outputs the correct data for a particular locale with an aux key. The case we're worried about is where ar/long has different data from ar-EG/short, and if you request ar-EG/long you get different data than what you want (which is ar-EG/short). We do have a lot of cases where we have absences, like standalone vs format. The way we resolve this is that the datagen provider always outputs the full set of aux subtags, post dedup. Then we have very powerful ways of doing this fallback in datagen, where we can still strip things from datagen where the behavior is idempotent.
@sffc - Why this proposal: We have a lot of evidence that complex fallbacking incurs a lot of runtime cost, and I am wary of making fallback in, e.g., datetime's constructor too complicated. That's slow and also error-prone. It is nicer to resolve this at datagen time. The second reason is that we don't use DataResponseMetadata much and we aren't sure we're populating it correctly. We may want to remove it.
@Manishearth - Not convinced of runtime cost but I believe ICU4C has this problem. Because we store locales, this still balloons data size overall. Another way to solve the problem (not sure I like it yet) is adding an API which lets you resolve a locale without aux keys and get an iterator because currently all data providers store them adjacently.
@robertbastian - An iterator over the aux keys?
@Manishearth - Yeah.... you say you want a locale for all aux keys and it either tells you what the aux keys are or it gives you an iterator over them. There is at least one way we can do this without changing data provider APIs.
- Design one: New API: DataProvider::load_all_aux(locale) -> Iterator<Response>
- Design two: AuxKeyQueryMarker<DateSymbolsMarker>, always returns value of type AuxKeyList or AuxKeyIterator. Buffer/etc providers are tweaked to recognize the key.
@robertbastian - The way I understand you is that you want to deduplicate across aux keys at datagen time? Datagen should not know how to fall back between aux keys.
@sffc - Only impl DataProvider<MonthSymbolsV1> for DatagenProvider is aware of long/short aux keys. It will emit ar-EG/long even if there is no explicit CLDR data for that combination.
@echeran - Can you clarify the desired behavior?
@Manishearth - The behavior we wish to attain for datetimeformat is when it attempts to load e.g. monthsymbols, it will get them from the first locale in the fallback chain with any aux keys on monthsymbols whatsoever (so if we are requesting ar-EG/long we get ar-EG/short before we get ar/long)
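
A minimal sketch of that desired behavior, contrasted with aux-first resolution. This is an illustration under stated assumptions, not ICU4X code: `Entry`, `resolve_locale_first`, and `resolve_aux_first` are hypothetical names, the data is a flat list of (locale, aux key, payload) tuples, and extension keywords are ignored.

```rust
/// One stored entry: (locale, aux key, payload). Purely illustrative.
pub type Entry<'a> = (&'a str, &'a str, &'a str);

/// Locale-first: try every aux key in a locale before moving to its parent.
/// Requesting ar-EG/long finds ar-EG/short before ar/long.
pub fn resolve_locale_first<'a>(
    data: &[Entry<'a>],
    locale_chain: &[&str],
    aux_chain: &[&str],
) -> Option<Entry<'a>> {
    for locale in locale_chain {
        for aux in aux_chain {
            let hit = data.iter().find(|e| e.0 == *locale && e.1 == *aux);
            if let Some(entry) = hit {
                return Some(*entry);
            }
        }
    }
    None
}

/// Aux-first: walk the whole locale chain for one aux key before trying the
/// next aux key. Requesting ar-EG/long finds ar/long before ar-EG/short.
pub fn resolve_aux_first<'a>(
    data: &[Entry<'a>],
    locale_chain: &[&str],
    aux_chain: &[&str],
) -> Option<Entry<'a>> {
    for aux in aux_chain {
        for locale in locale_chain {
            let hit = data.iter().find(|e| e.0 == *locale && e.1 == *aux);
            if let Some(entry) = hit {
                return Some(*entry);
            }
        }
    }
    None
}
```

With only ar/long and ar-EG/short present, a request for ar-EG with the aux chain long, short resolves to ar-EG/short under the first function and to ar/long under the second.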

1. Make the standard fallback adapter have some specific magic behavior around aux. (solve problem in icu_locid_transform)
2. In datagen, always generate data such that regular fallbacking will always produce the desired behavior. Types like DateTimeFormat do nothing fancy! (solve problem in icu_datagen)
3. Perform the fallbacking in types like DateTimeFormatting using additional APIs like "fallback iteration count" or "load all aux keys". (solve problem in component crate like icu_datetime)
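
To make the datagen-side approach concrete, here is a rough sketch of pre-resolving aux keys so that plain vertical fallback at runtime returns the desired data. Everything here is an assumption for illustration: `Table`, `preresolve`, and `load` are hypothetical names, payloads are strings, and the dedup step mentioned in the discussion is left out.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// (locale, aux key) -> payload. A stand-in for real data structs.
pub type Table = BTreeMap<(String, String), String>;

/// Expands sparse source data so that a locale with any entry also carries
/// the aux keys it is missing, borrowed from its own later aux keys.
/// `aux_chain` is ordered so that each key falls back to the ones after it.
pub fn preresolve(source: &Table, aux_chain: &[&str]) -> Table {
    let mut out = source.clone();
    let locales: BTreeSet<String> = source.keys().map(|k| k.0.clone()).collect();
    for locale in &locales {
        for (i, aux) in aux_chain.iter().enumerate() {
            let key = (locale.clone(), aux.to_string());
            if source.contains_key(&key) {
                continue;
            }
            // Horizontal fallback, done once at datagen time.
            let donor = aux_chain[i + 1..]
                .iter()
                .find_map(|fb| source.get(&(locale.clone(), fb.to_string())));
            if let Some(payload) = donor {
                out.insert(key, payload.clone());
            }
        }
    }
    out
}

/// Plain vertical fallback at runtime: the aux key is matched exactly.
pub fn load<'a>(table: &'a Table, locale_chain: &[&str], aux: &str) -> Option<&'a str> {
    locale_chain
        .iter()
        .find_map(|locale| table.get(&(locale.to_string(), aux.to_string())))
        .map(String::as_str)
}
```

Given source data for ar/long, ar/short, and ar-EG/short, `preresolve` adds an ar-EG/long entry holding the ar-EG/short payload, so `load` for ar-EG with "long" no longer falls through to ar/long. The cost is the extra stored entry, which is the data size concern raised above.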

@robertbastian - We're focused a lot on this ar-EG/long issue. I'm not convinced option 2 is the best solution for all use cases. I'm not in favor of option 1 either.
@Manishearth - I think solution 2 also could work for currencies.
@sffc - Option 1 is infectious, impacting all components, even ones that don't use aux keys. Option 2 and 3 are component by component. We can make that call later. For currencies we may want that iterator.
@echeran - This sounds like a space/time tradeoff.
@sffc - Good point; and actually option 2, though it increases postcard size, may also help reduce code size since less special logic is required in the constructor.
@sffc - The horizontal fallback that we're worried about is a CLDR optimization. We should try to get rid of CLDR optimizations in datagen in general so that we only have ICU4X optimizations applied. It does not make sense to put CLDR horizontal fallback algorithms into runtime constructors in general.

Conclusion: Use either 2 or 3 on a component-by-component basis. Different components have different needs.

LGTM: @sffc @Manishearth @echeran

Oct 19 '23 18:10 sffc

We still need to discuss the part about Unicode extension keyword fallback priority.

Discuss with:

@sffc
@robertbastian

Optional:

@Manishearth

Oct 19 '23 18:10 sffc

How do DataKeyAttributes behave in fallback?

// exhaustive
struct DataRequest<'a> {
    pub langid: &'a LanguageIdentifier,
    pub attributes: &'a DataKeyAttributes,
    pub metadata: DataRequestMetadata,
}

pub struct DataKeyAttributes(DataKeyAttributesInner);

// Bump this if there's a need for more space. 
// 8 is currently needed by components that use BCP subtags as attributes
// (collator, transliterator).
const DATA_KEY_ATTRIBUTES_RUNTIME_SIZE: usize = 8;

enum DataKeyAttributesInner {
    Static(&'static [&'static str]),
    Runtime(ShortVec<TinyAsciiStr<DATA_KEY_ATTRIBUTES_RUNTIME_SIZE>>),
}

The data key attributes need not participate in fallback. They can be resolved in datagen. The constructor is allowed to fall back from one attribute to another, such as when the langid reaches und. This is compatible with preresolved fallback since it always occurs.

Segmentation model fallback can be data-driven in the segmenter constructor.

Notes for collation fallback order:

yue = yue-Hant > und-Hant > ~und~
zh-TW = zh-Hant-TW > zh-Hant > und-Hant > ~und~
yue-CN = yue-Hans-CN > yue-Hans > und-Hans > ~und~
zh = zh-Hans > und-Hans > ~und~
zh-u-co-stroke ~> zh = zh-Hans > und-Hans > ~und~, with -x-stroke data key attribute

The locales that are populated with data:

und-Hant (contains stroke data)
und-Hans (contains pinyin data)
und-Hani-x-pinyin
und-Hani-x-stroke
und-Hani-x-zhuyin

This uses a new script fallback mode:

- always normalizes by adding the script using likely subtags
- then chops off region, followed by language
- contains extra parents to allow defining stuff for "generalized chinese things"
  - und-Hant > und-Hani
  - und-Hans > und-Hani
- eventually reaches und
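
The steps above could be sketched roughly as follows. This is not the icu_locid_transform implementation: the function names are hypothetical, the likely-subtags lookup is a tiny hard-coded table covering only the four examples in the collation notes, and the chain spells out the und-Hani parent that the abbreviated chains in those notes omit.

```rust
/// Tiny stand-in for likely-subtags data; real data would come from CLDR.
fn likely_script(language: &str, region: Option<&str>) -> Option<&'static str> {
    match (language, region) {
        ("zh", Some("TW")) => Some("Hant"),
        ("zh", _) => Some("Hans"),
        ("yue", Some("CN")) => Some("Hans"),
        ("yue", _) => Some("Hant"),
        _ => None,
    }
}

/// Extra parents for "generalized Chinese" data, as listed in the notes.
fn extra_parent(locale: &str) -> Option<&'static str> {
    match locale {
        "und-Hant" | "und-Hans" => Some("und-Hani"),
        _ => None,
    }
}

/// Builds the fallback chain for the script fallback mode.
pub fn script_fallback_chain(language: &str, region: Option<&str>) -> Vec<String> {
    let script = match likely_script(language, region) {
        Some(s) => s,
        // Unknown to this toy table: the sketch goes straight to the root.
        None => return vec!["und".to_string()],
    };
    let mut chain = Vec::new();
    // Step 1: the normalized form always carries the script.
    if let Some(r) = region {
        chain.push(format!("{}-{}-{}", language, script, r));
    }
    // Step 2: chop off the region, then the language.
    chain.push(format!("{}-{}", language, script));
    let mut current = format!("und-{}", script);
    chain.push(current.clone());
    // Step 3: follow any extra parents.
    while let Some(parent) = extra_parent(&current) {
        chain.push(parent.to_string());
        current = parent.to_string();
    }
    // Step 4: eventually reach und.
    chain.push("und".to_string());
    chain
}
```

For example, `script_fallback_chain("zh", Some("TW"))` walks zh-Hant-TW, zh-Hant, und-Hant, und-Hani, und.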

The same mode will be usable for transliterator.

LGTM: @robertbastian @sffc

Mar 20 '24 11:03 robertbastian

The rewriting of this code should incorporate the new CLDR 45 fallback rules. https://github.com/unicode-org/icu4x/pull/4782

Apr 22 '24 18:04 sffc
