icu4x icon indicating copy to clipboard operation
icu4x copied to clipboard

Implement and re-implement auxiliary keys

Open sffc opened this issue 2 years ago • 27 comments
trafficstars

The model for auxiliary keys is discussed in #1441. We need it for at least two components (currency and display names, #3260).

sffc avatar Jul 05 '23 08:07 sffc

What separator should we use?

  • @younies - Slash makes sense because "/length/meter" is nice
  • @robertbastian - Is there a performance difference?
  • @sffc - Probably not
  • @Manishearth - Maybe we should allow slash inside the aux key. But then it could make sense for the main separator not be slash.
  • @sffc - If we allow slashes inside the aux key then probably the main separator should be / so that directory structures make more sense
  • @robertbastian - This assumes FsDataProvider will use to_string as is, it could just to_string the locale and aux key separately
  • @younies - If all are / it looks like a URL which is good
  • @Manishearth - FsDataProvider doesn't need to have a 1:1 mapping, but I don't necesarilly think it needs to be clean; it doesn't need to be "path is the path" mapping. It's nice with sorting if the aux keys clump together.
  • @robertbastian - We've been talking about single aux keys. We could encode layered aux keys on the API. So then length/meter is exposed as a multipart aux key. And then we should use the same separator for everything.
  • @sffc - Let me explain the sorting property. en$USD < en-GB and also en < en-GB so we can compare the locid portion and that implies the sort order for the whole string, and we only need to look at the aux key if the locids are exactly equal. But en/USD > en-GB so we can no longer separate those two steps in the comparison operation.
  • @sffc - Is there anyone here who prefers a character other than / if the sorting code weren't an issue?
  • @Manishearth - I do like having the clean sorting property.
  • @robertbastian - Understand that you want sorting consistency, don't actually think that's the case. We just have to generate our data in a way that sorts the same. E.g. in baked provider we can just define our own sort that doesn't need to match string sorting
  • @sffc - I def think that the sort order where string is same as semantic is v important. Def cases where you want to serialize as strings and do easy binary search. On a case sensitive fs it's also useful for the sort order to be the same so the list of files is presorted. E.g. a url data provider may want the sort order to be the same. I think sort order is more important than slashes looking pretty. Not something I think is a hard blocker but giving precedence.
  • @Manishearth - We need data loading infrastructure to be fast, so I would prefer for search and other things to be straightforward. A string binary search is a good way of doing that.
  • @robertbastian - I don't understand the arguments about filesystem or URLs; binary search isn't used there. We use it in bake provider and blob provider. We have upgraded types in those situations so the sort order of the strings doesn't have to match sort order on stringified types.
  • @sffc - The fact that these are locales is not a property that needs to be tracked through the stack. We turn the locales into strings, and it is a useful property that the strings and locales sort in the same order. We shouldn't break that property if we don't need to for a good reason.
  • @robertbastian - The strings in Blob and Baked provider are implementation details to allow fast lookup for DataLocales. We don't need the property that they sort the same as strings
  • @Manishearth - My main constraint is that I want the sort property maintained: DataLocale should sort such that the auxiliary keys get grouped together. Whether or not their string representation also sorts that way is a nice-to-have. We might at some point have locales with and without auxiliary in the same data key, and locales should not mix.

Competing priorities:

a) Locale sort should be the same as string sort (@sffc) b) Locales with auxiliary keys should be sorted adjacent to the corresponding locale without an auxiliary key (@manishearth) c) Slashes are the character that makes most logical sense (@younies) d) For multipart auxiliary keys, the separators should be the same (@robertbastian) e) The solution should have maximal performance

  • @robertbastian - I'd prioritise them b > d > c > a
  • @sffc - I'd prioritize them a > b > d > c
  • @manishearth b > a > c > d

Options on the table:

  1. en#length#meter / en+length+meter
    • Satisfies a, b, d, e. Not c
  2. en#length/meter
    • Satisfies a, b, (c), e. Not d
  3. en/length/meter with / sorting before -
    • Satisfies b, c, d, e. Not a
  4. en/length/meter with / sorting after -
    • Satisfies a, c, d, e. Not b
  • Seems option 1 is the best overall. Now we have to pick the separator.
  • @someone: # has more special meaning in programming langauges and URLs
  • + liked by: @Manishearth @skius @robertbastian
  • # liked by: (@sffc)
  • Decision: en+length+meter

LGTM: @Manishearth @younies @sffc @robertbastian @skius

sffc avatar Aug 17 '23 18:08 sffc

For future discussion (before 1.3):

It occurred to me that another repr could be ShortVec<TinyStr8>, the same as the repr for variants and extension keywords. However, this would limit const construction to a single 8-byte aux key. On the upside, we could weasel out of all our sorting issues by sticking the aux key in -x- in the DataLocale. There was a much older design that used this model.

sffc avatar Aug 18 '23 00:08 sffc

Reopening to finalize the discussion above

sffc avatar Aug 18 '23 22:08 sffc

Follow-up for datagen filtering: https://github.com/unicode-org/icu4x/issues/3907

sffc avatar Aug 22 '23 15:08 sffc

  • @robertbastian - I was thinking, we could leave it as a string for now, and later we could figure out how it would work, but we might not be able to change that in 2.0?
  • @skius - The 8-character limit would require a different transliterator ID design, because right now it's a BCP-47 t id.
  • @robertbastian - Each subtag is at most 8 characters.
  • @sffc - en-US+foo-bar-baz would be permited, but not en-US+foobarbaz
  • @robertbastian - What other components do we want to use this for: currency, translit, display names
  • @zbraniecki - why aren't we using unicode -t- stuff?
  • @skius - because it's not in DataLocale
  • @sffc - We are using it, just in the aux key, like und+en-t-zh
  • @Manishearth - The transliterator IDs already fit in shortvecs like this, so it doesn't harm the use case for transliterator
  • @sffc - If we added in the subtag restriction, we could serialize the DataLocale as en-US-x-foo-bar-baz
  • @robertbastian - If we did that, can we use the private use extensions inner type? It's an important implementation detail because I would like DataLocale to not diverge too much from Locale
  • @sffc Any opposition to using -x for the subtags
  • @robertbastian - This has the string sorting issue, yes? If the region starts with Y or Z, yes?
  • @sffc - It should work fine; comparison works in terms of subtag comparison.
  • @younies we need more than one subtag, since for the units you can say en-US-mu-...
  • @sffc - DataLocale supports Unicode extension keywords already.
  • @Manishearth - Right now the things that we're talking about are display names (region/variant), transforms, and currencies.
  • @robertbastian - Yes, please try this. Whether we use -x- or + is a bikeshed. Please try whichever is easier.
  • @manishearth: +1 internal detail
  • @skius - Can we represent multiple auxiliary keys with -x-?
  • @sffc - We could have multiple -x- yes? und-x-foo-x-bar
  • @Manishearth - Or und-x-foo-bar-separato-baz-quux if multiple -x- isn't allowed
  • @zbraniecki - The syntax is: Subtag, 1..=8, s.is_ascii_alphanumeric(),
  • @manishearth what if the user uses -x themselves
  • @robertbastian Nah, there's no easy way for that to end up in a DataLocale
  • @skius - How do we separate these?
    1. foo-separato-bar => und-x-foo-separato-bar vs
    2. foo, bar => und-x-foo-x-bar
  • @sffc - Is the new -x- thing more desirable than the existing solution?

Conclusion: implement the -x- thing, but defer mostly to the implementer's call.

LGTM: @sffc @Manishearth @skius

sffc avatar Aug 24 '23 18:08 sffc

Thought: we should doc-hide auxiliary keys in 1.3, everything that uses them is experimental and we should also consider them experimental

robertbastian avatar Sep 05 '23 19:09 robertbastian

I have the -x- auxiliary key in #4144. It makes the subtag comparison code simpler and it also adds what I consider the nice property that we bring back 1-to-1 mapping between Locale and DataLocale.

Given this change, and given that no one has every been super happy with the AuxiliaryKeys name, I'd like to propose that we change it to AuxiliarySubtags.

sffc avatar Oct 13 '23 00:10 sffc

Discussion:

  • Use "auxiliary subtags" as the name
  • Investigate and very likely implement the use of the Private subtags struct from icu_locid inside DataLocale
  • Investigate and very likely remove the public AuxiliaryKeys type from icu_provider

LGTM: @sffc @Manishearth @robertbastian

sffc avatar Oct 19 '23 18:10 sffc

Discussion on 2023-12-12: https://github.com/unicode-org/icu4x/pull/4412#issuecomment-1852476099

sffc avatar Mar 06 '24 09:03 sffc

Start time 10:54

  • @robertbastian - There are 3 types of data being passed: DataKey, AuxKey, and Locale. The aux key should not be in the DataLocale if the DataLocale is in public APIs. It could possibly stay on DataLocale if we remove DataLocale from our constructors, but there's still the problem that we use DataLocale in other APIs such as LocaleFallbacker.
  • @Manishearth - We could keep the current design and each constructor errors if you set aux key.

(discussion moves to whiteboard)

Possible new approach:

#[derive(Copy)]
#[cfg(not(icu4x_v2))]
pub struct DataRequest<'a> {
    langid: &'a DataLocale,
    metadata: DataRequestMetadata,
}

// new
#[derive(Copy)]
#[cfg(icu4x_v2)]
pub struct DataRequest<'a> {
    langid: &'a LanguageIdentifier,
    unicode_keywords: &'a unicode::Keywords,
    auxiliary_keys: &'a ShortBoxSlice<TinyStr8>,
    metadata: DataRequestMetadata,
}

// Runs fallback on a LanguageIdentifier. Needs data.
icu_locid_transform::LanguageIdentifierFallbacker::new()

// Runs fallback with aux key. Exact signature TBD. No data.
// used by LocaleFallbackAdapter and baked provider
icu_provider::DataFallbacker::new(langid_fallbacker, data_key_config).load(data_request)

LGTM: @manishearth @robertbastian @sffc

Separate discussion on positional arguments:

  • @sffc - DataRequest is like an options bag for DataProvider. It is the same options bag for all of the four DataProvider traits. It is better than positional arguments because you always get named Default::default() behavior, because you can pass the same req parameter into all of the traits, and because you don't need to copy the fairly lengthy function signature into all of the places we implement the trait, making it easier to implement data fallback adapters.
  • @robertbastian - Positional arguments are nice because it reduces the nesting of arguments; DataReqest is not a real type, it is just a bag of arguments. It is desirable for changes to function signatures to require updating all of the call sites. In general you need to be more explicit, which is desirable.
  • @Manishearth - Eh, either way works. Positional arguments are maybe a little nicer here because it reduces the number of types. Personally I think the tradeoff between positional and structs is that structs are useful when you have >2 args in cases where "which argument is which" is ambiguous when reading callsites. That's not the case here, most calls look like provider.load(langid, aux, metadata) which is quite clear.
  • @manishearth: BUT if we add "keywords" then "keywords, aux_keys, langid" can start getting confusing at the call site. good reason to use structs.
  • @sffc - I would be open to moving out just metadata, keeping the rest of the data model in DataRequest, and change it to DataLocale
  • @robertbastian - We shouldn't name it DataLocale because aux keys are not part of the locale
  • @sffc - Ok
pub trait DataProvider {
    pub fn load(&self, req: DataRequest, metadata: DataRequestMetadata)
}

pub trait BufferProvider {
    pub fn load_buffer(&self, key: DataKey, req: DataRequest, metadata: DataRequestMetadata)
}

sffc avatar Mar 06 '24 11:03 sffc

Discussion: New shape of the API:

pub struct DataRequest<'a> {
    langid: &'a LanguageIdentifier,

    // -u-nu-arab-x-long
    // [..., tinystr!("x"), tinystr!("long")]
    // (name to be bikeshed)
    key_specific_fields: KeySpecificFields<'a>,

    metadata: DataRequestMetadata,
}

pub trait DataProvider {
    pub fn load(&self, req: DataRequest) -> Result<DataResponse, DataError>
}

Conclusion:

  • keep method signatures
  • Make DataRequest exhaustive, fields as above
  • Keep DataRequestMetadata non-exhaustive

LGTM: @sffc @robertbastian

Bikeshedding the name of the middle field:

  • @robertbastian - "DataRequest" is not a very interesting name; everything is a "request". But I can't think of a better name. But I can't think of a better one so happy to keep it
  • @sffc - DataRequestMetadata is already there, so we could keep it unless there's a much better name.
  • @robertbastian / @sffc - Given that icu_preferences is structured, it makes sense to keep the structure here; we don't necesarily need to use icu_locale_core::extensions::unicode::Keywords
  • @echeran - Something with multiple lookup tables for the aux keys
  • @robertbastian - good point, collapsing unicode extensions and aux keys doesn't let us do that very granularly
  • @sffc - The hierarchy can still be encapculated in a single type. I don't think extension keywords are used in enough places right now to justify them as a separate level from the rest of the auxiliary keys, which we expect to be used even more.
  • @robertbastian - the idea is that we can have an c-stage lookup (where c is const), but that might actually be slower than a single lookup over a table that's c times bigger
  • @echeran - Not sure about the word "specific"... I guess it's correct, but it is also general at the same time. It doesn't help me distinguish it from other concepts.
  • @robertbastian - We use the BCP-47 encoding to store this in data providers, but that is an implementation detail
  • @sffc - It will become an implementation detail
  • @echeran - I like the use of "Key" in "AuxiliaryKey" because it shows the connection to the data provider machinery's data key.
  • @Manishearth - All of these things look up data, but where does it stop being a key and start being a locale?

planned: DataKey -> Auxiliary Key -> DataLocale -> Data current: DataKey -> DataLocale -> Data proposed: DataKey -> (LanguageId, more ?) -> Data

Nouns for the thing in the middle:

  1. ~~Fields~~
  2. ~~Options~~
  3. Specifics (only for DataKeySpecifics)
  4. ~~Keys~~
  5. Attributes
  6. ~~Descriptors~~
  7. ~~Qualifiers~~
  8. ~~Filter(s)~~
  9. Constraints
  10. Extensions

Discussion:

  • @sffc - I veto "Options" because it has a very specific meaning in ICU4X: options bags as passed to constructors.
  • @robertbastian - But this is literally what this is, it's the untyped options bag.
  • @Manishearth - "Fields" is used in datetime. "Filter" is a thing that can take multiple sets.
  • @echeran - The word "filter" is like a "constraint".
  • @Manishearth - "constraints" can give you multiple options in the end. It would be like calling array indexing a "constraint".
  • @robertbastian - they do give you multiple options, you get all the locales with that constraint/filter
  • @echeran - The thing about filter/constraint is that it seems you combine multiple things to make sure the lookup is satisfied.
  • @Manishearth - I can see about the layering, mixing/matching. We have 3 layers of filters that produce an output.
  • @sffc - In ICU4X, "Options" are bare structs with no methods. The place where the term "options" could fit would be the thing called DataRequest (but I still like DataRequest). Also the term "option" is strongly overloaded because it could mean Some/None. I really don't think we should use that term again here.

Adjectives/Prefixes:

  1. Key
  2. key-specific
  3. DataKey-specific
  4. domain
  5. domain-specific
  6. auxiliary
  7. Data
  8. DataKey

Discussion:

  • @Manishearth - I like the term "key specific"
  • @echeran - If you want to connect it to the data key functionality, you could say "data key filter".
  • @sffc - I don't like "domain" because it is a new term for not a new concept.
  • @robertbastian - Key and DataKey will produce (Data)KeyFilter, which is not correct, this is not doing selection on the key level
  • @echeran - I think "specific" is redundant
  • @robertbastian - "KeyOptions" and "KeySpecificOptions" are quite different
  • @echeran - I think "DataKeyConstraint" more accurately describes the situation, because it is a combination of different things you apply to narrow down the set of data you're looking for.
  • @robertbastian - I want to convey the meaning that this has different meanings for different keys. It is not something you can just pass around.
  • @Manishearth - I agree. I think it's important to make sure people don't mix these, and that's not a hypothetical, that's very likely in datetime where we're juggling datetime vs decimal data with different aux keys.
  • @sffc - I think we should avoid introducing confusion. I don't know if it necesarilly needs to be in the name.
  • @echeran - plus one
  • @robertbastian - The name is the most powerful way to do that
  • @sffc - I kind-of prefer just Data as the prefix because that is how we name most other types in the icu_provider crate. Call it DataAttributes and move on.

Options summarized:

  1. data key specific 1a. DataKeySpecificAttributes 1b. DataKeySpecificConstraints 1c. DataKeySpecificExtensions
  2. domain or something like domain 2a. DomainSpecificAttributes 2b. DomainSpecificConstraints 2c. DomainSpecificExtensions
  3. custom 3a. CustomAttributes 3a. CustomConstraints 3a. CustomExtensions
  4. data key 4a. DataKeyAttributes 4b. DataKeySpecifics (NOTE: Not "constraints") 4c. DataKeyExtensions
  5. auxiliary 5a. AuxiliaryAttributes 5b. AuxiliaryConstraints 5c. AuxiliaryExtensions
  6. data 7a. DataAttributes 7b. DataConstraints 7c. DataExtensions

vote offline

sffc avatar Mar 11 '24 15:03 sffc

Ballots

Results: 4a and 4c are generally ranked closely, except for two ballots where they are far apart, one of which strongly favors 4a and the other of which strongly favors 4c. However, between 4a and 4c, 4a wins the head-to-head (5 to 3), so therefore 4a, DataKeyAttributes, is the winner.

sffc avatar Mar 19 '24 18:03 sffc

ICU4X-WG discussion about migration to DataKeyAttributes:

  • The change from &DataLocale to &LanguageIdentifier and DataKeyAttributes<'a> seems smallish and can be done post-1.5. The large-scale change in components is easy enough to do.
  • Primarily, the non-automatable changes will likely be for types like BlobProvider/exporters

sffc avatar May 09 '24 18:05 sffc

The changes to auxiliary keys can be done now, as they're experimental, I could get started on those. Unless @sffc is working on it, as he's assigned?

robertbastian avatar May 16 '24 12:05 robertbastian

I did the original implementation of AuxiliaryKeys. You're welcome to take the implementation of DataKeyAttributes.

sffc avatar May 16 '24 15:05 sffc

Actually due to features, this is quite messy. Currently this is all controlled by the icu_provider/experimental feature, as soon as anything moves out of DataLocale, all consumers will have to have a feature for key attributes. This includes baked data, so I don't think this is feasible before 2.0 after all.

robertbastian avatar May 21 '24 10:05 robertbastian

Would struct DataKeyAttributes<'a>(&'a str) cover all our requirements? I think all planned use cases workd with strs: datetime has fixed static strs, transliterator has to create a String anyway, currencies and locale display names will be able to borrow from TinyAsciiStr subtags, what else is there?

robertbastian avatar May 30 '24 18:05 robertbastian

Display names might need to load "en-GB" which is two subtags and requires making a string to glue them into &str. Which might be fine

Datetime might have combinations for date and time like ym0d-jms

Hmmm, &dyn Writeable? (but we don't like trait objects)

Or how about just &[Subtag]

sffc avatar May 30 '24 20:05 sffc

I'd prefer using TinyAsciiStr<8> over Subtag, I want to break the association that data key attributes have anything to do with locales. However I don't like to be restricted to 8 characters, if we used &[&str] that can still be borrowed from locale!("en-GB").

robertbastian avatar May 31 '24 06:05 robertbastian

&[&str] (and &[TinyAsciiStr]) is double the pointers for single-element attributes compared to &str in both postcard and baked data.

robertbastian avatar May 31 '24 08:05 robertbastian

&[TinyAsciiStr] has only 1 pointer.

sffc avatar May 31 '24 18:05 sffc

I don't see what the repetition gains us? Having a slice costs 16 bytes.

robertbastian avatar May 31 '24 18:05 robertbastian

&str and &[TinyAsciiStr] are the same size.

The "repetition" (tinystr slice) allows us to represent concepts such as en-GB and ym0d-jms without allocating anything.

I'm not 100% convinced which one is better. Just laying out the facts.

sffc avatar May 31 '24 18:05 sffc

Both en-GB and ym0d-jms are slightly shorter with &str though, because with TinyAsciiStr even a two-character tag takes 8 bytes (if that's the tag length we're going with).

robertbastian avatar May 31 '24 18:05 robertbastian

The data representation is a different discussion from the API representation though.

robertbastian avatar May 31 '24 18:05 robertbastian

If we end up using something tinystr-based, it would be nice (and required for performance) to make it in the API, and we might want to remove the string constructors form the API.

sffc avatar May 31 '24 20:05 sffc

Conclusion:

pub struct DataMarkerAttributes {
    // Validated to be non-empty ASCII alphanumeric + hyphen
    value: str,
}
  • This is cheapest to compare (unlike SmallVec or other schemes), and most flexible
  • DataRequest uses borrowed types anyway, in iter_requests we return Box<DataMarkerAttributes>

LGTM: @sffc @robertbastian

robertbastian avatar Jun 20 '24 13:06 robertbastian