icu4x icon indicating copy to clipboard operation
icu4x copied to clipboard

Improve efficiency of converting ResourceOptions (DataLocale) to DataRequest

Open sffc opened this issue 3 years ago • 7 comments

We will be able to slightly improve the efficiency of converting ResourceOptions to DataRequest if we only copy over the extensions relevant to a particular data loading operation, which we know from the ResourceKey at the point at which we create the DataRequest.

sffc avatar Jul 15 '22 03:07 sffc

#2224 is a wider improvement, we borrow DataLocales inside the DataRequest (which becomes copy and can be passed by value). The only case where we need to mutate it, during fallback, we do a single clone (@sffc and I have discussed this, and runtime fallback is already an expensive operation).

Now the problem becomes that the client might have a &Locale, a &LanguageIdentifier, or maybe in the future some borrowed FFI format. We cannot create a &DataLocale from these in its current shape.

I propose that instead of passing &DataLocales, we make it a by-value reference container:

pub enum DataLocale<'a> {
  Locale(&'a Locale),
  Langid(&'a LanguageIdentifier),
  /// We could even do this for FFI and macro-assisted construction
  ValidatedString(&'a str),
}

This representation is Copy and allows implementing strict_cmp[^2] and Writeable, which, outside fallback, is all we need from the DataLocale. It also allows implementing From<&Locale> and From<&LanguageIdentifier> without allocation, and it would reduce the stack size of a DataLocale from 72 to 9 or 5 bytes (actually probably 17 with the &str?).

For vertical fallback we need to allocate the locale anyway (a &mut doesn't help as we'd have to restore it after fallback). With #2224 we do this with DataLocale::clone, and we could continue doing that, calling the current DataLocale something like FallbackOptimizedLocale and constructing it from one of the references (for the same cost we currently pay for DataLocale::clone [^1]).

We could also create a full Locale from the DataLocale and work on that. This would reduce duplication between DataLocale and Locale, as well as LocaleCanonicalizer and LocaleFallbackProvider. Allocation-wise this is the same (we wouldn't clone non-unicode extensions), Locale just has a bigger stack size than DataLocale/FallbackOptimizedLocale. This would have the added advantage that we can return a Locale in the DataResponseMetadata, and clients don't have to deal with another type there.

[^1]: The string case would require validating in From<&'a str> as well as parsing and allocating in Into<Locale>, whereas we could also parse and allocate once at the FFI boundary. Performance will depend on whether parsing is more expensive than allocating, and if vertical fallback is even used. [^2]: We wouldn't use Locale::strict_cmp directly as we have to ignore non-unicode extensions

robertbastian avatar Jul 21 '22 10:07 robertbastian

Here's how I see things progressing.

The type being passed into the constructors is DataLocale for now, but I anticipate changing to something like Preferences in the future. Therefore, we need to perform some conversion operation between the constructor type and the data provider type.

I can see a future where DataLocale becomes Copy because it will never contain more than the following:

  1. Language, Script, Region
  2. Single Variant
  3. Region and Subdivisision
  4. Single Keyword

I can also see a future where we want to add additional fields to DataLocale to support user preferences that cannot be encoded by BCP-47, such as a default datetime pattern.

I therefore see a more optimized path from constructor to DataRequest to be along these lines:

let data_locale: DataLocale = DataLocale::new::<Marker>(&preferences);
let req = DataRequest { data_locale: &data_locale, metadata: Default::default() };

Notes:

  • DataLocale::new would use the Marker::KEY to determine which fields it needs to pull from &preferences
  • If we instead took &data_locale by reference, we could return a Cow<DataLocale> and save on allocations sometimes
  • If we want to merge this back to one line, we could make DataRequest own its locale again, or at least make the field be a Cow (but I think I like splitting it into two lines)

sffc avatar Jul 22 '22 01:07 sffc

I also want to emphasize the advantages of DataLocale over Locale:

  1. DataLocale has smaller code size thanks to a smaller Drop impl and fewer things to maybe deallocate
  2. DataLocale has smaller stack size (but, as discussed previously, I don't really know the impact of this)
  3. DataLocale allows us to add additional private fields in the future to support more use cases
  4. DataLocale contains a number of APIs that received pushback when I tried adding them to plain Locale

In short, DataLocale exists because over the last two and a half years we could not come to consensus on how to make Locale work better for the data provider use case, and we agreed to keep data loading requirements confined to the data provider, with Locale being something whose only goal is to conform to UTS 35.

In my mind it is therefore going backwards to try to "replace" DataLocale with Locale again.

sffc avatar Jul 22 '22 02:07 sffc

My thinking is that there will be some long-lived Locale object that represents the human, and many API calls will use this. In the future this might be a preferences object containing a Locale, but right now it would be a Locale, so I'm gonna use those interchangeably below.

Now if we can create a DataLocale by cheaply copying things from Preferences, great. Your example seems to do to that (it shouldn't hide an allocation). However, we're not there yet. Our From<&Locale> for DataLocale allocates, which we can stop doing if we introduce a lifetime. If we improve this to be non-allocating in the future, we can remove the lifetime, that's the kind of change we can do in a point release. However I think we should introduce it now.

The future I'm seeing is:


// Big struct, 100s of bytes, exists once per user/request/whatever
struct Preferences {
  locale: Locale,
  datetime_pattern: ...,
  ...
}

fn generate_user_message(prefs: &Preferences) {
   ...
   DateTimeFormatter::try_new(prefs, ...)
   ...
   ListFormatter::try_new_and(prefs, ...);
   ...
   Plurals::try_new(prefs, ...)
   ...
}

impl ListFormatter {
  fn try_new(prefs: &Preferences, ...) -> Self {
    ... DataRequest { locale: DataLocale::borrow::<AndListV1Marker>(prefs), ... } ...
  }
}


// Small struct, copies some important bits and borrows bigger ones
#[derive(Clone, Copy, ...)]
struct DataLocale<'a> { // DataPreferences?
  language: Language,
  region: Option<Region>,
  default_datetime_pattern: Option<&'a ...>,
  ...
}

impl<'a> DataLocale<'a> {
  fn borrow<D: KeyedDataMarker>(prefs: &'a Preferences) -> Self {
    // copy/borrow things from prefs
  }

  fn strict_cmp(&self, other: &[u8]) -> Ordering { ... }
}

In my mind it is therefore going backwards to try to "replace" DataLocale with Locale again.

I'm not advocating putting an owned Locale into the DataRequest, where your points about stack size, drop, and custom APIs all make sense. I just want to borrow. DataLocale&Locale will not be Drop, will be tiny on the stack, and we can define all the APIs on it that we want (although we just need strict_cmp, Writeable, get_langid and get_unicode_extension for lookup, which are all available on Locale as well).

I think it would be nicer to use &Locale on our public APIs. This is the type that clients will have, we shouldn't force another type on them if we can borrow internally and still be as efficient (or even more efficient, if we need different DataLocales per key, they won't be reusable).

We can also define struct Preferences(Locale) today and put that in all our APIs, so we can add all the non-BCP47 fields we want in the future.


I've ignored vertical fallback in this discussion because I think it's orthogonal. VF cannot mutate the DataLocale in the request directly (even if it owned it, if it fails it has to recover the initial state), so it always needs to make its own copy. And if it's doing that, it can use any format internally that can be easily converted into DataLocale. DataLocale in its current shape of course works, but it's not the only representation that does.

robertbastian avatar Jul 22 '22 10:07 robertbastian

Note that part of this issue is resolving that certain APIs (mainly DTF) currently require that LocaleFallbackProvider be enabled. We should make sure that the extensions are specified correctly before passing the DataLocale into the data provider.

sffc avatar Jul 23 '22 00:07 sffc

To summarize my takeaways from a discussion on this topic with @robertbastian:

  • We both agree that eventually the constructor type, which I'll call Preferences, should be different than the data provider / vertical fallback type, which I'll call DataLocale
  • My assertion that DataLocale can eventually be Copy is incompatible with the requirements of the constructor type, which needs to carry a mostly-unbounded set of keys
  • If Locale is the only type that is capable of passing options into the constructor, then it requires callers to allocate a Locale if they wish to use extension keywords, which increases code size because of Clone and Drop impls
  • It would be nice to be able to pass in instances of Locale and LanguageIdentifier directly, without having to convert them to some weird type; however, this is all likely to change once we have proper Preferences.

Note that this conversation has significant overlap with #419 and #2136.

sffc avatar Jul 24 '22 04:07 sffc

@sffc to investigate potential internal improvements here.

sffc avatar Jul 28 '22 18:07 sffc