icu4x icon indicating copy to clipboard operation
icu4x copied to clipboard

Data key thoughts

Open robertbastian opened this issue 2 months ago • 5 comments

Current architecture:

pub trait DataMarker { ... }
pub trait DynamicDataProvider<M: DataMarker> { load(&self, DataKey, DataRequest) }

pub trait KeyedDataMarker: DataMarker { const KEY: DataKey; }
pub trait DataProvider<M: KeyedDataMarker> { load(&self, DataRequest) }
  • Data keys do not appear anywhere in the DataProvider paths, they are only used in DynamicDataProvider
    • BufferProvider/AnyProvider are functionally DynamicDataProvider<BufferMarker/AnyMarker>
  • Conceptually, they can be seen as dyn KeyedDataMarker
  • Users never see data keys, in the API bounds they see KeyedDataMarkers
    • Yet, we make them select data keys in datagen

Proposal 1: Align data marker names with data provider names

pub trait DynamicDataMarker { ... }
pub trait DynamicDataProvider<M: DynamicDataMarker> { load(&self, DataKey, DataRequest) }

pub trait DataMarker: DynamicDataMarker { const KEY: DataKey; }
pub trait DataProvider<M: DataMarker> { load(&self, DataRequest) }

Proposal 2: Get rid of "data key" terminology

  • Instead, call it what it really is: an erased DataMarker (new terminology)
pub trait DynamicDataMarker { ... }
pub trait DynamicDataProvider<M: DynamicDataMarker> { load(&self, DataMarkerInfo, DataRequest) }

pub trait DataMarker: DynamicDataMarker { const INFO: DataMarkerInfo; }
pub trait DataProvider<M: DataMarker> { load(&self, DataRequest) }
  • This also includes replacing any "data key" terminology in datagen by "data marker", as we'll have DataMarkerInfos on the API

Proposal 3: DataMarkerInfo.path becomes DataMarkerInfo.name

  • To me the utility of the path is not clear (it probably originated from FS provider)
  • Instead of the paths we store the name of the DataMarker (new terminology) in the tagged string.
  • If a client sees AndListV1Marker on the API, they can tell datagen to generate "AndListV1Marker" (currently they have to figure out that it's "list/and@1")
  • In error messages, the marker name is more useful than the path, because mapping this in reverse is basically impossible for clients (and I struggle with this as well)
  • Blob provider doesn't use the path
  • Baked provider uses it for macro names, but it should have just used the marker names
  • FS provider uses this for directory structure, but that's not a functional requirement; data can live at AndListV1Marker/en.json instead of list/and@1/en.json.
  • We might want to use full paths for the name (icu_list::provider::AndListV1Marker), FS provider can then do icu_list/AndListV1Marker/en.json

Proposal 4: Split DataMarkerInfo's fields (maybe)

pub trait DataMarker: DynamicDataMarker { 
  const NAME: DataMarkerName;
  const IS_SINGLETON: bool;
  const FALLBACK_CONFIG: LocaleFallbackConfig;
}
  • Having IS_SINGLETON closer to the type system could allow further optimisations
    • Might require pub trait SingletonDataMarker: DataMarker instead
  • The NAMEs can be optimised out if nothing uses them. The hash would live inside the name, so blob provider would still use the name, and static slicing would still work. However, in non-dynamic-data-provider binaries without debug logging, the names can be removed
  • In the dynamic code path, we'd need to replace the DataKeyInfo parameter by three parameters, so that's a downside to proposal

robertbastian avatar Jun 03 '24 15:06 robertbastian