icu4x
icu4x copied to clipboard
Data key thoughts
Current architecture:
pub trait DataMarker { ... }
pub trait DynamicDataProvider<M: DataMarker> { load(&self, DataKey, DataRequest) }
pub trait KeyedDataMarker: DataMarker { const KEY: DataKey; }
pub trait DataProvider<M: KeyedDataMarker> { load(&self, DataRequest) }
- Data keys do not appear anywhere in the
DataProviderpaths, they are only used inDynamicDataProviderBufferProvider/AnyProviderare functionallyDynamicDataProvider<BufferMarker/AnyMarker>
- Conceptually, they can be seen as
dyn KeyedDataMarker - Users never see data keys, in the API bounds they see
KeyedDataMarkers- Yet, we make them select data keys in datagen
Proposal 1: Align data marker names with data provider names
pub trait DynamicDataMarker { ... }
pub trait DynamicDataProvider<M: DynamicDataMarker> { load(&self, DataKey, DataRequest) }
pub trait DataMarker: DynamicDataMarker { const KEY: DataKey; }
pub trait DataProvider<M: DataMarker> { load(&self, DataRequest) }
Proposal 2: Get rid of "data key" terminology
- Instead, call it what it really is: an erased
DataMarker(new terminology)
pub trait DynamicDataMarker { ... }
pub trait DynamicDataProvider<M: DynamicDataMarker> { load(&self, DataMarkerInfo, DataRequest) }
pub trait DataMarker: DynamicDataMarker { const INFO: DataMarkerInfo; }
pub trait DataProvider<M: DataMarker> { load(&self, DataRequest) }
- This also includes replacing any "data key" terminology in datagen by "data marker", as we'll have
DataMarkerInfos on the API
Proposal 3: DataMarkerInfo.path becomes DataMarkerInfo.name
- To me the utility of the
pathis not clear (it probably originated from FS provider) - Instead of the paths we store the name of the
DataMarker(new terminology) in the tagged string. - If a client sees
AndListV1Markeron the API, they can tell datagen to generate"AndListV1Marker"(currently they have to figure out that it's"list/and@1") - In error messages, the marker name is more useful than the path, because mapping this in reverse is basically impossible for clients (and I struggle with this as well)
- Blob provider doesn't use the path
- Baked provider uses it for macro names, but it should have just used the marker names
- FS provider uses this for directory structure, but that's not a functional requirement; data can live at
AndListV1Marker/en.jsoninstead oflist/and@1/en.json. - We might want to use full paths for the name (
icu_list::provider::AndListV1Marker), FS provider can then doicu_list/AndListV1Marker/en.json
Proposal 4: Split DataMarkerInfo's fields (maybe)
pub trait DataMarker: DynamicDataMarker {
const NAME: DataMarkerName;
const IS_SINGLETON: bool;
const FALLBACK_CONFIG: LocaleFallbackConfig;
}
- Having
IS_SINGLETONcloser to the type system could allow further optimisations- Might require
pub trait SingletonDataMarker: DataMarkerinstead
- Might require
- The
NAMEs can be optimised out if nothing uses them. The hash would live inside the name, so blob provider would still use the name, and static slicing would still work. However, in non-dynamic-data-provider binaries without debug logging, the names can be removed - In the dynamic code path, we'd need to replace the
DataKeyInfoparameter by three parameters, so that's a downside to proposal
Initial thoughts:
- OK with Proposal 1
- This is a somewhat radical change to the mental model, but I can buy into it because I've never been a huge fan of the term "key", and you're right that "data marker" and "data key" refer to very similar concepts
- I agree with the problems you laid out, but I also feel that the directory-like structure has been useful to us. I think if we did Proposal 3 we should do the variant where we use the fully qualified marker name, maybe eliding the "provider" since it will be the same for all markers.
icu_list/AndListV1Marker - I think I do prefer inlining the fields to the
DataMarkertrait directly.
We could have both inlined const fields and a helper DataMarkerInfo
pub trait DataMarker: DynamicDataMarker {
const NAME: DataMarkerName;
const IS_SINGLETON: bool;
const FALLBACK_CONFIG: LocaleFallbackConfig;
}
#[non_exhaustive]
pub struct DataMarkerInfo {
pub name: DataMarkerName;
pub is_singleton: bool;
pub fallback_config: LocaleFallbackConfig;
}
impl DataMarkerInfo {
pub fn from_marker<M: DataMarker>() -> Self { /* ... */ }
}
pub trait DynamicDataProvider<M: DynamicDataMarker> { load(&self, DataMarkerInfo, DataRequest) }
But I guess if you do that you might as well just make DataMarkerInfo a const field, as in Proposal 3.
- @Manishearth - Overall I like merging the names. But the "Marker" name is overloaded.
- @sffc - Marker is a common name that's used for the type-system level objects in Rust. I think there's some risk of overloading "marker" but I'm not too worried. THere are ways to make clear whether a thing is a DataMarker or some other type of marker.
- @sffc - Maybe we can talk about the path structure.
- @robertbastian - https://github.com/unicode-org/icu4x/blob/main/provider/datagen/src/registry.rs
- @sffc - The Rust types already have a hierarchy. It would be nice to reflect that hierarchy in the string representation. But we can probably cut off the
icu::and the::provider::and then snake case the rest. - @robertbastian - I don't like modifying the rust path, because if you then get such a string in an error message, you still have to reverse the mapping, even if it's a more regular mapping than now
- @robertbastian - We can use the Rust
stringify!macro if we keep the exact same path. It also makes them more searchable. - @robertbastian - There's also https://github.com/unicode-org/icu4x/issues/4193
Summary of additional discussion later:
- @robertbastian values converging on a single identifier. The dueling identifiers (Rust type name and string path) harm understandability, usability, and searchability.
- @sffc is favorable toward using a single identifier. @Manishearth is neutral or slightly favors having two identifiers since they service different use cases.
- @sffc values hierarchy and brevity. The hierarchy enables use cases such as filtering and diagnostics, and it makes it easier and less error-prone to write custom datagen logic. The brevity directly impacts binary size.
- @sffc puts a higher value on these properties than on the Rust names and would prefer names such as
DateTime_Year_Chinese_V1which are both hierarchical and brief, despite not being conventional Rust names. @sffc argues that these are weird types that only appear in trait bounds so it is fine to establish a new paradigm for how we name them. "V1Marker" is already a paradigm; we're swapping one for another. - @robertbastian and @Manishearth put a higher value on Rust conventional type names. @robertbastian points out that filtering can still be implemented using custom application logic to parse the camel case. @Manishearth acknowledges that filtering gets more difficult.
- @robertbastian proposed
icu::datetime::data::year::ChineseV1as the identifier. @sffc is potentially okay with this compromise, since it has hierarchy and filterability, but is disappointed that it sacrifices brevity.
But the "Marker" name is overloaded
to expand on this a little bit (but not to rabbithole too deep because I don't care about "Marker" strongly): I find Marker to be a good default, but it's nice when it can be replaced with a useful concept name instead. I think we may have options here.
one thing I proposed in that discussion that isn't included here was encoding the heirarchy in the trait, so you can have something like:
impl DataMarker for DaySymbolsV1Marker {
const COMPONENT = "datetime";
const SUBCOMPONENT = Some("symbols");
}
Notes from the scratch pad:
pub fn try_new_unstable<D>(
provider: &D,
locale: &DataLocale,
options: CollatorOptions
) -> Result<Self, DataError>
where
D: DataProvider<CollationSpecialPrimariesV1Marker> + DataProvider<CollationDataV1Marker> + DataProvider<CollationDiacriticsV1Marker> + DataProvider<CollationJamoV1Marker> + DataProvider<CollationMetadataV1Marker> + DataProvider<CollationReorderingV1Marker> + DataProvider<CanonicalDecompositionDataV1Marker> + DataProvider<CanonicalDecompositionTablesV1Marker> + ?Sized,
pub fn try_new_unstable<D>(
provider: &D,
locale: &DataLocale,
options: CollatorOptions
) -> Result<Self, DataError>
where
D: DataProvider<Collation_SpecialPrimaries_V1> + DataProvider<Collation_Data_V1> + DataProvider<Collation_Diacritics_V1> + DataProvider<Collation_Jamo_V1> + DataProvider<Collation_Metadata_V1> + DataProvider<Collation_Reordering_V1> + DataProvider<Collator_CanonicalDecomposition_Data_V1> + DataProvider<Collation_CanonicalDecomposition_Tables_V1> + ?Sized,
pub fn try_new_unstable<D>(
provider: &D,
locale: &DataLocale,
options: FixedDecimalFormatterOptions
) -> Result<Self, DataError>
where
D: DataProvider<DecimalSymbolsV1Marker> + ?Sized
pub fn try_new_unstable<D>(
provider: &D,
locale: &DataLocale,
options: FixedDecimalFormatterOptions
) -> Result<Self, DataError>
where
D: DataProvider<Decimal_Symbols_V1> + ?Sized
Collation_CanonicalDecomposition_Tables_V1TimeZone_IanaMapper_ToBcp47_V2orTimezoneIanamapperTobcp47V1MarkerTimeZone_IanaMapper_ToIana_V1TimeZone_Metazone_Period_V1TimezoneIanaToBcp47V1,TimezoneBcpToIanaV1,icu::timezone::provider::iana_mapper::ToIanaV1Marker
Current: DangiDatePatternV1Marker + “datetime/patterns/dangi/date@1”
Shane Proposes: DateTime_Patterns_Date_Dangi_V1
Rob Proposes: DatetimePatternsDateDangiV1Marker, icu::datetime::provider::patterns::date::DangiV1Marker
where: P: DataProvider<patterns::date::DangiV1Marker> + DataProvider<patterns::date::GregorianV1Marker> ...
where: P: DataProvider<DateTime_Patterns_Date_Dangi_V1> + DataProvider<DateTime_Patterns_Date_Gregorian_V1> ...
icu::datetime::provider::patterns::date::DangiV1Marker
datetime/patterns/date/DangiV1Marker/en.json
datetime/patterns/date/dangi@1/en.json
Discussion:
- @Manishearth Users typically don't use the string key, but it has its use cases and might show up in errors from time to time. I think the status quo is fine. I don't think it's a good idea to make our data loading path depend on the Rust path. I want to be able to move things around in code without changing the data path. I think we've made decent choices in each case.
- @hsivonen Do the string keys get stored as the full path somewhere, so if the string key is even longer, we take a binary size hit?
- @robertbastian In the latest proposal, ... we use the hash more. The places these show up are in error messages and in fs.
- @robertbastian The reason I proposed this is that it is confusing to have to IDs for the same thing. They see in docs
ListV1Markerbut in datagen they need to selectlist/and@1. I fixed this by accepting both formulations in datagen. But they still see the path3 in errors. - @sffc We could/should do an audit of our marker and path names. We've landed them but never really took a holistic view. Then we could establish a convention between the Rust IDs and the string IDs.
- @sffc If datagen accepts marker names, then the marker names are effectively stable.
- @robertbastian Doesn't seem 2.0 critical, except for the audit of the paths.
- @robertbastian I think we should consider anything around markers to be unstable.
Some proposals:
- Proposal A: Consider marker names to be stable. They can't be renamed in minor releases. They are accepted in datagen.
- Proposal B: Consider marker names to be unstable. They can be renamed as we like. The paths remain stable. They are accepted in datagen because if you use the marker names, you are subject to their stability policy.
- Proposal C: Same as Proposal B but reject marker names in datagen.
More discussion:
- @robertbastian We've never needed to rename a marker. I don't see why we should make our UX worse to reserve this possibility.
- @sffc I'm leaning toward Proposal B but not actually renaming anything until we do a proper audit, and in some future release we can do that audit. If we do it in say 2.2, then the 2.2 release notes say so.
- @robertbastian People don't usually read point release notes. Things just shouldn't break. If the audit is not important enough for 2.0, maybe it's just not important
- @sffc Data slicing and data providers are part of the ICU4X value proposition, so we should make them nice. If we had more resources, I would like to do it in 2.0, but I agree it's a lower priority.
- @Manishearth If you're using them, you're using them. No one looks at them holistically other than us.
- @robertbastian We should at least do an audit in datetime because they live 5 traits away and the documentation should explain how to figure out which markers you'll need
Conclusion:
- Don't rename markers in 2.x. Accept them in datagen.
- Could revisit in 3.0 but it might not be a high priority even then.
- Perform an audit on neo datetime markers.
- Add this auditing to the graduation checklist. For now it is a vague requirement; could evolve into something more specific in the future.
LGTM: @sffc @robertbastian @Manishearth
I am assigned this issue for 2.0 to do an audit of datetime markers. [^1]
The audit, whether in icu_datetime or elsewhere, brings back the data marker naming discussion where we are at an impasse (the lack of an option that has no veto); see https://github.com/unicode-org/icu4x/issues/5276. I do not know how I am expected to perform the audit without having a consensus on how to name things.
I would like to therefore propose the following rules which largely encode the status quo but make a few changes that better enforce consistency and circumvents the casing controversy by avoiding types that would trigger it, like we did for datetime field sets.
| Thing | Description | Naming Rule | Example: icu_decimal | Example: icu_datetime |
|---|---|---|---|---|
| Data struct | A data struct that is the Yokeable of a data marker |
Based on the data it holds and/or the context in which it is used, with version number, without crate namespace | SymbolsV1 | MonthNamesV1 |
| Helper type | A struct or enum that is a field of a data struct but is not itself the Yokeable of a data marker |
Based on the data it holds and/or the context in which it is used, without version number, with optional crate namespace | DecimalSymbolStrs | PackedPluralElements |
| Data marker | A non-exhaustive unit struct used in try_new_unstable function bounds |
Based on the associated data struct. No crate namespace. Suffixed with "Marker". May add an infix after the version number if there are multiple markers using the same data struct. Usually exported directly from the provider module in the crate. |
SymbolsV1Marker | MonthNames V1GregorianMarker |
| Data path | A string associated with a data marker which forms the basis for the hash key and filesystem paths | Directly derived from the data marker name by converting it to snake case, changing V1Marker to @1, and adding the crate name prefix without icu_. If there is an infix (which we can detect by comparing the data struct name to the data marker name), the infix is separated by a slash instead of an underscore in the path. |
decimal/symbols@1 | datetime/month_names/ gregorian@1 |
| Bounds in ctors | The type that is written in the bound of try_new_unstable and shows up in docs |
If the marker is in the same crate as the ctor, import it. If the marker is in a different crate, use the fully qualified path. | icu_decimal::provider:: SymbolsV1Marker (fully qualified only if cross-crate) |
icu_datetime::provider:: MonthNames V1GregorianMarker (fully qualified only if cross-crate) |
| Datagen arguments | How data markers can be specified as arguments to icu4x-datagen | Path names only. No marker names. | decimal/symbols@1 | datetime/month_names/ gregorian@1 |
There's a lot to digest in that table, but I think we can only have a productive discussion holistically.
I'll highlight a few key changes in this proposal:
- Marker names no longer have a crate prefix. This is done to circumvent the casing controversy.
- As a result, marker names are no longer accepted in icu4x-datagen, rolling back https://github.com/unicode-org/icu4x/pull/5060
- A handful of types will undergo name changes in this proposal
Obviously @robertbastian isn't here to give feedback on this, but I don't know any other solution that is actionable. The alternative is "drop the datetime audit and leave everything as-is to revisit in 3.0", acknowledging that the status quo is a bit of a mess but that the TC was unable to reach consensus.
[^1]: There aren't that many data markers. The vast majority are the ones in the datetime crate, which we already slated for the audit, and the properties markers, which are generated by macros. So, I'm volunteering to do the audit for all crates in 2.0.
I'm in favor of this. I don't necessarily think we need to require qualification for cross crate stuff. Currently because of provider it will lead to long icu_decimal::provider::SymbolsV1 imports which isn't ... great. But it's ultimately fine.
I'm in favor of this. I don't necessarily think we need to require qualification for cross crate stuff. Currently because of
providerit will lead to longicu_decimal::provider::SymbolsV1imports which isn't ... great. But it's ultimately fine.
Part of my thinking was:
- We don't have that many cross-crate bounds
- The cross-crate bounds only show up in
try_new_unstable<P>(...) where P: ..., not intry_new() with crate::provider::Baked: ... - It's kind-of nice that cross-crate is more explicit since it is more likely to require special handling, such as the user needing to fork between multiple bake providers
Thought: we could make the infixed marker names be MonthNamesV1GregorianMarker. Then the path could always be derived directly from the marker name and vice-versa, without having to know the data struct name.
icu_foobar::provider::HelloWorldV1Marker <=> foobar/hello_world@1
icu_foobar::provider::HelloWorldV1LoremIpsumMarker <=> foobar/hello_world/lorem_ipsum@1
In other words, it would be possible to write a fn that maps from type_name::<M>() to the path name and vice-versa. I think this is one of the things @robertbastian would have liked to have seen in a solution here.
@sffc "we don't have many cross crate bounds" sells me. And yeah I don't mind the explicitness too much.
Thought: we could make the infixed marker names be MonthNamesV1GregorianMarker. Then the path could always be derived directly from the marker name and vice-versa, without having to know the data struct name.
Yeah this makes a lot of sense to me! I think we prefixed it because we always did, but I actually don't see a strong reason to prefix.
Yeah this makes a lot of sense to me! I think we prefixed it because we always did, but I actually don't see a strong reason to prefix.
Just highlighting that my last reply above was to put the infix between the "V1" and the "Marker"?
The rule for naming a data marker would be: [Struct]V[Version][Infix]Marker
If we do [Struct][Infix]V[Version]Marker, it is not as easy to figure out what part of the name is Struct and which part is Infix, which matters for the path conversion.
Just highlighting that my last reply above was to put the infix between the "V1" and the "Marker"?
Oh, I see. Hmmmmm. I'd rather have it always be V1Marker but I'm not that picky about it.
Just highlighting that my last reply above was to put the infix between the "V1" and the "Marker"?
Oh, I see. Hmmmmm. I'd rather have it always be V1Marker but I'm not that picky about it.
I had the thought, "isn't it nice to search for V1Marker and find everything", but then I realized that there are V2Marker, V3Marker, etc. With this new naming convention, you can still grep for V\d\w*Marker.
Also, a possible advantage of this naming scheme is that the data struct name is fully a substring of the data marker name. If the data struct is SymbolsV1, you either stuff Marker or InfixMarker at the end of it, but the prefix will always be the data struct name.
WG discussion:
- @Manishearth I like
V1Markerbeing a thing, but the point about it becomingV2Markeris valid. So splitting it in the middle is fine. We have a good scheme where everything is in provider.rs, so finding markers is not hard. - @sffc Would appreciate at least one more review on the naming scheme before I go all in on it.
- @echeran I can take a look.
- @hsivonen I can maybe look but it would be next week.
I talked this over with @echeran today. He pushed back on the position from https://github.com/unicode-org/icu4x/issues/4991#issuecomment-2350760637 that we can't just use the Rust code path as the marker name and as the data path.
One other thing is that the whole MonthNamesV1CopticMarker idea, putting the key namespace after the V1, is really growing on me. It is the "coptic marker" for the data struct named MonthNamesV1. It is easier to parse as a human, not only as a machine. I've spent a lot of time over the years clicking into the docs of markers to see what the data struct is supposed to be, and this solves that problem.
In other words: something like DateTimeCopticMonthNamesV1Marker is descriptive, but it does not help me as an ICU4X developer (or as a client) figure out what it actually is in code.
@echeran also astutely asked, why are we even discussing this? What problem are we trying to solve?
The problem we are trying to solve is making a clear framework for how to name these types which are public API but not in the spotlight. For types that are in the spotlight, we spend a lot of bandwidth bikeshedding. I want to reduce as much as possible the bandwidth we spend bikeshedding data marker names and paths, but also still have a cohesive model that works with datagen and for clients who need custom data.
I focus on baked data, so I feel I don't have enough experience with using the marker-using API entry points from code that uses ICU4X as a library.
My high-level comments are:
- I think we should minimize the need to carry or compare long strings in run-time binaries (i.e. should prefer small strings both due to binary size and for comparison execution time).
- The trait bounds look intimidating (see https://github.com/unicode-org/icu4x/issues/4890 from the developer of https://lib.rs/ who clearly knows Rust), so changes that would make the bounds easier to understand would be welcome.
- "key" vs. "marker" as terminology is probably not a helpful distinction for users of the API.
- Adding underscores to the names probably make things even stranger.
- To the extent we have marker type names and path strings, making the correspondence between the two obvious is helpful. (But see the first point of this list.)
- Calling the dynamic option
Dynamicprobably helps with seeing that there is run-time overhead. - If any of this is relevant to avoiding overhead in Diplomat FFI in the baked case, that's worth considering.
One other thing is that the whole
MonthNamesV1CopticMarkeridea, putting the key namespace after theV1, is really growing on me. It is the "coptic marker" for the data struct named MonthNamesV1. It is easier to parse as a human, not only as a machine. I've spent a lot of time over the years clicking into the docs of markers to see what the data struct is supposed to be, and this solves that problem.
I find this compelling.
I think we should minimize the need to carry or compare long strings in run-time binaries (i.e. should prefer small strings both due to binary size and for comparison execution time).
I'm open to the idea of storing only the 4-byte hash in the binary instead of the full path string. FsDataProvider would need some workarounds, but that's fine.
@robertbastian Do you agree with https://github.com/unicode-org/icu4x/issues/4991#issuecomment-2504732608?
I updated it to include the change I suggested in https://github.com/unicode-org/icu4x/issues/4991#issuecomment-2505073780
- @robertbastian We don't need
V1on data structs. Markers, yes. If a version number changes, we can rename the old structLegacyor something. - @manishearth if we have a MonthNames struct we can go from
GregorianMonthNamesV1MarkertoGregorianMonthNamesV2Markerand it doesn't really matter if it points toMonthNames(but different) orFancyMonthNamesor whatever. - @robertbastian If we don't have
V1on the struct, then we could dropMarkerfrom the marker as everything with aVnwould be a data marker - @sffc This sounds pretty good, but it doesn't solve the one issue I've been trying to solve: making the path algorithmically derivable from the marker name. For example, how do we know to transform
icu::datetime::provider::MonthNamesBuddhistV1todatetime/month_names/buddhist@1 - @robertbastian Maybe we just always use slashes, like
datetime/month/names/buddhist@1/3/en.json. Or just use the marker name directly, likedatetime/BuddhistMonthNamesV1/3/en.json. - @sffc There's also the filtering requirement. I'm happy with
datetime/month/names/buddhist@1/3/en.json. I'd also be happy withdatetime/month_names_buddhist@1/...ordatetime/month_names_buddhist_v1/.... - @robertbastian - This is a datagen only requirement. I think we should be talking about this in terms of marker names, not paths (we don't really care how the paths look on the FS). You can write a regex
MonthNames.*V1or.*MonthNamesV1 - @sffc In my model,
MonthNamesV1JapaneseExtendedMarkerbecomesdatetime/month_names/japanese_extended@1which can be distinguished fromdatetime/month_names/japanese@1via regex filtering. - @robertbastian - it's weird than one marker has a prefix fragment of another marker. The other marker should just be
MonthNamesJapaneseModernV1 - @manishearth you can do that with
MonthNamesV1JapaneseExtendedMarkerjust as well. - @manishearth: to be clear: my proposal is that we no longer have paths. Fs uses
cratename/MarkerName, and we pick marker names by some scheme that allows for filtering with regexes and/or prefix matching. I'm okay with V1 living anywhere in it, I'm okay with the no-prefix-marker requirement, I'm mildly in disfavor of removing Marker but won't oppose it.
Proposal:
- Data struct names do not have
V1in them. Data struct names are unstable, and when introducing changes we may rename the data structs for the transition period, such asFooLegacyorFooNeo. - Data markers need not match data struct names, but can use data struct names for inspiration.
Markergoes away andV{n}goes at the end of the marker name.- The data marker name is used directly in FS provider. There is no concept of a path. (
datetime/MonthNamesBuddhistV1/3/en.json) - Data marker variations, such as
BuddhistandGregorian, must be distinguishable by a regex substring match. This means that one should not be a prefix of another. For example,Japanesemust be renamedJapaneseModernin order to not clash withJapaneseExtended.
LGTM: @robertbastian @sffc @manishearth
To reiterate part of https://github.com/unicode-org/icu4x/issues/4991#issuecomment-2504732608 that we didn't discuss:
I proposed that:
- Data markers do not have the crate name in their name
- The crate name, without
icu_, is included in the data marker path - In constructor signatures, markers from the current crate are imported, and markers from different crates are fully qualified (note: I believe rustdoc will also reflect this)
Examples of the implications of this:
icu_decimal::provider::DecimalSymbolsV1becomesicu_decimal::provider::SymbolsV1icu_list::provider::ListAndV1becomesicu_list::provider::AndV1
To reiterate my positions:
- The data path used in
icu4x-datagenCLI should be suitable for filtering with regexes or glob patterns - The FS layout should be the same as what is used in
icu4x-datagenCLI
Why I therefore conclude that we need namespaces:
- It allows filtering by component
- Better filesystem layout
A less strongly held, but perhaps more compelling, reason to prefer namespaces is that it significantly reduces the burden of bikeshedding specific marker names. If marker names are namespaced, I don't need to care too much when graduating a new component that has a marker with a potentially ambiguous name such as NamesV1. But if they are not namespaced, it adds yet another thing we need to bikeshed during graduation which carries more weight.
For example, one might conclude that a marker such as RangePatternV1 being added to icu_decimal is "safe": by any taste test, it seems fairly innocuous. However, then someone tries adding RangePatternV1 to icu_datetime, and we get stuck.
(note: I believe rustdoc will also reflect this)
nope: https://unicode-org.github.io/icu4x/rustdoc/icu/experimental/duration/struct.DurationFormatter.html#method.try_new_unstable
Why I therefore conclude that we need namespaces:
- It allows filtering by component
- Better filesystem layout
Both of this is possible with a policy that marker names start with the component name.
The problem with component level namespacing is the same as with the discussed hierarchical data markers: the name of a marker doesn't give you the full identity, you always have to pass additional data. This is annoying to do on the datagen side, and impossible in Rustdoc
nope
😦 Haven't looked if there's a workaround. Agreed this is unfortunate. Seems like a tooling problem, though, which could be fixed.
Both of this is possible with a policy that marker names start with the component name.
Not quite, because I can't simply rm -r data/datetime to get rid of datetime data. This is something I've been able to do for years. I guess I could rm -r data/DateTime* so long as we create more policies involving crate naming. DateTimeSymbolsV1 could be "time symbols in the date crate" or "symbols in the datetime crate". We'd end up with a web of policies which need to be enforced by the core ICU4X team:
- Data marker name must start with component name
- Component name cannot be ambiguous (share prefix) with any other component name
- If there is an infix (like Buddhist), it cannot be ambiguous with any other infix
But, I still strongly prefer a solution that delivers a clean, clear, unambiguous taxonomy of components, markers, and marker variants. These policy-based solutions seem extremely suboptimal to me.
Not quite, because I can't simply rm -r data/datetime to get rid of datetime data. This is something I've been able to do for years. I guess I could rm -r data/DateTime*
FS can pull the first word into its own level if we decide to adopt the naming policy
Also keep in mind that the prefixes you've been using have not been accurate, we've had timezone prefixes in datetime, datetime prefixes in calendar, properties prefixes in locale, etc. It worked well enough for development hacking in the past and will work well enough for development hacking in the future.
Some discussion between @robertbastian @Manishearth and @sffc
Generally, we all seem to agree on having at least a three-level ontology of (module)/(type name)/(variant) where some of these can be empty. In other words, there is general agreement that:
- Markers should be organized by a rough set of "modules"
- Markers that are for the same data struct type probably should share a part of their name, and have some suffix indicating which variant of the data it is (e.g. Gregorian vs JapaneseModern
We generally agree on the existence of the following use cases for a marker ontology that shows up in the marker identifier
- Filtering data markers via regex or some other mechanism
- Visually sanity-checking that the right markers are included
- Being helpful signposts for users who need to delve through and select their own markers (see https://github.com/unicode-org/icu4x/issues/2685, we have many users who cannot use keyextract and need to deal with markers directly)
The relative importance of each of these use cases is not necessarily agreed upon.
Listing various individual positions/beliefs/axioms (please let me know if I've missed something):
@sffc:
- Visually scanning CamelCaseIdentifiers for subsets is more tedious than
slash/separated,Unicode_Case, and other things with explicit delimeters - "ICU4X should have a clear ontology for its data markers", where clarity includes visual/human-readable clarity of the identifiers
- There is some value in doing arbitrary levels of ontology based on the type: it is valuable to be able to distinguish
datetime/names/monthanddatetime/names/yearfromdatetime/pattern
@robertbastian:
- The way the Rust identifier shows up in the docs is important. There should not be bounds like
+ DataProvider<SymbolsV1> + DataProvider<SymbolsV1>showing up in docs because we have twoSymbolsV1types in different crates whose full identifiers areicu_decimal::SymbolsV1andicu_datetime::SymbolsV1. - Does not wish to use
Unicode_Casein Rust identifiers - Wishes to be able to move data markers between crates
- Do not wish there to be any auxiliary information for data marker identity that needs to be passed around aside from the marker type name
@Manishearth
- Do not want there to be multiple different identifiers for the same key
- Moderately strongly: multiple different identifiers that are different in ways that are not an algorithmic two-way mapping (e.g.
datetime/month/names/japaneseextendedvsDateTimeMonthNamesJapaneseExtended) - Weakly: multiple different identifiers in any form (e.g. having a slash-based identifier at all)
- No dislike of optional redundant prefixes, e.g. allowing both
icu_datetime::DateTimeMonthNamesGregorianV1and bareDateTimeMonthNamesGregorianV1.
- Moderately strongly: multiple different identifiers that are different in ways that are not an algorithmic two-way mapping (e.g.
- Does not wish to use
Unicode_Casein Rust identifiers as it will be surprising and attention-grabbing in an unnecessary way.
Some additional observations made:
- Rustdoc always shows the types
- Our crate heirarchy is not necessarily the best "coarse grained top level ontology", though it is a reasonable default. It is driven by
- There are multiple different ways to wish to slice things, it is not necessarily true for a crate that a user will wish to separate
datetime/names/*fromdatetime/patternsbut a different user won't wish to separatedatetime/names/monthanddatetime/patternsfromdatetime/names/year
Some designs that were discussed:
- The identifier is the fully qualified path to the marker
icu_datetime::MonthNamesJapaneseExtendedV1- downside: can be ambiguous in docs
- The identifier is
DatetimeMonthNamesJapaneseExtendedV1- downside: does not scan well for Shane, potentially has prefix/suffix matching issues for filtering. Those can be fixed via policy around not having e.g.
JapanesevsJapaneseExtendedand instead doingJapaneseModern
- downside: does not scan well for Shane, potentially has prefix/suffix matching issues for filtering. Those can be fixed via policy around not having e.g.
- The identifier is
DateTime_MonthNames_JapaneseExtendedV1and also the type name- downside: this is not Rust style, Robert/Manish opposed
There were a lot more, these were some interesting ones.
A path forward we all seemed to be okay with was the following:
- The ontology supports an arbitrary number of levels, chosen at a per-data-marker level.
- In CamelCase, each segment is a "level". It's considered fine that JapaneseModern and JapaneseExtended will share a level, and that they will nest an additional level compared to Buddhist.
- DataMarker Debug output shows identifiers both as
DatetimeMonthNamesJapaneseExtendedV1anddatetime/month/names/japanese/extended@v1. Both types of identifiers can be used in datagen/etc. This is a purely programmatic transform.
The proposed solution still doesn't taste good to me; there are the following two requirements I have that aren't stated in the previous post, both of which are mentioned earlier in this thread:
- I prefer avoiding policy-based solutions in order to keep the identifiers being an ontology.
- The required system, that capital letters in the camelcase identifier are level separators, is hard to understand both for me and for anyone in the future contributing to ICU4X.
- I prefer having just one identifier for a marker.
- Previously we had two: a path string and the Rust type, which are independent. The new proposal essentially makes the path string derived from the Rust type.
- I think the path being derived from the Rust type is an improvement, but it would be nice if the outcome of all this work reduced the number of identifiers down to one.
- @Manishearth has this listed, but it isn't listed for me.
These two points are not satisfied by the proposal.