icu4x icon indicating copy to clipboard operation
icu4x copied to clipboard

Zero-copy datetime skeletons (DateSkeletonPatternsV1, datetime/skeletons@1)

Open Manishearth opened this issue 3 years ago • 11 comments

part of https://github.com/unicode-org/icu4x/issues/876, part of https://github.com/unicode-org/icu4x/issues/856

DateSkeletonPatternsV1 is one of our final remaining non-zero-copy types. It's a rather complicated tree:

pub struct DateSkeletonPatternsV1<'data>(
    pub LiteMap<SkeletonV1, PatternPlurals<'data>>,
);

pub struct SkeletonV1(pub Skeleton);
pub struct Skeleton(pub(crate) SmallVec<[fields::Field; 5]>);

// Has a ULE type already
pub struct Field {
    pub symbol: FieldSymbol,
    pub length: FieldLength,
}

pub enum PatternPlurals<'data> {
    /// A collection of pattern variants for when plurals differ.
    MultipleVariants(PluralPattern<'data>),
    /// A single pattern.
    SinglePattern(Pattern<'data>),
}

pub struct PluralPattern<'data> {
    /// The field that 'variants' are predicated on.
    pivot_field: Week,
    pub(crate) zero: Option<Pattern<'data>>,
    pub(crate) one: Option<Pattern<'data>>,
    pub(crate) two: Option<Pattern<'data>>,
    pub(crate) few: Option<Pattern<'data>>,
    pub(crate) many: Option<Pattern<'data>>,
    pub(crate) other: Pattern<'data>,
}

pub struct Pattern<'data> {
    pub items: ZeroVec<'data, PatternItem>,
    pub(crate) time_granularity: TimeGranularity,
}

There are two parts to this: firstly, the skeleton needs to be made zero-copy, and then PatternPlurals. Both need to be VarULE or ULE to work inside a ZeroMap.

Skeleton

Skeleton seems easy, we replace Skeleton with #[make_varule] struct Skeleton(ZeroVec<'data, Fields>). I'm a bit worried that this will make lookup rather slow (since Ord will be slower): but also as far as I can tell, we never use the LiteMap as an actual map at use time, we only iterate the map in order.

Another option is to replace Skeleton with the unparsed skeleton string.

PatternPlurals

I'd rather not write a custom ULE type here, but ultimately, we can. I do think, however, we can get most of the benefits by restructuring this type a bit.

Basically, this type can be structured as

#[make_varule]
struct  PatternPlurals<'data> {
    pivot_field: Option<Week>, // this needs a ULE impl for Option, which can be done.
    patterns: VarZeroVec<'data, VarZeroSlice<Option<Pattern>>>,
}

#[make_varule]
pub struct Pattern<'data> {
    pub(crate) time_granularity: TimeGranularity,
    pub items: ZeroVec<'data, PatternItem>,
}

We will need an AsULE implementation on Option<T: AsULE> as well as a VarULE implementation on Option<T: VarULE> in cases where we can guarantee that T never has length zero (this can be done with an additional trait).

Then, PatternPlurals::patterns becomes a vector that must have at least one non-None element in the beginning, and the rest of the elements can be nonexistent or None (or we can guarantee that it either has length 1 or length 6).

Thoughts?

Feedback requested:

  • [x] @sffc
  • [x] @zbraniecki
  • [x] @nordzilla
  • [x] @gregtatum

Manishearth avatar Mar 08 '22 22:03 Manishearth

I wish we could avoid making PatternPlurals a ULE.

This part of the datetime provider may be changing as part of #1317. In that issue, the skeletons will no longer be an unlimited map; there may be a limited number of possible skeletons, such that we can just make a big struct and let Postcard do the work instead of ZeroVec.

As a general rule, if we start having very deeply nested ZeroMaps, we should rethink the architecture of the data payload. Putting more into the Postcard part of the deserialization is also better for CrabBake, since CrabBake can bypass Postcard but it can't bypass ZeroVec.

sffc avatar Mar 08 '22 22:03 sffc

As a general rule, if we start having very deeply nested ZeroMaps, we should rethink the architecture of the data payload. Putting more into the Postcard part of the deserialization is also better for CrabBake, since CrabBake can bypass Postcard but it can't bypass ZeroVec.

I strongly agree with this. Though one hazard with making it a big struct is that adding fields in the future is a breaking change. We could still make that work with an array but we're back to nesting zerovecs and needing ULEs.

I accidentally posted this issue before I was done but I was considering advocating for a case where we do more of the parsing at formatter construction time instead of mandating it be zero-copy. But if this is changing in https://github.com/unicode-org/icu4x/issues/1317, that works too.

If we're going for a big struct, we can block this on #1317 and move on. @gregtatum , do you have an idea of how long #1317 will take? Moving everything to zero-copy does block CrabBake, and while we can have components opt out of CrabBake I think it would be nice if DTF didn't need to.

Manishearth avatar Mar 08 '22 22:03 Manishearth

Moving everything to zero-copy does block CrabBake, and while we can have components opt out of CrabBake I think it would be nice if DTF didn't need to.

My preference for a short-term solution, if we need it, would be to put skeletons behind a feature flag, so that DTF with default features is zero-copy.

sffc avatar Mar 08 '22 23:03 sffc

Yeah I think I can just make it so that CrabBake's provider is unable to provide certain keys

Manishearth avatar Mar 09 '22 23:03 Manishearth

Another option is to replace Skeleton with the unparsed skeleton string.

I'm not sure with Zero copy if this adds anything, as you would then have to parse the skeleton, and do similar validation work. I'm not as familiar with the implementation details of zero copy to really speak with expertise on it though.

gregtatum avatar Mar 10 '22 15:03 gregtatum

I'm not sure with Zero copy if this adds anything, as you would then have to parse the skeleton, and do similar validation work. I'm not as familiar with the implementation details of zero copy to really speak with expertise on it though.

Yep; we don't get any zc perf benefit, it just enables crab-bake (which requires everything to be zerocopy)

Manishearth avatar Mar 10 '22 17:03 Manishearth

I would also prefer to not go down the "data zero-copy but we allocate afterwards" route though

Manishearth avatar Mar 10 '22 17:03 Manishearth

that looks good to me. I'd suggest making sure that the PatternPlurals encapsulation is really good so that even internal consumers never think of it as an array and never have to reason about which plural category 3rd element is.

zbraniecki avatar Mar 10 '22 18:03 zbraniecki

it just enables crab-bake

In that case that seems fine to me.

gregtatum avatar Mar 14 '22 15:03 gregtatum

LGTM after reading the comments. I don't think I have anything else to add.

nordzilla avatar Mar 18 '22 21:03 nordzilla

This needs to be removed

robertbastian avatar Jun 14 '24 09:06 robertbastian

The data has been removed, and I'm removing the marker in https://github.com/unicode-org/icu4x/pull/5612

To track a longer-term home for these things, I created https://github.com/unicode-org/icu4x/issues/5611

sffc avatar Sep 28 '24 01:09 sffc