icu4x DateTimeFormat CombinedPattern

Fixes #1131.

Nov 17 '21 23:11 zbraniecki

This is a plow-through draft of how a CombinedPattern may look like. I got it to the point where I am able to run work_log and criterion datetime/datetime_lengths and am getting results and can verify that for work_log I am seeing CombinedPattern with three ZeroVec::Borrowed for date/time/generic.

The overall architecture seems decent and I am ready to start carving out PRs for review that overall may improve the experience, but for the combined pattern itself, I'm not sure if the change is worth it.

The memory diff is from:

[memory] > dhat: Total:     18,818 bytes in 94 blocks
[memory] > dhat: At t-gmax: 18,450 bytes in 84 blocks
[memory] > dhat: At t-end:  1,024 bytes in 1 blocks

to:

[memory] > dhat: Total:     18,659 bytes in 92 blocks
[memory] > dhat: At t-gmax: 18,411 bytes in 83 blocks
[memory] > dhat: At t-end:  1,024 bytes in 1 blocks

and for criterion from:

datetime/datetime_lengths                        
                        time:   [78.778 us 78.838 us 78.900 us]
                        change: [-0.0704% +0.0808% +0.2280%] (p = 0.29 > 0.05)

to:

datetime/datetime_lengths                        
                        time:   [86.909 us 87.470 us 88.559 us]
                        change: [+10.670% +11.391% +12.932%] (p = 0.00 < 0.05)

I believe that if we are to continue on this path I'd need a peer to analyze the counter-intuitive results and validate them or find a flaw in my implementation.

Nov 17 '21 23:11 zbraniecki

@sffc to review and @gregtatum to run a profile on the before and after.

Nov 18 '21 19:11 sffc

I can validate that this patch saves 159 bytes:

before: https://share.firefox.dev/3kRdluX after: https://share.firefox.dev/3oPM9Oq

icu_datetime::pattern::runtime::generic::GenericPattern::combined is removed.

That would be this line here:

https://github.com/unicode-org/icu4x/pull/1311/files#diff-de8ea507b9257d7f7d1ae0acd2b49b7b391c3e56302f8e2c729cc8518eee5469L49

Nov 19 '21 14:11 gregtatum

There is a huge amount of variance in the tests, which looking at it, it appears that malloc is the issue. I changed the benchmark to be single threaded so that it was easier to profile. It was tough to find anything conclusive, but I did find 70 samples that were clearly different in the "after" profile.

Here are the profiles:

Before 1: https://share.firefox.dev/3oGIked 1426 samples Before 2: https://share.firefox.dev/3cs6bZd 1397 samples After 1: https://share.firefox.dev/3DL5Ixe 1537 samples After 2: https://share.firefox.dev/3cwrSYa 1461 samples

Here are the 70 samples I found in the "after" that were new: https://share.firefox.dev/3oIjDhw

Each roughly had half of the self time of the 70 samples. icu_datetime::pattern::item::ule::::from_unaligned icu_datetime::pattern::runtime::iter::PatternKindItemIterator as core::iter::traits::iterator::Iterator>::next

Nov 19 '21 14:11 gregtatum

Interesting! Thank you! So basically, in this case, our tests forces more ULE operations. I'm wondering why, maybe I do ULE/de-ULE conversions unnecessairly?

Nov 19 '21 15:11 zbraniecki

A few notes:

Although 18,818 to 18,659 seems like a small fraction, the fraction will be quite a bit bigger after #1058 is merged.

From Greg's flame graphs, it seems that the main change is that previously you were calling

core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &mut F>::call_once
<icu_datetime::format::datetime::FormattedDateTime<T> as core::fmt::Display>::fmt
core::fmt::write
<core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::fold
criterion::routine::Routine::test
criterion::benchmark_group::BenchmarkGroup<M>::bench_function
datetime::datetime_benches::{{closure}}

but now you are calling

<icu_datetime::pattern::runtime::iter::PatternKindItemIterator as core::iter::traits::iterator::Iterator>::next
<icu_datetime::format::datetime::FormattedDateTime<T> as core::fmt::Display>::fmt
core::fmt::write
<core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::fold
criterion::routine::Routine::test
criterion::benchmark_group::BenchmarkGroup<M>::bench_function
datetime::datetime_benches::{{closure}}

I don't know what the FnOnce from the first sample is supposed to be.

Finally, perhaps the biggest thing that surprises me from the flame graph is that Postcard is calling malloc while deserializing. I do not know why it is doing this. It's almost as if the data is being deserialized into an Owned pattern instead of a Borrowed pattern, which could explain some of the reported performance differences, if we are now changing to use more borrowing.

To be clear, I'm talking about this:

alloc::raw_vec::RawVec<T,A>::allocate_in
<serde::de::impls::ArrayVisitor<[T; 12]> as serde::de::Visitor>::visit_seq
<&mut postcard::de::deserializer::Deserializer as serde::de::Deserializer>::deserialize_struct
<&mut postcard::de::deserializer::Deserializer as serde::de::Deserializer>::deserialize_struct
<&mut postcard::de::deserializer::Deserializer as serde::de::Deserializer>::deserialize_struct
<icu_provider_blob::static_data_provider::StaticDataProvider as icu_provider::data_provider::DataProvider<M>>::load_payload
icu_datetime::datetime::DateTimeFormat::try_new
<core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::fold
criterion::routine::Routine::test
criterion::benchmark_group::BenchmarkGroup<M>::bench_function
datetime::datetime_benches::{{closure}}

This is pre-existing, but it may explain some of the unintuitive results.

Nov 20 '21 05:11 sffc

All committers have signed the CLA.

Sep 01 '22 06:09 CLAassistant

icu4x icu4x copied to clipboard

DateTimeFormat CombinedPattern

icu4x
icu4x copied to clipboard