icu4x
icu4x copied to clipboard
DateTimeFormat CombinedPattern
Fixes #1131.
This is a plow-through draft of how a CombinedPattern may look like. I got it to the point where I am able to run work_log and criterion datetime/datetime_lengths and am getting results and can verify that for work_log I am seeing CombinedPattern with three ZeroVec::Borrowed for date/time/generic.
The overall architecture seems decent and I am ready to start carving out PRs for review that overall may improve the experience, but for the combined pattern itself, I'm not sure if the change is worth it.
The memory diff is from:
[memory] > dhat: Total: 18,818 bytes in 94 blocks
[memory] > dhat: At t-gmax: 18,450 bytes in 84 blocks
[memory] > dhat: At t-end: 1,024 bytes in 1 blocks
to:
[memory] > dhat: Total: 18,659 bytes in 92 blocks
[memory] > dhat: At t-gmax: 18,411 bytes in 83 blocks
[memory] > dhat: At t-end: 1,024 bytes in 1 blocks
and for criterion from:
datetime/datetime_lengths
time: [78.778 us 78.838 us 78.900 us]
change: [-0.0704% +0.0808% +0.2280%] (p = 0.29 > 0.05)
to:
datetime/datetime_lengths
time: [86.909 us 87.470 us 88.559 us]
change: [+10.670% +11.391% +12.932%] (p = 0.00 < 0.05)
I believe that if we are to continue on this path I'd need a peer to analyze the counter-intuitive results and validate them or find a flaw in my implementation.
@sffc to review and @gregtatum to run a profile on the before and after.
I can validate that this patch saves 159 bytes:
before: https://share.firefox.dev/3kRdluX after: https://share.firefox.dev/3oPM9Oq
icu_datetime::pattern::runtime::generic::GenericPattern::combined is removed.
That would be this line here:
https://github.com/unicode-org/icu4x/pull/1311/files#diff-de8ea507b9257d7f7d1ae0acd2b49b7b391c3e56302f8e2c729cc8518eee5469L49
There is a huge amount of variance in the tests, which looking at it, it appears that malloc is the issue. I changed the benchmark to be single threaded so that it was easier to profile. It was tough to find anything conclusive, but I did find 70 samples that were clearly different in the "after" profile.
Here are the profiles:
Before 1: https://share.firefox.dev/3oGIked 1426 samples Before 2: https://share.firefox.dev/3cs6bZd 1397 samples After 1: https://share.firefox.dev/3DL5Ixe 1537 samples After 2: https://share.firefox.dev/3cwrSYa 1461 samples
Here are the 70 samples I found in the "after" that were new: https://share.firefox.dev/3oIjDhw
Each roughly had half of the self time of the 70 samples.
icu_datetime::pattern::item::ule::::from_unaligned
icu_datetime::pattern::runtime::iter::PatternKindItemIterator as core::iter::traits::iterator::Iterator>::next
Interesting! Thank you! So basically, in this case, our tests forces more ULE operations. I'm wondering why, maybe I do ULE/de-ULE conversions unnecessairly?
A few notes:
Although 18,818 to 18,659 seems like a small fraction, the fraction will be quite a bit bigger after #1058 is merged.
From Greg's flame graphs, it seems that the main change is that previously you were calling
core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &mut F>::call_once
<icu_datetime::format::datetime::FormattedDateTime<T> as core::fmt::Display>::fmt
core::fmt::write
<core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::fold
criterion::routine::Routine::test
criterion::benchmark_group::BenchmarkGroup<M>::bench_function
datetime::datetime_benches::{{closure}}
but now you are calling
<icu_datetime::pattern::runtime::iter::PatternKindItemIterator as core::iter::traits::iterator::Iterator>::next
<icu_datetime::format::datetime::FormattedDateTime<T> as core::fmt::Display>::fmt
core::fmt::write
<core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::fold
criterion::routine::Routine::test
criterion::benchmark_group::BenchmarkGroup<M>::bench_function
datetime::datetime_benches::{{closure}}
I don't know what the FnOnce from the first sample is supposed to be.
Finally, perhaps the biggest thing that surprises me from the flame graph is that Postcard is calling malloc while deserializing. I do not know why it is doing this. It's almost as if the data is being deserialized into an Owned pattern instead of a Borrowed pattern, which could explain some of the reported performance differences, if we are now changing to use more borrowing.
To be clear, I'm talking about this:
alloc::raw_vec::RawVec<T,A>::allocate_in
<serde::de::impls::ArrayVisitor<[T; 12]> as serde::de::Visitor>::visit_seq
<&mut postcard::de::deserializer::Deserializer as serde::de::Deserializer>::deserialize_struct
<&mut postcard::de::deserializer::Deserializer as serde::de::Deserializer>::deserialize_struct
<&mut postcard::de::deserializer::Deserializer as serde::de::Deserializer>::deserialize_struct
<icu_provider_blob::static_data_provider::StaticDataProvider as icu_provider::data_provider::DataProvider<M>>::load_payload
icu_datetime::datetime::DateTimeFormat::try_new
<core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::fold
criterion::routine::Routine::test
criterion::benchmark_group::BenchmarkGroup<M>::bench_function
datetime::datetime_benches::{{closure}}
This is pre-existing, but it may explain some of the unintuitive results.