chrono Explore using icu4

Relates to #854, but doesn't replace datetime code as a mentioned possible approach.

Uses icu4 for:

short_months
long_months
short_weekdays
long_weekdays
am_pm
decimal_point

Still uses pure_rust_locales for (I couldn't figure out how to get this from icu4 yet...):

d_fmt
d_t_fmt
t_fmt
t_fmt_ampm

Thanks for contributing to chrono!

If your feature is semver-compatible, please target the 0.4.x branch; the main branch will be used for 0.5.0 development going forward.

Please consider adding a test to ensure your bug fix/feature will not break in the future.

Aug 26 '23 20:08 bim9262

Codecov Report

Merging #1227 (c83af79) into 0.4.x (7eb6db1) will decrease coverage by 0.11%. The diff coverage is 86.69%.

@@            Coverage Diff             @@
##            0.4.x    #1227      +/-   ##
==========================================
- Coverage   86.14%   86.03%   -0.11%     
==========================================
  Files          37       37              
  Lines       13394    13663     +269     
==========================================
+ Hits        11538    11755     +217     
- Misses       1856     1908      +52

Files Changed	Coverage Δ
src/format/locales.rs	`82.31% <81.36%> (-17.69%)`	:arrow_down:
src/format/formatting.rs	`94.78% <96.80%> (+0.05%)`	:arrow_up:
src/datetime/tests.rs	`96.44% <100.00%> (ø)`
src/format/strftime.rs	`98.94% <100.00%> (ø)`

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

Aug 26 '23 20:08 codecov[bot]

What's the benefit here? How does it impact compile sizes and binary sizes?

Aug 26 '23 20:08 djc

My end goal is to support #281 and to improve localization overall - although I suppose it is subjective if the data from icu is better than the data in pure_rust_locales). In the tests you can see that there's some discrepencies in the localization data between the two.

For the compile size are you referring to the size of libchrono.rlib?

I made a little test wrapper that depends on chrono ( chrono = { version = "0.4.26", features = ["unstable-locales"] }) and compared the size with and without the icu addition.

without icu the bin is 4.6M with icu the bin is 7.5M.

Aug 26 '23 22:08 bim9262

So a 3M binary size penalty seems like a big downside...

Aug 28 '23 07:08 djc

My end goal is to support #281 and to improve localization overall - although I suppose it is subjective if the data from icu is better than the data in pure_rust_locales).

Good that you look at localization overall and not just adding "nd", "th" suffixes :+1:. That looks to English-focused to me, as there are similar but different shortcomings when formatting in other languages.

But I don't think our strftime-format string is going to be flexible enough to provide optimal localized formatting. And I don't think we have the experience to come up with enough extensions to get this right either. If one project can design a good format and API it would be ICU.

Do we want to combine some localization data from ICU, while lacking their formatting API, with the formatting string format from libc?

Given that we use a strftime formatting string like libc, it makes sense to me that we also use the localization data from libc. Our localization would be 'as good as', and match, libc. That is already a useful thing for the Rust ecosystem. And when you want more advanced formatting, ICU4X would be the way to go.

Aug 28 '23 10:08 pitdicker

So a 3M binary size penalty seems like a big downside...

We already don't include timezone data but leave it in chrono-tz because it would add 2+ Mb to the binary size (and there is the problem it can get outdated...). I would argue that that data is a lot closer to the core functionality of chrono.

Aug 28 '23 10:08 pitdicker

Some links for a bit of context/history of our localization support:

https://github.com/chronotope/chrono/pull/453
https://github.com/chronotope/chrono/issues/475
https://github.com/chronotope/chrono/pull/1152
https://github.com/chronotope/chrono/pull/1165

Some of the goals where to not generate the data at build time, to not increase the binary size with locale data that is not used, and to work without allocating.

Aug 28 '23 11:08 pitdicker

I'll take a look at generating rust files from the icu data that only has the data we'd need. For example there was a lot of extra information included about non-gregorian calendars. Producing data in a format like pure-rust-locales should allow us to have &'static strs, and not need to alloc. This should also cut down on the size of the additional allocation information. :crossed_fingers:

Aug 28 '23 22:08 bim9262

Upon further testing by dumping out some mods from the ICU data I see results like:

#[allow(non_snake_case,non_camel_case_types,dead_code,unused_imports)]
pub mod aa_DJ {
    pub const SHORT_MONTHS: &[&str] = &["M01", "M02", "M03", "M04", "M05", "M06", "M07", "M08", "M09", "M10", "M11", "M12"];
    pub const LONG_MONTHS: &[&str] = &["M01", "M02", "M03", "M04", "M05", "M06", "M07", "M08", "M09", "M10", "M11", "M12"];
    pub const SHORT_WEEKDAYS: &[&str] = &["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"];
    pub const LONG_WEEKDAYS: &[&str] = &["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"];
    pub const AM_PM: &[&str] = &["AM", "PM"];
    pub const DECIMAL_POINT: &str = ".";
    pub const DIGITS: &[char] = &['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'];
}

Since there the data for months doesn't look good. I don't think that it would be a good to use icu for months (and probably not weekdays either)

The data for decimal and digits looks promising (the digits are displayed 9-0 on github, even though they are 0-9 in my terminal. Probably due to left to right vs right to left. As a side note, although arabic is read right to left, the numbers are constructed left to right):

pub mod ar_SY {
    [...]
    pub const DECIMAL_POINT: &str = "٫";
    pub const DIGITS: &[char] = &['٠', '١', '٢', '٣', '٤', '٥', '٦', '٧', '٨', '٩'];
}

Still need to investigate the ordinal numbers...

Aug 29 '23 05:08 bim9262

This shows a comparison between the glibc and icu for am_pm and the decimal point

https://gist.github.com/bim9262/920d29bd6155e207239bf11f42fe5da6

There are a lot of interesting cases where glibc has something more specific than am/pm or icu will be more specific.

Some examples to check out:

hsb_DE (glibc blank, icu localized)
ig_NG (glibc am/pm, icu localized)
hif_FJ (glibc localized, icu am/pm)
ks_IN (both localized, but very different)

I'd be tempted to take use glibc unless it's blank or am/pm then try icu as the fallback.

Aug 29 '23 07:08 bim9262

But I don't think our strftime-format string is going to be flexible enough to provide optimal localized formatting. And I don't think we have the experience to come up with enough extensions to get this right either. If one project can design a good format and API it would be ICU.

ICU has a set of patterns for date & time components. See https://cldr.unicode.org/translation/date-time/date-time-patterns

It covers a lot more cases than the strftime formatting specifiers. You can use the ICU data with the strftime specifiers as well. You just have to make some decisions about the mapping. I did this with my Perl module DateTime. I could enumerate the choices I made there if there's interest in this.

Feb 11 '24 05:02 autarch

Upon further testing by dumping out some mods from the ICU data I see results like:

#[allow(non_snake_case,non_camel_case_types,dead_code,unused_imports)]
pub mod aa_DJ {
    pub const SHORT_MONTHS: &[&str] = &["M01", "M02", "M03", "M04", "M05", "M06", "M07", "M08", "M09", "M10", "M11", "M12"];
    pub const LONG_MONTHS: &[&str] = &["M01", "M02", "M03", "M04", "M05", "M06", "M07", "M08", "M09", "M10", "M11", "M12"];
    pub const SHORT_WEEKDAYS: &[&str] = &["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"];
    pub const LONG_WEEKDAYS: &[&str] = &["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"];
    pub const AM_PM: &[&str] = &["AM", "PM"];
    pub const DECIMAL_POINT: &str = ".";
    pub const DIGITS: &[char] = &['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'];
}

Since there the data for months doesn't look good. I don't think that it would be a good to use icu for months (and probably not weekdays either)

This sort of thing happens when the ICU project's data for a locale is incomplete. You can see the underlying data for that locale at https://github.com/unicode-org/cldr/blob/main/common/main/aa_DJ.xml. It's nearly empty. Compare it to a more complete locale at https://github.com/unicode-org/cldr/blob/main/common/main/am.xml.

There's a whole system for locale "inheritance". In the case of aa-DJ it should fall back to aa, which does have data (at least in the CLDR GitHub repo). But the output above looks like it's falling back to the root locale. I'm not sure how you dumped that data. Or maybe it's using an older version of the data where the aa locale was empty.

Feb 11 '24 05:02 autarch

One more thing re: CLDR formatting. If you want to support this I think it's best to come up with new functions that accept these formats, like format_cldr or something along those lines.

Feb 11 '24 05:02 autarch

So a 3M binary size penalty seems like a big downside...

I don't think our strftime-format string is going to be flexible enough to provide optimal localized formatting. And I don't think we have the experience to come up with enough extensions to get this right either. If one project can design a good format and API it would be ICU.

Do we want to combine some localization data from ICU, while lacking their formatting API, with the formatting string format from libc?

Given that we use a strftime formatting string like libc, it makes sense to me that we also use the localization data from libc. Our localization would be 'as good as', and match, libc. That is already a useful thing for the Rust ecosystem. And when you want more advanced formatting, ICU4X would be the way to go

In my opinion chrono should focus mostly on being a good date-and-time library. ICU can be better at localization. I propose to close this PR and instead make it easy to convert to/from icu::calendar::{Date, DateTime}. Maybe behind an icu feature?

Apr 10 '24 12:04 pitdicker

chrono chrono copied to clipboard

Explore using icu4

Thanks for contributing to chrono!

Codecov Report

chrono
chrono copied to clipboard