chrono
chrono copied to clipboard
Explore using icu4
Relates to #854, but doesn't replace datetime code as a mentioned possible approach.
Uses icu4 for:
- short_months
- long_months
- short_weekdays
- long_weekdays
- am_pm
- decimal_point
Still uses pure_rust_locales for (I couldn't figure out how to get this from icu4 yet...):
- d_fmt
- d_t_fmt
- t_fmt
- t_fmt_ampm
Thanks for contributing to chrono!
If your feature is semver-compatible, please target the 0.4.x branch; the main branch will be used for 0.5.0 development going forward.
Please consider adding a test to ensure your bug fix/feature will not break in the future.
Codecov Report
Merging #1227 (c83af79) into 0.4.x (7eb6db1) will decrease coverage by
0.11%. The diff coverage is86.69%.
@@ Coverage Diff @@
## 0.4.x #1227 +/- ##
==========================================
- Coverage 86.14% 86.03% -0.11%
==========================================
Files 37 37
Lines 13394 13663 +269
==========================================
+ Hits 11538 11755 +217
- Misses 1856 1908 +52
| Files Changed | Coverage Δ | |
|---|---|---|
| src/format/locales.rs | 82.31% <81.36%> (-17.69%) |
:arrow_down: |
| src/format/formatting.rs | 94.78% <96.80%> (+0.05%) |
:arrow_up: |
| src/datetime/tests.rs | 96.44% <100.00%> (ø) |
|
| src/format/strftime.rs | 98.94% <100.00%> (ø) |
:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more
What's the benefit here? How does it impact compile sizes and binary sizes?
My end goal is to support #281 and to improve localization overall - although I suppose it is subjective if the data from icu is better than the data in pure_rust_locales). In the tests you can see that there's some discrepencies in the localization data between the two.
For the compile size are you referring to the size of libchrono.rlib?
I made a little test wrapper that depends on chrono ( chrono = { version = "0.4.26", features = ["unstable-locales"] }) and compared the size with and without the icu addition.
without icu the bin is 4.6M with icu the bin is 7.5M.
So a 3M binary size penalty seems like a big downside...
My end goal is to support #281 and to improve localization overall - although I suppose it is subjective if the data from icu is better than the data in pure_rust_locales).
Good that you look at localization overall and not just adding "nd", "th" suffixes :+1:. That looks to English-focused to me, as there are similar but different shortcomings when formatting in other languages.
But I don't think our strftime-format string is going to be flexible enough to provide optimal localized formatting. And I don't think we have the experience to come up with enough extensions to get this right either. If one project can design a good format and API it would be ICU.
Do we want to combine some localization data from ICU, while lacking their formatting API, with the formatting string format from libc?
Given that we use a strftime formatting string like libc, it makes sense to me that we also use the localization data from libc. Our localization would be 'as good as', and match, libc. That is already a useful thing for the Rust ecosystem. And when you want more advanced formatting, ICU4X would be the way to go.
So a 3M binary size penalty seems like a big downside...
We already don't include timezone data but leave it in chrono-tz because it would add 2+ Mb to the binary size (and there is the problem it can get outdated...). I would argue that that data is a lot closer to the core functionality of chrono.
Some links for a bit of context/history of our localization support:
- https://github.com/chronotope/chrono/pull/453
- https://github.com/chronotope/chrono/issues/475
- https://github.com/chronotope/chrono/pull/1152
- https://github.com/chronotope/chrono/pull/1165
Some of the goals where to not generate the data at build time, to not increase the binary size with locale data that is not used, and to work without allocating.
I'll take a look at generating rust files from the icu data that only has the data we'd need. For example there was a lot of extra information included about non-gregorian calendars. Producing data in a format like pure-rust-locales should allow us to have &'static strs, and not need to alloc. This should also cut down on the size of the additional allocation information. :crossed_fingers:
Upon further testing by dumping out some mods from the ICU data I see results like:
#[allow(non_snake_case,non_camel_case_types,dead_code,unused_imports)]
pub mod aa_DJ {
pub const SHORT_MONTHS: &[&str] = &["M01", "M02", "M03", "M04", "M05", "M06", "M07", "M08", "M09", "M10", "M11", "M12"];
pub const LONG_MONTHS: &[&str] = &["M01", "M02", "M03", "M04", "M05", "M06", "M07", "M08", "M09", "M10", "M11", "M12"];
pub const SHORT_WEEKDAYS: &[&str] = &["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"];
pub const LONG_WEEKDAYS: &[&str] = &["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"];
pub const AM_PM: &[&str] = &["AM", "PM"];
pub const DECIMAL_POINT: &str = ".";
pub const DIGITS: &[char] = &['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'];
}
Since there the data for months doesn't look good. I don't think that it would be a good to use icu for months (and probably not weekdays either)
The data for decimal and digits looks promising (the digits are displayed 9-0 on github, even though they are 0-9 in my terminal. Probably due to left to right vs right to left. As a side note, although arabic is read right to left, the numbers are constructed left to right):
pub mod ar_SY {
[...]
pub const DECIMAL_POINT: &str = "٫";
pub const DIGITS: &[char] = &['٠', '١', '٢', '٣', '٤', '٥', '٦', '٧', '٨', '٩'];
}
Still need to investigate the ordinal numbers...
This shows a comparison between the glibc and icu for am_pm and the decimal point
https://gist.github.com/bim9262/920d29bd6155e207239bf11f42fe5da6
There are a lot of interesting cases where glibc has something more specific than am/pm or icu will be more specific.
Some examples to check out:
- hsb_DE (glibc blank, icu localized)
- ig_NG (glibc am/pm, icu localized)
- hif_FJ (glibc localized, icu am/pm)
- ks_IN (both localized, but very different)
I'd be tempted to take use glibc unless it's blank or am/pm then try icu as the fallback.
But I don't think our strftime-format string is going to be flexible enough to provide optimal localized formatting. And I don't think we have the experience to come up with enough extensions to get this right either. If one project can design a good format and API it would be ICU.
ICU has a set of patterns for date & time components. See https://cldr.unicode.org/translation/date-time/date-time-patterns
It covers a lot more cases than the strftime formatting specifiers. You can use the ICU data with the strftime specifiers as well. You just have to make some decisions about the mapping. I did this with my Perl module DateTime. I could enumerate the choices I made there if there's interest in this.
Upon further testing by dumping out some mods from the ICU data I see results like:
#[allow(non_snake_case,non_camel_case_types,dead_code,unused_imports)] pub mod aa_DJ { pub const SHORT_MONTHS: &[&str] = &["M01", "M02", "M03", "M04", "M05", "M06", "M07", "M08", "M09", "M10", "M11", "M12"]; pub const LONG_MONTHS: &[&str] = &["M01", "M02", "M03", "M04", "M05", "M06", "M07", "M08", "M09", "M10", "M11", "M12"]; pub const SHORT_WEEKDAYS: &[&str] = &["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]; pub const LONG_WEEKDAYS: &[&str] = &["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]; pub const AM_PM: &[&str] = &["AM", "PM"]; pub const DECIMAL_POINT: &str = "."; pub const DIGITS: &[char] = &['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']; }Since there the data for months doesn't look good. I don't think that it would be a good to use icu for months (and probably not weekdays either)
This sort of thing happens when the ICU project's data for a locale is incomplete. You can see the underlying data for that locale at https://github.com/unicode-org/cldr/blob/main/common/main/aa_DJ.xml. It's nearly empty. Compare it to a more complete locale at https://github.com/unicode-org/cldr/blob/main/common/main/am.xml.
There's a whole system for locale "inheritance". In the case of aa-DJ it should fall back to aa, which does have data (at least in the CLDR GitHub repo). But the output above looks like it's falling back to the root locale. I'm not sure how you dumped that data. Or maybe it's using an older version of the data where the aa locale was empty.
One more thing re: CLDR formatting. If you want to support this I think it's best to come up with new functions that accept these formats, like format_cldr or something along those lines.
So a 3M binary size penalty seems like a big downside...
I don't think our strftime-format string is going to be flexible enough to provide optimal localized formatting. And I don't think we have the experience to come up with enough extensions to get this right either. If one project can design a good format and API it would be ICU.
Do we want to combine some localization data from ICU, while lacking their formatting API, with the formatting string format from libc?
Given that we use a strftime formatting string like libc, it makes sense to me that we also use the localization data from libc. Our localization would be 'as good as', and match, libc. That is already a useful thing for the Rust ecosystem. And when you want more advanced formatting, ICU4X would be the way to go
In my opinion chrono should focus mostly on being a good date-and-time library. ICU can be better at localization. I propose to close this PR and instead make it easy to convert to/from icu::calendar::{Date, DateTime}. Maybe behind an icu feature?