
TZDB Datagen

Open nordzilla opened this issue 2 years ago • 9 comments

Adds data generation for historic time zone transitions as well as time zone transition rules.


Historic Transitions

A multi-dimensional mapping from BCP47 TZID to daylight/GMT info to a list of historic timestamps at which a daylight saving time transition occurred.

e.g. (not representative of the data's actual format in ICU4X)

uslax => { GMT-8, DST(false) } => [ timestamp_1, timestamp_2, ... timestamp_n ]
uslax => { GMT-7, DST(true)  } => [ timestamp_1, timestamp_2, ... timestamp_n ]

Given a ZonedDateTime, the historic time zone transitions are effectively a lookup table that can be used to determine the GMT offset and whether the variant is standard or daylight time at a given point in history.
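To make the lookup concrete, here is a minimal sketch in Rust. It is not the ICU4X data model: the `Variant` struct, the `ZoneTransitions` alias, and the `variant_at` function are hypothetical names, and the sketch assumes each timestamp marks the moment the zone switched into that variant, with every list sorted in ascending order.

```rust
use std::collections::BTreeMap;

/// Hypothetical key: a GMT offset plus a daylight flag, as in the example above.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Variant {
    pub gmt_offset_hours: i8,
    pub is_dst: bool,
}

/// Hypothetical per-zone table: variant => sorted timestamps (seconds since
/// the Unix epoch) at which the zone switched into that variant.
pub type ZoneTransitions = BTreeMap<Variant, Vec<i64>>;

/// Returns the variant in effect at `timestamp`, or `None` if the timestamp
/// precedes every recorded transition.
pub fn variant_at(zone: &ZoneTransitions, timestamp: i64) -> Option<Variant> {
    zone.iter()
        .filter_map(|(variant, starts)| {
            // Count the transitions into this variant at or before `timestamp`.
            let n = starts.partition_point(|&t| t <= timestamp);
            // Keep the latest such transition, if there is one.
            n.checked_sub(1).map(|i| (starts[i], *variant))
        })
        // The most recent transition across all variants wins.
        .max_by_key(|&(start, _)| start)
        .map(|(_, variant)| variant)
}
```

Each list is binary-searched, so a lookup costs O(v log n) for v variants with at most n transitions each.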


Transition Rules

A mapping from BCP47 TZID to information about daylight saving transition offsets and when they occur.

e.g. (not representative of the data's actual format in ICU4X)

uslax => {
  STD(GMT-8),
  DST(GMT-7),
  DSTStart {
    Month(3),
    Week(2),
    Day(0),
    Time(2:00 AM),
  },
  DSTEnd {
    Month(11),
    Week(1),
    Day(0),
    Time(2:00 AM),
  }
}

The transition rules provide the current GMT offsets together with the day of the year and time of day at which daylight saving time transitions occur in a given time zone. This can be used to determine the GMT offset and daylight variant for ZonedDateTimes in the future, as a backup in case there is no historic data available, or as an extremely lightweight dataset if an application will only be formatting current-time dates, and not dates that span into the past or future.
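As an illustration of how such a rule could be evaluated, here is a minimal sketch in Rust. It assumes that `Week(n)` with `Day(d)` means the n-th occurrence of weekday d in the month, with 0 = Sunday (the same convention as the POSIX TZ form `M3.2.0`). The `TransitionRule` struct and all function names are hypothetical, not part of ICU4X. `is_daylight` compares local wall-clock values only, assumes the start rule falls before the end rule within a year, and ignores the ambiguous hour around each transition.

```rust
/// Hypothetical rule shape mirroring the DSTStart/DSTEnd pseudocode above.
#[derive(Debug, Clone, Copy)]
pub struct TransitionRule {
    pub month: u32, // 1 = January
    pub week: u32,  // 1 = first occurrence of `day` in the month
    pub day: u32,   // 0 = Sunday
    pub hour: u32,  // local wall-clock hour of the transition
}

/// Day of the week for a Gregorian date, 0 = Sunday (Sakamoto's method).
fn day_of_week(year: i32, month: u32, day: u32) -> u32 {
    const T: [i32; 12] = [0, 3, 2, 5, 0, 3, 5, 1, 4, 6, 2, 4];
    let y = if month < 3 { year - 1 } else { year };
    (y + y / 4 - y / 100 + y / 400 + T[(month - 1) as usize] + day as i32).rem_euclid(7) as u32
}

/// Day of the month on which `rule` fires in `year`.
/// Does not check that the result stays within the month.
pub fn day_of_month(rule: &TransitionRule, year: i32) -> u32 {
    let first = day_of_week(year, rule.month, 1);
    // Date of the first occurrence of the requested weekday.
    let first_match = 1 + (rule.day + 7 - first) % 7;
    first_match + 7 * (rule.week - 1)
}

/// Whether a local date and hour falls inside the daylight period.
pub fn is_daylight(
    start: &TransitionRule,
    end: &TransitionRule,
    year: i32,
    month: u32,
    day: u32,
    hour: u32,
) -> bool {
    let at = (month, day, hour);
    let s = (start.month, day_of_month(start, year), start.hour);
    let e = (end.month, day_of_month(end, year), end.hour);
    // Tuples compare lexicographically: month, then day, then hour.
    s <= at && at < e
}
```

With the `uslax` rules above, `day_of_month` gives March 13 and November 6 for 2022.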

nordzilla avatar Dec 22 '22 21:12 nordzilla

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Dec 22 '22 21:12 CLAassistant

Notice: the branch changed across the force-push!

  • components/datetime/src/provider/tzdb/serde.rs is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

CI is failing because --all-keys now requires the tzdb, which technically is a breaking change. We might have to bump datagen to 2.0, although then we lose the version match for baked data.

robertbastian avatar Dec 23 '22 09:12 robertbastian

Hmm, interesting question about what to do when --all-keys requires an all-new data source.

sffc avatar Dec 23 '22 23:12 sffc

Let's default --tzdb-root to /usr/share/zoneinfo and print a warning when it's used (i.e. on SourceData::tzdb). This way datagen stays usable without compiling your own time zones.

This works for Mac and Linux, but probably not for Windows. Still, better than nothing.
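Roughly, the fallback could look like the sketch below. `resolve_tzdb_root` and `DEFAULT_TZDB_ROOT` are hypothetical names for illustration, not the datagen API, and the warning text is made up.

```rust
use std::path::PathBuf;

/// Default location of compiled time zone data on most Linux and macOS systems.
pub const DEFAULT_TZDB_ROOT: &str = "/usr/share/zoneinfo";

/// Returns the directory to read time zone data from, and whether the
/// default was used. `flag` is the value of --tzdb-root, if it was given.
pub fn resolve_tzdb_root(flag: Option<PathBuf>) -> (PathBuf, bool) {
    match flag {
        Some(path) => (path, false),
        None => {
            // Warn instead of failing so datagen stays usable without the flag.
            eprintln!(
                "warning: --tzdb-root not set, falling back to {}",
                DEFAULT_TZDB_ROOT
            );
            (PathBuf::from(DEFAULT_TZDB_ROOT), true)
        }
    }
}
```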

robertbastian avatar Jan 09 '23 13:01 robertbastian

Notice: the branch changed across the force-push!

  • Cargo.lock is different
  • components/timezone/Cargo.toml is different
  • components/timezone/src/provider/mod.rs is different
  • provider/datagen/Cargo.toml is different
  • provider/datagen/src/bin/datagen.rs is different
  • provider/datagen/src/lib.rs is different
  • provider/datagen/src/registry.rs is different
  • provider/datagen/src/source.rs is different
  • provider/datagen/src/transform/cldr/source.rs is different
  • provider/datagen/src/transform/icuexport/collator/mod.rs is different
  • provider/datagen/tests/verify-zero-copy.rs is different
  • provider/testdata/data/baked/mod.rs is different
  • provider/testdata/data/json/fingerprints.csv is different
  • provider/testdata/data/postcard/fingerprints.csv is different
  • provider/testdata/data/testdata.postcard is no longer changed in the branch
  • provider/testdata/data/tzif/Etc/GMT-4 is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Notice: the branch changed across the force-push!

  • Cargo.lock is different
  • provider/testdata/data/json/fingerprints.csv is different
  • provider/testdata/data/postcard/fingerprints.csv is different
  • provider/testdata/data/testdata.postcard is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@robertbastian

I've gone through and responded to all of your feedback.

The functionality is now experimental and everything uses AbstractFs. I've tested this with both zipped and uncompressed data.

I've also defaulted the path to /usr/share/zoneinfo which is working fine, both zipped and uncompressed.

I had to add some logic that first checks whether a file is intended to be TZif by checking the header (the first 4 bytes), and ignores the file otherwise, since /usr/share/zoneinfo contains other types of files in addition to the TZif files.
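For reference, the check amounts to something like the sketch below. It uses `std::fs` directly rather than AbstractFs, so it is not the code in this PR, and the function names are hypothetical. The only fact it relies on is that a TZif file begins with the four ASCII bytes `TZif`.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// The four-byte magic sequence at the start of every TZif file.
const TZIF_MAGIC: &[u8; 4] = b"TZif";

/// Whether a byte buffer starts with the TZif magic sequence.
pub fn has_tzif_magic(bytes: &[u8]) -> bool {
    bytes.starts_with(TZIF_MAGIC)
}

/// Reads only the first four bytes of a regular file and checks the magic.
/// Files shorter than four bytes are reported as not TZif.
pub fn is_tzif_file(path: &Path) -> io::Result<bool> {
    let mut header = [0u8; 4];
    match File::open(path)?.read_exact(&mut header) {
        Ok(()) => Ok(&header == TZIF_MAGIC),
        Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => Ok(false),
        Err(e) => Err(e),
    }
}
```

Reading just the header avoids loading tables such as `zone.tab` or `tzdata.zi` in full only to reject them.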

Perhaps it would be best to only have a warning instead of a hard error if the path is unspecified. People may want to use DateTime without time zones, so they shouldn't be required to load time zone data.

Tests are now passing on CI for Ubuntu and macOS, with the default directory leading to /usr/share/zoneinfo, but Windows is still failing because it doesn't ship with that data.

@sffc @robertbastian

How hard would it be to point our Windows CI job at the testdata directory for only this path when testing datagen?

nordzilla avatar Jan 18 '23 19:01 nordzilla

Merged for ya

robertbastian avatar May 10 '23 14:05 robertbastian