butterfree Make pipelines aware of a timezone configuration

Open roelschr opened this issue 3 years ago • 1 comments

Why? :open_book:

While the timezone of Spark's TimestampType is controlled by the spark.sql.session.timeZone configuration option, Python's datetime objects are interpreted in the system's timezone when they carry no explicit timezone. This means the same timestamp can be converted differently when a transformation runs on different systems.

One example of irregular results occurs when we automatically set the start_date of AggregatedFeatureSets (here). Spark and the system can have different timezones, so a timestamp from a Spark dataframe can change when it is collected into plain Python as a datetime object, producing a start_date different than expected.
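
The effect can be reproduced without Spark. The sketch below is only an illustration: it converts one fixed instant into a naive datetime under two simulated system timezones. The helper name `naive_local_datetime` and the POSIX TZ strings `UTC0` and `BRT3` (UTC-3) are assumptions made for the example, and `time.tzset()` is only available on Unix.

```python
import os
import time
from datetime import datetime


def naive_local_datetime(epoch_seconds: float, system_tz: str) -> datetime:
    """Convert an instant to a naive datetime as a system running in `system_tz` would.

    Hypothetical helper for illustration; it temporarily overrides the TZ
    environment variable and restores the previous value afterwards.
    """
    previous = os.environ.get("TZ")
    os.environ["TZ"] = system_tz
    time.tzset()  # Unix only: makes the C library re-read TZ
    try:
        # fromtimestamp() without a tz argument uses the system's local timezone
        return datetime.fromtimestamp(epoch_seconds)
    finally:
        if previous is None:
            del os.environ["TZ"]
        else:
            os.environ["TZ"] = previous
        time.tzset()


instant = 1577836800  # 2020-01-01 00:00:00 UTC

utc_view = naive_local_datetime(instant, "UTC0")
utc_minus_3_view = naive_local_datetime(instant, "BRT3")

print(utc_view)          # 2020-01-01 00:00:00
print(utc_minus_3_view)  # 2019-12-31 21:00:00
```

The two results fall on different calendar days, which is the kind of shift that can move an automatically derived start_date.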

What? :wrench:

This PR proposes a configurable timezone setting that every pipeline is aware of and that is applied identically to Spark and the system.
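
A minimal sketch of the idea, not the actual implementation in this PR: the helper `apply_timezone`, the `PIPELINE_TIMEZONE` environment variable, and the `UTC` default are hypothetical names chosen for illustration. It assumes a Unix system (for `time.tzset()`) and a session object exposing `conf.set`, as a SparkSession does.

```python
import os
import time

DEFAULT_TIMEZONE = "UTC"  # assumed default, not taken from the PR


def apply_timezone(spark=None, timezone=None):
    """Apply one timezone to both the Python process and the Spark session.

    Hypothetical helper: `timezone` falls back to the PIPELINE_TIMEZONE
    environment variable (an assumed name) and then to DEFAULT_TIMEZONE.
    """
    tz = timezone or os.environ.get("PIPELINE_TIMEZONE", DEFAULT_TIMEZONE)

    # System side: naive datetime conversions follow the TZ variable.
    os.environ["TZ"] = tz
    time.tzset()  # Unix only

    # Spark side: TimestampType handling follows this session option.
    if spark is not None:
        spark.conf.set("spark.sql.session.timeZone", tz)

    return tz
```

Calling something like `apply_timezone(spark, "UTC")` once at the start of a pipeline run would keep timestamps collected from a dataframe consistent with the values Spark displays.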

Type of change

[x] New feature (non-breaking change which adds functionality)
[x] This change requires a documentation update

How was everything tested? :straight_ruler:

TODO.

Checklist

[x] My code follows the style guidelines of this project (docstrings, type hinting and linter compliance);
[x] I have performed a self-review of my own code;
[x] I have made corresponding changes to the documentation;
[ ] I have added tests that prove my fix is effective or that my feature works;
[x] New and existing unit tests pass locally with my changes;
[x] Add labels to distinguish the type of pull request. Available labels are bug, enhancement, feature, and review.