butterfree
Make pipelines aware of a timezone configuration
Why? :open_book:
While the timezone of Spark's TimestampType is controlled by the spark.sql.session.timeZone configuration option, Python's datetime objects are interpreted in the system's timezone when they carry no explicit tz suffix. As a result, the same transformation can convert timestamps differently when running on different systems.
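To make the system-timezone dependence concrete, here is a minimal sketch using only the standard library. The helper `collect_as_local` is hypothetical and not part of butterfree; it relies on `time.tzset()`, which is available on Unix only, and uses POSIX-style TZ strings (`UTC0`, `EST5`) so it does not depend on an installed timezone database.

```python
import os
import time
from datetime import datetime


def collect_as_local(epoch_seconds: float, system_tz: str) -> datetime:
    """Return the naive datetime Python builds for an instant under a given system timezone.

    Hypothetical helper for illustration only (Unix only, because of time.tzset).
    """
    previous = os.environ.get("TZ")
    os.environ["TZ"] = system_tz
    time.tzset()  # make the C library re-read the TZ variable
    try:
        # fromtimestamp() without a tz argument uses the system's local timezone
        return datetime.fromtimestamp(epoch_seconds)
    finally:
        # restore the original system timezone
        if previous is None:
            os.environ.pop("TZ", None)
        else:
            os.environ["TZ"] = previous
        time.tzset()


instant = 0  # the Unix epoch, 1970-01-01T00:00:00Z
utc_value = collect_as_local(instant, "UTC0")  # system at UTC
est_value = collect_as_local(instant, "EST5")  # system five hours behind UTC

print(utc_value)
print(est_value)
```

The same instant yields two different naive datetime values, which is the kind of drift that can occur when a timestamp is collected from a Spark dataframe on machines configured with different timezones.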
One example of irregular results occurs when we automatically set the start_date of AggregatedFeatureSets (here). Spark and the system can sometimes have different timezones, so a timestamp from the Spark dataframe can change when it is collected into plain Python as a datetime object, producing a start_date different than expected.
What? :wrench:
This PR proposes a timezone configuration that every pipeline is aware of and that is applied identically to Spark and to the system. The timezone is configurable.
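A rough sketch of the idea, not the implementation in this PR: the helper `apply_timezone` and the constant `SPARK_TZ_KEY` are hypothetical names. The sketch sets the process timezone through the `TZ` environment variable (Unix only) and returns the matching `spark.sql.session.timeZone` setting, so both sides are driven by a single value.

```python
import os
import time
from datetime import datetime
from typing import Dict

SPARK_TZ_KEY = "spark.sql.session.timeZone"  # real Spark option name


def apply_timezone(timezone: str = "UTC") -> Dict[str, str]:
    """Hypothetical helper: align the system timezone and build the Spark setting.

    Assumes a Unix system, since time.tzset() is not available on Windows.
    """
    os.environ["TZ"] = timezone
    time.tzset()  # make Python's local-time conversions use the new timezone
    return {SPARK_TZ_KEY: timezone}


spark_conf = apply_timezone("UTC")

# Assuming PySpark is installed, the returned setting could be applied like this:
#   from pyspark.sql import SparkSession
#   builder = SparkSession.builder
#   for key, value in spark_conf.items():
#       builder = builder.config(key, value)
#   spark = builder.getOrCreate()

# With the system timezone set to UTC, the epoch maps to midnight local time.
epoch_local = datetime.fromtimestamp(0)
print(spark_conf, epoch_local)
```

Deriving both settings from one value means a pipeline cannot end up with Spark and the system on different timezones by accident.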
Type of change
Please delete options that are not relevant.
- [x] New feature (non-breaking change which adds functionality)
- [x] This change requires a documentation update
How was everything tested? :straight_ruler:
TODO.
Checklist
- [x] My code follows the style guidelines of this project (docstrings, type hinting and linter compliance);
- [x] I have performed a self-review of my own code;
- [x] I have made corresponding changes to the documentation;
- [ ] I have added tests that prove my fix is effective or that my feature works;
- [x] New and existing unit tests pass locally with my changes;
- [x] Add labels to distinguish the type of pull request. Available labels are bug, enhancement, feature, and review.
Attention Points :warning:
Replace me for what the reviewer will need to pay attention to in the PR or just to cover any concerns after the merge.
Kudos, SonarCloud Quality Gate passed!
0 Bugs
0 Vulnerabilities (and 0 Security Hotspots to review)
0 Code Smells
No Coverage information
0.0% Duplication