feat: Added frequency aware one-hot and relative cyclic encoding.
Checklist before merging this PR:
- [x] Mentioned all issues that this PR fixes or addresses.
- [x] Summarized the updates of this PR under Summary.
- [ ] Added an entry under Unreleased in the Changelog.
Fixes #2842.
Summary
Changes in this PR give users more options for encoding datetime attributes. This includes:
1. Fixing the inconsistent behavior of cyclic encodings. Previously, `day` was encoded relative to the number of days in a month, while other attributes with a variable maximum (`dayofyear`, `day_of_year`, `week`, `weekofyear`, `week_of_year`) were encoded relative to the maximum on the specified time index.
2. Adding frequency-aware one-hot encodings for datetime attributes. Previously, one-hot encodings always considered all possible values of an attribute (e.g. 60 values for `minute`). Users now have the option to use a frequency-aware one-hot encoding, which uses the start and the frequency of the time index to determine the possible values (e.g. `(0, 15, 30, 45)` when the start is normalized to the hour and the frequency is `15min`). This reduces the number of covariates, which may be critical for models that can't handle high-dimensional feature spaces.
3. Adding a `OneHotTemporalEncoder` class, which uses the functionality from (2) and integrates into `SequentialEncoder`. This requires changing the attributes available to `SingleEncoder` (encoders must be aware of the frequency and start time of the data).
4. Extending the `CyclicTemporalEncoder` to reflect the changes in (1).
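To make (1) and (2) concrete, here is a minimal sketch in plain pandas/NumPy. It is not the code in this PR: the helper names `frequency_aware_values`, `one_hot`, and `relative_cyclic_dayofyear` are hypothetical, the value search assumes the frequency divides the cycle evenly, and the cyclic part shows only one possible reading of "relative" (dividing by the number of days in each timestamp's own year).

```python
import numpy as np
import pandas as pd


def frequency_aware_values(index: pd.DatetimeIndex, attribute: str, cycle: pd.Timedelta) -> list:
    """Attribute values reachable from index[0] when stepping by index.freq.

    Hypothetical helper. Assumes index.freq is set and divides `cycle` evenly
    (e.g. 15min within one hour). Requires pandas >= 1.4 for `inclusive`.
    """
    rng = pd.date_range(
        start=index[0], end=index[0] + cycle, freq=index.freq, inclusive="left"
    )
    return sorted({int(v) for v in getattr(rng, attribute)})


def one_hot(index: pd.DatetimeIndex, attribute: str, categories: list) -> pd.DataFrame:
    """One column per category; 1 where the attribute equals that category."""
    values = np.asarray(getattr(index, attribute))
    encoded = (values[:, None] == np.asarray(categories)[None, :]).astype(int)
    columns = [f"{attribute}_{c}" for c in categories]
    return pd.DataFrame(encoded, index=index, columns=columns)


def relative_cyclic_dayofyear(index: pd.DatetimeIndex) -> pd.DataFrame:
    """Sine/cosine of day-of-year, scaled by the length of each timestamp's year.

    Assumption: the period is 366 in leap years and 365 otherwise, rather than
    the maximum value observed on the index.
    """
    days_in_year = np.where(index.is_leap_year, 366, 365)
    angle = 2 * np.pi * (np.asarray(index.dayofyear) - 1) / days_in_year
    return pd.DataFrame(
        {"dayofyear_sin": np.sin(angle), "dayofyear_cos": np.cos(angle)}, index=index
    )


# Example: a 15-minute index that starts on the full hour.
index = pd.date_range("2024-01-01 00:00", periods=8, freq="15min")
minutes = frequency_aware_values(index, "minute", pd.Timedelta(hours=1))
encoded = one_hot(index, "minute", minutes)  # 4 columns instead of 60

daily = pd.date_range("2024-01-01", periods=3, freq="D")
cyclic = relative_cyclic_dayofyear(daily)
```

With a frequency such as `7min`, which does not divide the hour evenly, a single cycle does not cover all reachable values. That is one case where the frequency-unaware encoding is the safer fallback.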
Other Information
Draft Progress
| Change | Implementation | Tests | Documentation |
|---|---|---|---|
| Inconsistent Cyclic Encoding | :heavy_check_mark: | :x: | :x: |
| Frequency Aware One-Hot-Encoding | :heavy_check_mark: | :x: | :x: |
| `OneHotTemporalEncoder` | :x: | :x: | :x: |
| Changes to `CyclicTemporalEncoder` | :x: | :x: | :x: |
@dennisbader What do you think about the new options for encodings mentioned in (1) and (2)? The frequency awareness for one-hot encodings does not cover all possible frequencies, but I tried to address common scenarios, and users can always use the frequency-unaware version if needed.
Do you think changing the information provided to SingleEncoders is a viable approach for (3) and (4)?
Regarding my comment in #2842: I am currently using the raw values. What do you think about this?
Codecov Report
:x: Patch coverage is 86.04651% with 12 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 95.12%. Comparing base (8821f51) to head (924448c).
:warning: Report is 27 commits behind head on master.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| darts/utils/timeseries_generation.py | 86.04% | 12 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## master #2893 +/- ##
==========================================
- Coverage 95.27% 95.12% -0.15%
==========================================
Files 146 146
Lines 15588 15640 +52
==========================================
+ Hits 14851 14878 +27
- Misses 737 762 +25
@dennisbader, what do you think about the proposed changes, especially making the encoders aware of a time series' frequency and start? Do you see any drawbacks to this approach?