Add One-Hot Encoder to Time Axis Encoders
Is your feature request related to a current problem?
Some models may perform better with some datetime attributes one-hot encoded. Currently, we have to use the datetime_attribute_timeseries() function to generate these encodings. Using this function for the one-hot encoded attributes and the add_encoder parameter for the others requires extra custom code from users.
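For illustration, here is roughly what the hand-rolled workaround amounts to, sketched with plain pandas (pd.get_dummies as a stand-in for the encoding step; the column names are just examples):

```python
import pandas as pd

# Build one-hot month covariates by hand, instead of having an
# encoder class do it via add_encoders.
idx = pd.date_range("2020-01-01", periods=24, freq="MS")
month = pd.Series(idx.month, index=idx, name="month")

# One 0/1 column per observed month.
one_hot = pd.get_dummies(month, prefix="month", dtype=int)
print(one_hot.shape)  # (24, 12)
```

The result then still has to be wrapped into a covariate series and passed to the model manually, which is the custom code the proposal would remove.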
Describe proposed solution
Add a OneHotTemporalEncoder class which generates one-hot encodings for datetime attributes. The encoder should be usable by SequentialEncoder and via the add_encoder parameter of models.
Hi @konsram and thanks for opening this issue. We haven't added a one-hot encoder (yet) because the number of generated features may differ based on the forecast horizon / output chunk during inference.
E.g. assume you want to one-hot encode the month, and you have a series with hourly frequency spanning 2 years. Now you want to forecast the next 24 hours.
During training, the one-hot encoding works fine and generates 12 month features (or 11). But during inference, there would only be a single month feature (from the 24-hour horizon).
We would have to write some logic to handle this case.
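The mismatch can be reproduced with plain pandas (a sketch of the failure mode, not Darts code):

```python
import pandas as pd

# Naive one-hot over the training index yields 12 month columns,
# but over a 24-hour forecast horizon only 1.
train_idx = pd.date_range("2020-01-01", periods=2 * 365 * 24, freq="h")
horizon_idx = pd.date_range(
    train_idx[-1] + pd.Timedelta(hours=1), periods=24, freq="h"
)

n_train_cols = pd.get_dummies(train_idx.month).shape[1]
n_pred_cols = pd.get_dummies(horizon_idx.month).shape[1]
print(n_train_cols, n_pred_cols)  # 12 vs 1 -> feature count mismatch
```

A model trained on 12 covariate columns cannot consume the single column produced at prediction time.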
I can add it to our backlog. Would this be something you would like to contribute? :)
Hi @dennisbader, thanks for your reply. Wouldn't it be possible to implement a one-hot encoder similar to the CyclicTemporalEncoder, using the datetime_attribute_timeseries() function? In this case the implementations of accept_transformer() and encoding_n_components() should depend on the selected attribute and the frequency of the time index (e.g. not generating 60 features for the attribute minute when we have a frequency of 15 minutes).
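To sketch the frequency-dependent component count (encoding_n_components is the hypothetical hook discussed above; this is just the arithmetic it could use):

```python
import pandas as pd

# Derive the number of one-hot components for the "minute" attribute
# from the index frequency rather than from the observed values.
idx = pd.date_range("2020-01-01", periods=8, freq="15min")

# 60 minutes per hour / 15-minute step -> 4 distinct minute values.
n_components = pd.Timedelta("1h") // pd.Timedelta(idx.freq)
print(n_components)  # 4
```

This avoids emitting 60 mostly-constant dummy columns when the series can only ever take 4 minute values.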
I don't quite understand why the single month feature in your example would be a problem, except that the value is rather static in this specific case.
Hi @konsram, the number of features that the model was trained on must be equal to the number of features used for prediction. In my example above, the training set would contain 12 features for all months (because the data covers two years, hence we observe all 12 months). During prediction, the encoder would only generate one feature for the month X (because the horizon only covers 24 hours).
So we would have to make sure that even in that case, the encoder would still add all remaining month features (11) filled with zeros.
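One way to get that zero-filling with pandas is to fix the category set up front (a sketch, not the eventual Darts implementation):

```python
import pandas as pd

# Fix the category set so the short horizon still produces all 12
# month columns; the absent months come out as all-zero columns.
horizon_idx = pd.date_range("2021-12-31", periods=24, freq="h")
months = pd.Categorical(horizon_idx.month, categories=range(1, 13))
one_hot = pd.get_dummies(months, prefix="month", dtype=int)
print(one_hot.shape)  # (24, 12): 11 all-zero columns plus month_12
```

The encoder would have to infer the full category set from the attribute (and, per the discussion above, possibly the frequency), not from the values observed in the horizon.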
Of course it's possible, but I wanted to mention what we need to take into account, and why it's not "as simple" as for the other encoders.
Hi @dennisbader, thanks for the clarification. I’ll try to contribute to this, but I’ll need to see when I can get to it.
Sounds great, thanks @konsram 🚀
While implementing this feature, I noticed an inconsistency in the implementation of datetime_attribute_timeseries regarding datetime attributes with a variable maximum number of unique values, such as day, dayofyear, and week. For dayofyear and week, the cyclic encoding always uses the maximum possible attribute value (365 or 366 for days, 52 or 53 for weeks) to calculate the period, depending on whether a leap year is present. In contrast, for the day attribute, the period depends on the maximum number of days in each month, so day is encoded relatively, while the other attributes are encoded absolutely. I suggest making this behavior consistent and allowing users to configure it.
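The two behaviors can be seen directly from pandas (a small illustration of the relative vs. absolute period):

```python
import pandas as pd

# "day" has a per-month maximum, so a cyclic encoding built on it
# uses a period that varies month to month (relative encoding)...
feb = pd.Timestamp("2021-02-15")
jul = pd.Timestamp("2021-07-15")
print(feb.days_in_month, jul.days_in_month)  # 28 vs 31

# ...while "dayofyear" is measured against a single yearly maximum
# (absolute encoding).
print(pd.Timestamp("2021-12-31").dayofyear)  # 365
```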
Another aspect users might want to control is how one-hot-encoded attributes handle leap years. If we use the raw attributes from pandas, days or weeks after February will have different encodings depending on whether it is a leap year. For fixed holidays like New Year's Eve, this may be undesirable, as the day would be encoded either in the 365th or 366th dummy variable. Instead, users might prefer to always encode December 31st as the 366th dummy variable. What do you think about adding a parameter to control this behavior?
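The shift is easy to demonstrate with pandas:

```python
import pandas as pd

# December 31st lands on dayofyear 365 or 366 depending on whether
# the year is a leap year, so its one-hot position moves between years.
nye_2021 = pd.Timestamp("2021-12-31").dayofyear  # non-leap year
nye_2020 = pd.Timestamp("2020-12-31").dayofyear  # leap year
print(nye_2021, nye_2020)  # 365 366
```

A parameter could, for example, remap post-February days in non-leap years so that December 31st always occupies the 366th slot.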