
Better handling of categorical future covariates

Open msedluk opened this issue 2 years ago • 4 comments

I have several features in my future covariates that are categorical. The first thing I found is that the LightGBM model seems to be the only model that can handle categorical future covariates. I use the categorical_future_covariates parameter to specify the features, and this is a great API feature. What I would like is for other models to support this; the most obvious would be the Linear Regression model and possibly the Temporal Fusion Transformer.

The second, related issue is that when you specify future covariate lags, the model calls _create_lagged_data, and the code does a great job of keeping track of which lagged columns come from categorical features, making sure they are also treated as categorical before building the model.

The subtle problem is that, if the model allows it, I would like the lagged categorical features to use the same embeddings as the non-lagged feature. They are really the same categorical variable, just shifted back a time step. As it stands, if you have 5 lags, the same categorical ends up in 5 different columns, and you really want them to interplay with each other when fitting the model. I understand that some models don't offer this when dealing with categorical features.
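To illustrate the idea with a hypothetical numpy sketch (not darts code; the table size, dimensions, and category codes below are invented): sharing a single embedding table across all lag columns means the same category always maps to the same vector, whichever lag it appears in.

```python
import numpy as np

# Hypothetical sketch: one embedding table shared by every lagged copy of a
# categorical feature, instead of an independent encoding per lag column.
n_categories, emb_dim = 5, 3
embedding = np.random.default_rng(0).normal(size=(n_categories, emb_dim))

# Category codes of the same feature at lags t-1 .. t-5 for one sample.
lagged_codes = np.array([2, 2, 1, 0, 4])

# All lag columns index the same table, so category 2 at lag t-1 and
# category 2 at lag t-2 receive identical vectors.
lagged_vectors = embedding[lagged_codes]  # shape (5, 3)
```

With per-column encodings instead, the model would have to rediscover that the 5 columns refer to the same underlying category.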

More models that handle categorical features would be great.

I can't think of any good alternatives.

msedluk avatar Jul 27 '23 17:07 msedluk

Hey @msedluk, thank you for the request.

Extending the list of models supporting categorical covariates is tracked by https://github.com/unit8co/darts/issues/1514.

Could you further explain what you had in mind concerning Linear Regression? Since Linear Regression is not meant to deal with categorical features, it would require the categorical features to be encoded first.

Regarding the lagged data, for the current models supporting categorical covariates (LightGBM/CatBoost), I don't think having a consistent encoding between lagged features matters. Happy to discuss this further.

jonasblanc avatar Apr 02 '25 11:04 jonasblanc

Most of the data I work with seems to have categorical features. For most of it I wrote my own models using PyTorch to handle the categorical features. It would be great to have more models handle categorical features, and it would be great if there were regression models that encoded categorical features and then used Linear Regression. I think that would be a big win for a lot of people.

msedluk avatar Apr 02 '25 18:04 msedluk

We agree that support for categorical features in some of our TorchForecastingModels would be a great feature. This is already on our roadmap: https://github.com/unit8co/darts/issues/1514.

For Linear Regression we aim to keep the data encoder and the model separated. Thus, as in scikit-learn, you would need to encode your data first and then fit the model.
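As a sketch of that encode-then-fit workflow in plain scikit-learn (the column names and random data below are invented for illustration, not from darts), one can one-hot encode the categorical column and fit LinearRegression in a single pipeline:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "cat_cov": rng.choice(["a", "b", "c"], 100),
    "num_cov": rng.normal(100, 10, 100),
})
y = rng.normal(100, 10, 100)

# One-hot encode cat_cov, pass num_cov through, then fit a linear model.
pre = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["cat_cov"])],
    remainder="passthrough",
)
model = make_pipeline(pre, LinearRegression())
model.fit(df, y)
pred = model.predict(df)
```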

Here are two ways of doing it. Let's say you have categorical features represented by prime numbers, but the actual numerical values of your class labels do not matter (not sure in which scenario this would happen, but let's pretend for the sake of the example).

import numpy as np
import pandas as pd

primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
df = pd.DataFrame({
    "target": np.random.normal(100, 10, 100),
    "cat_cov": np.random.choice(primes, 100, replace=True),
    "num_cov": np.random.normal(100, 10, 100),
})

If you wish to keep the ordinal relationship between the categories when fitting the model, darts' dataprocessing.transformers support sklearn encoders such as OrdinalEncoder.

import numpy as np
from darts import TimeSeries
from darts.dataprocessing.transformers import Scaler
from sklearn.preprocessing import OrdinalEncoder

# Keep the ordinal relationship between the categories
encoder = Scaler(scaler=OrdinalEncoder())
cat_series = TimeSeries.from_dataframe(df, value_cols=["cat_cov", "num_cov"])
encoded_series = encoder.fit_transform(cat_series, component_mask=np.array([True, False]))
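For intuition, here is what sklearn's OrdinalEncoder does to the prime-valued column on its own (a small standalone sketch, outside darts): each distinct value is replaced by its rank among the sorted unique values.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Sorted unique values are [2, 3, 5], so 2 -> 0, 3 -> 1, 5 -> 2.
codes = OrdinalEncoder().fit_transform(np.array([[2], [5], [3], [5], [2]]))
# codes is [[0.], [2.], [1.], [2.], [0.]]
```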

On the other hand, to treat your categorical data as nominal (no ordinal relationship), OneHotEncoder could be used. I believe Darts does not currently have the one-to-many-components dataprocessing.transformers that would be required for OneHotEncoder. However, this can easily be achieved before creating the TimeSeries:

from darts import TimeSeries

# One-hot encoding: no ordinal relationship
df = pd.get_dummies(df, columns=["cat_cov"], prefix=["cat_cov"])
encoded_series = TimeSeries.from_dataframe(df.drop(columns="target"))
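For reference, a small standalone sketch of what pd.get_dummies produces here (toy data, independent of the darts part): one indicator column per distinct category, with exactly one indicator set per row.

```python
import pandas as pd

toy = pd.DataFrame({"cat_cov": [2, 3, 5, 3, 2]})
encoded = pd.get_dummies(toy, columns=["cat_cov"], prefix="cat_cov")
# Columns: cat_cov_2, cat_cov_3, cat_cov_5
```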

I'm guessing the one-to-many components dataprocessing.transformers could be added if there is enough demand for it.

jonasblanc avatar Apr 04 '25 08:04 jonasblanc

Thanks for this great and detailed answer @jonasblanc 🚀 I have nothing to add 💯

dennisbader avatar Apr 04 '25 09:04 dennisbader