
Easily adding one-hot encoding

Open dorienh opened this issue 1 year ago • 5 comments

Thanks for the great library!

I was wondering if there is a straightforward way to apply one-hot encoding to the categorical variables:

self.training_dataset = TimeSeriesDataSet(
  ...
  time_varying_known_categoricals=['pair'],
  ...
  categorical_encoders={'pair': pytorch_forecasting.data.encoders.NaNLabelEncoder(add_nan=True).fit(dataset.pair)},
)

Instead of just NaNLabelEncoder, is it possible to somehow apply one-hot encoding here?

dorienh avatar Dec 02 '23 11:12 dorienh

I know I can process it beforehand, but one of my variables will have about 170 different values, so then I'd need to add each of these 170 columns as time_varying_known_categoricals. This doesn't seem optimal.
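For reference, the "process it beforehand" route would look roughly like this (a sketch, assuming a pandas DataFrame data with a pair column):

import pandas as pd

# pd.get_dummies turns one categorical column into one 0/1 column per distinct
# value, i.e. ~170 extra columns in my case.
dummies = pd.get_dummies(data["pair"], prefix="pair", dtype=float)
data = pd.concat([data, dummies], axis=1)

# Every generated column then has to be registered with TimeSeriesDataSet,
# e.g. as time_varying_known_reals (or listed individually as categoricals).
one_hot_columns = list(dummies.columns)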

dorienh avatar Dec 03 '23 11:12 dorienh

@dorienh can you elaborate more? If a feature takes 170 different values you do not want one-hot encoding anyway; you want to use an embedding. If you have 170 different features, you can add them in a loop. An example would help clarify the problem.
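Something along these lines (a sketch, with hypothetical column names):

from pytorch_forecasting.data.encoders import NaNLabelEncoder

# Build the encoders for all categorical columns in a loop instead of one by one.
cat_cols = ["weekday", "month", "currency"]  # hypothetical feature names
encoders = {col: NaNLabelEncoder(add_nan=True) for col in cat_cols}
# then pass time_varying_known_categoricals=cat_cols and
# categorical_encoders=encoders to TimeSeriesDataSet(...)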

manitadayon avatar Dec 04 '23 01:12 manitadayon

Thanks for replying. True, I can use a loop. There will be multiple features like this. To give some examples:

  • I will have weekday and month. These should definitely be one-hot encoded, so I was hoping that, similar to defining the NaNLabelEncoder, I could use a one-hot encoder?
  • Other features will be Currency (170 different types, string format). Do I convert to one-hot manually first and then process them in the network with an embedding layer? Or is there a way to define a multi-embedding (or similar) directly within the time series loader?

Just to clarify, I am not asking about the model here, only about how best to define TimeSeriesDataSet(). Afterwards, in the model, I just extract both and process/embed as required:

network_input = torch.cat([x["encoder_cont"], x["encoder_cat"]], dim=2)

dorienh avatar Dec 04 '23 01:12 dorienh

No, you just include them as static categorical variables or time-varying categorical variables, and the package will perform the embedding automatically. There is no need for any manual one-hot encoding or embedding on your side. Just add them as static categoricals or time-varying categoricals in the TimeSeriesDataSet.
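A minimal sketch of that, with hypothetical column names; the built-in models derive the embedding sizes from the dataset metadata:

from pytorch_forecasting import TimeSeriesDataSet, TemporalFusionTransformer

training = TimeSeriesDataSet(
    data,                                  # hypothetical pandas DataFrame
    time_idx="time_idx",
    target="value",
    group_ids=["currency"],
    static_categoricals=["currency"],      # 170 values -> embedded automatically
    time_varying_known_categoricals=["weekday", "month"],
    time_varying_unknown_reals=["value"],
    max_encoder_length=30,
    max_prediction_length=7,
)

# Built-in models create one embedding layer per categorical from the dataset,
# so no one-hot encoding is needed on your side.
tft = TemporalFusionTransformer.from_dataset(training)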

manitadayon avatar Dec 05 '23 05:12 manitadayon

I am building my own fully custom model though, only using TimeSeriesDataSet(). When I look at x["encoder_cat"], it only has length 1 in the feature dimension, i.e. it seems to map the strings to integer codes in range(1, length).
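If it helps, the integer codes in x["encoder_cat"] could presumably be expanded to one-hot inside my own forward pass; a rough sketch, assuming a single categorical column 'pair' encoded as in my first post:

import torch
import torch.nn.functional as F

# x["encoder_cat"] has shape (batch, time, n_categoricals) and holds the
# integer codes produced by the NaNLabelEncoder.
codes = x["encoder_cat"][..., 0]  # codes for the 'pair' column
n_classes = len(self.training_dataset.categorical_encoders["pair"].classes_)

one_hot = F.one_hot(codes, num_classes=n_classes).float()  # (batch, time, n_classes)
network_input = torch.cat([x["encoder_cont"], one_hot], dim=2)
# alternatively, torch.nn.Embedding(n_classes, emb_dim) applied to `codes`
# would give a learned embedding instead of one-hot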

dorienh avatar Dec 05 '23 05:12 dorienh