
Using ``int`` target values instead of ``float`` caused unexpected error: KeyError: "Unknown category '38' encountered. Set `add_nan=True` to allow unknown categories"

Open geronimos opened this issue 3 years ago • 0 comments

  • PyTorch-Forecasting version: 0.10.1
  • PyTorch version: 1.11.0
  • Python version: 3.9.12
  • Operating System: Ubuntu 20.04 (WSL 2 on Windows 10)

Expected behaviour

I tried to create a TimeSeriesDataSet and DataLoaders from a simple DataFrame filled with dummy data, as presented in the tutorial, but with int target values instead of float. I expected this to work just as it does with float values.

Actual behaviour

Instead, creating the validation dataset with TimeSeriesDataSet.from_dataset raised a KeyError:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/geronimos/pytorch-forecasting/pytorch_forecasting/data/encoders.py:132, in NaNLabelEncoder.transform(self, y, return_norm, target_scale, ignore_na)
    131 try:
--> 132     encoded = [self.classes_[v] for v in y]
    133 except KeyError as e:

File ~/geronimos/pytorch-forecasting/pytorch_forecasting/data/encoders.py:132, in <listcomp>(.0)
    131 try:
--> 132     encoded = [self.classes_[v] for v in y]
    133 except KeyError as e:

KeyError: 38

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
/home/bergk/geronimos/pytorch-forecasting/docs/source/tutorials/bug-report.ipynb Cell 10' in <cell line: 22>()
      8 prediction_length = max_prediction_length
     10 training = TimeSeriesDataSet(
     11     data[lambda x: x.time_idx <= training_cutoff],
     12     time_idx="time_idx",
   (...)
     19     max_prediction_length=prediction_length,
     20 )
---> 22 validation = TimeSeriesDataSet.from_dataset(training, data, min_prediction_idx=training_cutoff + 1)
     23 batch_size = 128
     24 train_dataloader = training.to_dataloader(train=True, batch_size=batch_size, num_workers=0)

File ~/geronimos/pytorch-forecasting/pytorch_forecasting/data/timeseries.py:1112, in TimeSeriesDataSet.from_dataset(cls, dataset, data, stop_randomization, predict, **update_kwargs)
   1091 @classmethod
   1092 def from_dataset(
   1093     cls, dataset, data: pd.DataFrame, stop_randomization: bool = False, predict: bool = False, **update_kwargs
   1094 ):
   1095     """
   1096     Generate dataset with different underlying data but same variable encoders and scalers, etc.
   1097
   (...)
   1110         TimeSeriesDataSet: new dataset
   1111     """
-> 1112     return cls.from_parameters(
   1113         dataset.get_parameters(), data, stop_randomization=stop_randomization, predict=predict, **update_kwargs
   1114     )

File ~/geronimos/pytorch-forecasting/pytorch_forecasting/data/timeseries.py:1158, in TimeSeriesDataSet.from_parameters(cls, parameters, data, stop_randomization, predict, **update_kwargs)
   1155     parameters["randomize_length"] = None
   1156 parameters.update(update_kwargs)
-> 1158 new = cls(data, **parameters)
   1159 return new

File ~/geronimos/pytorch-forecasting/pytorch_forecasting/data/timeseries.py:434, in TimeSeriesDataSet.__init__(self, data, time_idx, target, group_ids, weight, max_encoder_length, min_encoder_length, min_prediction_idx, min_prediction_length, max_prediction_length, static_categoricals, static_reals, time_varying_known_categoricals, time_varying_known_reals, time_varying_unknown_categoricals, time_varying_unknown_reals, variable_groups, constant_fill_strategy, allow_missing_timesteps, lags, add_relative_time_idx, add_target_scales, add_encoder_length, target_normalizer, categorical_encoders, scalers, randomize_length, predict_mode)
    431 data = data.sort_values(self.group_ids + [self.time_idx])
    433 # preprocess data
--> 434 data = self._preprocess_data(data)
    435 for target in self.target_names:
    436     assert target not in self.scalers, "Target normalizer is separate and not in scalers."

File ~/geronimos/pytorch-forecasting/pytorch_forecasting/data/timeseries.py:747, in TimeSeriesDataSet._preprocess_data(self, data)
    744             data[f"__target__{target}"] = data[target]
    746 elif isinstance(self.target_normalizer, NaNLabelEncoder):
--> 747     data[self.target] = self.target_normalizer.transform(data[self.target])
    748     # overwrite target because it requires encoding (continuous targets should not be normalized)
    749     data[f"__target__{self.target}"] = data[self.target]

File ~/geronimos/pytorch-forecasting/pytorch_forecasting/data/encoders.py:134, in NaNLabelEncoder.transform(self, y, return_norm, target_scale, ignore_na)
    132             encoded = [self.classes_[v] for v in y]
    133         except KeyError as e:
--> 134             raise KeyError(
    135                 f"Unknown category '{e.args[0]}' encountered. Set `add_nan=True` to allow unknown categories"
    136             )
    138 if isinstance(y, torch.Tensor):
    139     encoded = torch.tensor(encoded, dtype=torch.long, device=y.device)

KeyError: "Unknown category '38' encountered. Set `add_nan=True` to allow unknown categories"

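I think it has to do with the data type of the target values, because I was using int instead of float. The traceback shows the target being passed through NaNLabelEncoder.transform, so the int column seems to be treated as a categorical target, and values that only occur in the validation data then count as unknown categories.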
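
One way to probe this with pandas and NumPy alone is to check the target dtype and list the integer values that appear in the full frame but not in the training slice. If the integer target is label-encoded like a category (as the NaNLabelEncoder frames in the traceback suggest), such values would be unknown to an encoder fitted on the training slice. This is only a diagnostic sketch: the fixed seed and the names `train`, `seen` and `unseen` are illustrative assumptions and do not come from the library.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, only so the sketch is repeatable
data = pd.DataFrame(
    dict(
        value=rng.integers(0, 100, size=30),  # integer target, as in the report
        group=np.repeat(np.arange(3), 10),
        time_idx=np.tile(np.arange(10), 3),
    )
)

max_prediction_length = 2
training_cutoff = data["time_idx"].max() - max_prediction_length
train = data[data.time_idx <= training_cutoff]

# An integer dtype here is the condition described in the report.
print("target dtype:", data["value"].dtype)

# Values present in the full frame but absent from the training slice.
seen = set(train["value"])
unseen = sorted(set(data["value"]) - seen)
print("values not seen during training:", unseen)
```

Any value printed in the second line would be a candidate for the "Unknown category" message when the validation dataset is built from the full frame.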

Code to reproduce the problem

# Compare to: https://pytorch-forecasting.readthedocs.io/en/stable/tutorials/building.html#Passing-data-to-a-model
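import numpy as np
import pandas as pd
from pytorch_forecasting import TimeSeriesDataSet
from pytorch_forecasting.data import NaNLabelEncoder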

data = pd.DataFrame(
    dict(
        # Create integer values instead of float (the tutorial uses np.random.rand(30) - 0.5)
        value=np.random.randint(100, size=30),
        group=np.repeat(np.arange(3), 10),
        time_idx=np.tile(np.arange(10), 3),
    )
)

# Compare to: https://pytorch-forecasting.readthedocs.io/en/latest/tutorials/ar.html
# create dataset and dataloaders
max_encoder_length = 5
max_prediction_length = 2

training_cutoff = data["time_idx"].max() - max_prediction_length

context_length = max_encoder_length
prediction_length = max_prediction_length

training = TimeSeriesDataSet(
    data[lambda x: x.time_idx <= training_cutoff],
    time_idx="time_idx",
    target="value",
    group_ids=["group"],
    categorical_encoders={"group": NaNLabelEncoder(add_nan=True).fit(data.group)},
    # only unknown variable is "value" - and N-Beats can also not take any additional variables
    time_varying_unknown_reals=["value"],
    max_encoder_length=context_length,
    max_prediction_length=prediction_length,
)

validation = TimeSeriesDataSet.from_dataset(training, data, min_prediction_idx=training_cutoff + 1)
batch_size = 128
train_dataloader = training.to_dataloader(train=True, batch_size=batch_size, num_workers=0)
val_dataloader = validation.to_dataloader(train=False, batch_size=batch_size, num_workers=0)
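
A possible workaround sketch, assuming the integer dtype of the target is indeed the trigger: cast the target column to float before constructing the datasets, so that it matches the float data used in the tutorial. The block below shows only the cast with pandas; whether it fully resolves the error in version 0.10.1 is not confirmed here.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    dict(
        value=np.random.randint(100, size=30),  # integer target, as in the report
        group=np.repeat(np.arange(3), 10),
        time_idx=np.tile(np.arange(10), 3),
    )
)

# Cast the integer target to float before passing the frame to TimeSeriesDataSet.
data["value"] = data["value"].astype("float64")
print(data.dtypes)
```

The rest of the reproduction code would then be run unchanged on the converted frame.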

A similar issue has already been discussed here: https://stackoverflow.com/questions/71098518/unknown-category-2-encountered-set-add-nan-true-to-allow-unknown-categories

Here is a colab snippet showing the error: https://colab.research.google.com/drive/1uw-W6SGBLHQF3JQYwpeHS8sPoPY6ZaRP?usp=sharing

geronimos, Apr 15 '22 21:04