pytorch_tabular

ValueError: y contains previously unseen labels

Open syuu1987 opened this issue 7 months ago • 2 comments

Describe the bug: I have two models. tabular_binary_model was trained on the binay_label column, and tabular_multi_cls_model was trained on the label column. Both label and binay_label are present in df_test.

I ran the code below:

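from pytorch_tabular import TabularModel

# Binary model: loads and predicts without error
tabular_binary_model = TabularModel.load_model("gandalf_emb_exp_22_3_binary_010")
df_pred = tabular_binary_model.predict(df_test)
# Multi-class model: predict() raises the ValueError shown below
tabular_multi_cls_model = TabularModel.load_model("gandalf_exp_22_1")
df_multi_pred = tabular_multi_cls_model.predict(df_test)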

I got this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[27], line 1
----> 1 df_multi_pred = tabular_multi_cls_model.predict(df_test)

File ~/.local/lib/python3.8/site-packages/pytorch_tabular/tabular_model.py:1514, in TabularModel.predict(self, test, quantiles, n_samples, ret_logits, include_input_features, device, progress_bar, test_time_augmentation, num_tta, alpha_tta, aggregate_tta, tta_seed)
   1512     handle.remove()
   1513 else:
-> 1514     pred_df = self._predict(
   1515         test,
   1516         quantiles,
   1517         n_samples,
   1518         ret_logits,
   1519         include_input_features,
   1520         device,
   1521         progress_bar,
   1522     )
   1523 return pred_df

File ~/.local/lib/python3.8/site-packages/pytorch_tabular/tabular_model.py:1372, in TabularModel._predict(self, test, quantiles, n_samples, ret_logits, include_input_features, device, progress_bar)
   1370         model = self.model.to(device)
   1371 model.eval()
-> 1372 inference_dataloader = self.datamodule.prepare_inference_dataloader(test)
   1373 is_probabilistic = hasattr(model.hparams, "_probabilistic") and model.hparams._probabilistic
   1375 if progress_bar == "rich":

File ~/.local/lib/python3.8/site-packages/pytorch_tabular/tabular_datamodule.py:861, in TabularDatamodule.prepare_inference_dataloader(self, df, batch_size, copy_df)
    859 if copy_df:
    860     df = df.copy()
--> 861 df = self._prepare_inference_data(df)
    862 dataset = TabularDataset(
    863     task=self.config.task,
    864     data=df,
   (...)
    867     target=(self.target if all(col in df.columns for col in self.target) else None),
    868 )
    869 return DataLoader(
    870     dataset,
    871     batch_size or self.batch_size,
   (...)
    874     **self.config.dataloader_kwargs,
    875 )

File ~/.local/lib/python3.8/site-packages/pytorch_tabular/tabular_datamodule.py:843, in TabularDatamodule._prepare_inference_data(self, df)
    841     else:
    842         df.loc[:, self.target] = np.zeros((len(df), len(self.target)))
--> 843 df, _ = self.preprocess_data(df, stage="inference")
    844 return df

File ~/.local/lib/python3.8/site-packages/pytorch_tabular/tabular_datamodule.py:463, in TabularDatamodule.preprocess_data(self, data, stage)
    461     data = self._normalize_continuous_columns(data, stage)
    462 # Converting target labels to a 0 indexed label
--> 463 data = self._label_encode_target(data, stage)
    464 # Target Transforms
    465 data = self._target_transform(data, stage)

File ~/.local/lib/python3.8/site-packages/pytorch_tabular/tabular_datamodule.py:404, in TabularDatamodule._label_encode_target(self, data, stage)
    402     for i in range(len(self.config.target)):
    403         if self.config.target[i] in data.columns:
--> 404             data[self.config.target[i]] = self.label_encoder[i].transform(data[self.config.target[i]])
    405 return data

File /usr/local/lib/python3.8/dist-packages/sklearn/preprocessing/_label.py:137, in LabelEncoder.transform(self, y)
    134 if _num_samples(y) == 0:
    135     return np.array([])
--> 137 return _encode(y, uniques=self.classes_)

File /usr/local/lib/python3.8/dist-packages/sklearn/utils/_encode.py:232, in _encode(values, uniques, check_unknown)
    230     diff = _check_unknown(values, uniques)
    231     if diff:
--> 232         raise ValueError(f"y contains previously unseen labels: {str(diff)}")
    233 return np.searchsorted(uniques, values)

ValueError: y contains previously unseen labels: [1.0]

I have checked that label is not in continuous_cols. If I drop the label column from df_test before calling predict, it works.
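For reference, the workaround of dropping the label columns before predicting looks roughly like this. It is a minimal sketch: the helper drop_target_columns and the feature_a column are illustrative names and not part of pytorch_tabular, the toy DataFrame only stands in for df_test, and the predict call is left as a comment because it needs the saved model.

```python
import pandas as pd


def drop_target_columns(df: pd.DataFrame, target_cols: list) -> pd.DataFrame:
    """Return a copy of df without the listed target columns (missing ones are skipped)."""
    present = [col for col in target_cols if col in df.columns]
    return df.drop(columns=present)


# Toy stand-in for df_test; feature_a is a hypothetical feature column.
df_test = pd.DataFrame(
    {"feature_a": [0.1, 0.2, 0.3], "label": [1.0, 0.0, 2.0], "binay_label": [1, 0, 1]}
)

# The original frame is left untouched; only the copy loses the label columns.
features_only = drop_target_columns(df_test, ["label", "binay_label"])
print(list(features_only.columns))

# With the real model loaded, prediction would then be:
# df_multi_pred = tabular_multi_cls_model.predict(features_only)
```

Dropping both label columns means the target encoder is never applied to values taken from df_test, which is consistent with the error disappearing.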

Desktop:

  • OS: Amazon Linux
  • Browser: Chrome

syuu1987 avatar May 08 '25 01:05 syuu1987

Hello @syuu1987! I'm here to help you with your bug and answer any questions you might have. While you're waiting for a human maintainer, feel free to ask me anything!

The error you're encountering, "ValueError: y contains previously unseen labels," is due to the LabelEncoder in the TabularDatamodule class encountering labels during inference that were not present during training. The LabelEncoder does not handle unseen labels by default, which results in this error [1].

To resolve this issue, you can consider the following approaches:

  1. Ensure Consistent Labels: Make sure that the labels in your test data (df_test) are consistent with those seen during training. This means that all labels present in the test data should have been present in the training data.

  2. Handle Unseen Labels: Modify the behavior of the LabelEncoder to handle unseen labels. You can do this by configuring the handle_unseen parameter in the BaseEncoder class to options like 'ignore' or 'impute' instead of 'error'. This will allow the model to skip or impute unseen labels rather than raising an error [2].

  3. Drop Unnecessary Columns: If the 'label' column is not needed for the prediction, as you've already tried, you can drop it from the test DataFrame before making predictions.

By implementing one of these solutions, you should be able to avoid the error and proceed with your predictions.
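The failure mode can be reproduced with scikit-learn alone, without the saved models. The sketch below assumes a toy encoder fitted on the integers 0, 2 and 3; the helper find_unseen_labels is a hypothetical name introduced here, not a scikit-learn or pytorch_tabular function. It lists the values the encoder has never seen, then shows that LabelEncoder.transform raises on the same input.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder


def find_unseen_labels(encoder, values):
    """Return the sorted unique values that are absent from encoder.classes_."""
    return np.setdiff1d(np.asarray(values), encoder.classes_)


# Toy encoder: fitted on labels 0, 2 and 3, so 1 is unknown to it.
encoder = LabelEncoder().fit([0, 2, 3])
test_labels = [0, 1, 2]

unseen = find_unseen_labels(encoder, test_labels)
print("unseen labels:", unseen.tolist())

# transform() raises ValueError when any value is missing from classes_.
try:
    encoder.transform(test_labels)
    message = ""
except ValueError as exc:
    message = str(exc)
print(message)
```

Judging from the traceback, the fitted target encoders appear to be stored on the datamodule as label_encoder, so comparing df_test["label"].unique() with tabular_multi_cls_model.datamodule.label_encoder[0].classes_ (both the values and the dtype) may show where the unseen value [1.0] comes from.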

dosubot[bot] avatar May 08 '25 01:05 dosubot[bot]

This does look like a minor bug. Can you share a reproducible, self-contained example? I'll check it out.

manujosephv avatar May 16 '25 04:05 manujosephv

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jul 15 '25 04:07 stale[bot]