pytorch_tabular
ValueError: y contains previously unseen labels
Describe the bug
I have two models: tabular_binary_model is trained on binay_label, and tabular_multi_cls_model is trained on label.
Both the label and binay_label columns are present in df_test.
I ran the code below:
tabular_binary_model = TabularModel.load_model("gandalf_emb_exp_22_3_binary_010")
df_pred = tabular_binary_model.predict(df_test)
tabular_multi_cls_model = TabularModel.load_model("gandalf_exp_22_1")
df_multi_pred = tabular_multi_cls_model.predict(df_test)
The second predict call raises the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[27], line 1
----> 1 df_multi_pred = tabular_multi_cls_model.predict(df_test)
File ~/.local/lib/python3.8/site-packages/pytorch_tabular/tabular_model.py:1514, in TabularModel.predict(self, test, quantiles, n_samples, ret_logits, include_input_features, device, progress_bar, test_time_augmentation, num_tta, alpha_tta, aggregate_tta, tta_seed)
1512 handle.remove()
1513 else:
-> 1514 pred_df = self._predict(
1515 test,
1516 quantiles,
1517 n_samples,
1518 ret_logits,
1519 include_input_features,
1520 device,
1521 progress_bar,
1522 )
1523 return pred_df
File ~/.local/lib/python3.8/site-packages/pytorch_tabular/tabular_model.py:1372, in TabularModel._predict(self, test, quantiles, n_samples, ret_logits, include_input_features, device, progress_bar)
1370 model = self.model.to(device)
1371 model.eval()
-> 1372 inference_dataloader = self.datamodule.prepare_inference_dataloader(test)
1373 is_probabilistic = hasattr(model.hparams, "_probabilistic") and model.hparams._probabilistic
1375 if progress_bar == "rich":
File ~/.local/lib/python3.8/site-packages/pytorch_tabular/tabular_datamodule.py:861, in TabularDatamodule.prepare_inference_dataloader(self, df, batch_size, copy_df)
859 if copy_df:
860 df = df.copy()
--> 861 df = self._prepare_inference_data(df)
862 dataset = TabularDataset(
863 task=self.config.task,
864 data=df,
(...)
867 target=(self.target if all(col in df.columns for col in self.target) else None),
868 )
869 return DataLoader(
870 dataset,
871 batch_size or self.batch_size,
(...)
874 **self.config.dataloader_kwargs,
875 )
File ~/.local/lib/python3.8/site-packages/pytorch_tabular/tabular_datamodule.py:843, in TabularDatamodule._prepare_inference_data(self, df)
841 else:
842 df.loc[:, self.target] = np.zeros((len(df), len(self.target)))
--> 843 df, _ = self.preprocess_data(df, stage="inference")
844 return df
File ~/.local/lib/python3.8/site-packages/pytorch_tabular/tabular_datamodule.py:463, in TabularDatamodule.preprocess_data(self, data, stage)
461 data = self._normalize_continuous_columns(data, stage)
462 # Converting target labels to a 0 indexed label
--> 463 data = self._label_encode_target(data, stage)
464 # Target Transforms
465 data = self._target_transform(data, stage)
File ~/.local/lib/python3.8/site-packages/pytorch_tabular/tabular_datamodule.py:404, in TabularDatamodule._label_encode_target(self, data, stage)
402 for i in range(len(self.config.target)):
403 if self.config.target[i] in data.columns:
--> 404 data[self.config.target[i]] = self.label_encoder[i].transform(data[self.config.target[i]])
405 return data
File /usr/local/lib/python3.8/dist-packages/sklearn/preprocessing/_label.py:137, in LabelEncoder.transform(self, y)
134 if _num_samples(y) == 0:
135 return np.array([])
--> 137 return _encode(y, uniques=self.classes_)
File /usr/local/lib/python3.8/dist-packages/sklearn/utils/_encode.py:232, in _encode(values, uniques, check_unknown)
230 diff = _check_unknown(values, uniques)
231 if diff:
--> 232 raise ValueError(f"y contains previously unseen labels: {str(diff)}")
233 return np.searchsorted(uniques, values)
ValueError: y contains previously unseen labels: [1.0]
I have checked that label is not in continuous_cols.
If I drop the label column from df_test before calling predict, it works.
Desktop (please complete the following information):
- OS: Amazon Linux
- Browser chrome
Hello @syuu1987! I'm here to help you with your bug and answer any questions you might have. While you're waiting for a human maintainer, feel free to ask me anything!
The error you're encountering, "ValueError: y contains previously unseen labels," is due to the LabelEncoder in the TabularDatamodule class encountering labels during inference that were not present during training. The LabelEncoder does not handle unseen labels by default, which results in this error [1].
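The encoder behavior described above can be reproduced in isolation. The following is a minimal sketch using scikit-learn's LabelEncoder directly; the training labels [0, 2, 3] are a hypothetical stand-in, since the actual classes seen by the multi-class model are not given in this issue.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical training labels (assumption): class 1 never appears during fit.
encoder = LabelEncoder().fit([0, 2, 3])

# Labels seen during fit are mapped to 0-indexed integers.
seen = encoder.transform([0, 3])

# A label that is absent from encoder.classes_ raises the same kind of
# ValueError that appears at the bottom of the traceback.
try:
    encoder.transform([1.0])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(seen.tolist())
print(error_message)
```

The traceback shows the same transform call being made from _label_encode_target whenever the target column is present in the DataFrame passed to predict, which is consistent with the error disappearing once the label column is dropped.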
To resolve this issue, you can consider the following approaches:
- Ensure Consistent Labels: Make sure that the labels in your test data (df_test) are consistent with those seen during training. This means that all labels present in the test data should have been present in the training data.
- Handle Unseen Labels: Modify the behavior of the LabelEncoder to handle unseen labels. You can do this by configuring the handle_unseen parameter in the BaseEncoder class to options like 'ignore' or 'impute' instead of 'error'. This will allow the model to skip or impute unseen labels rather than raising an error [2].
- Drop Unnecessary Columns: If the 'label' column is not needed for the prediction, as you've already tried, you can drop it from the test DataFrame before making predictions.
By implementing one of these solutions, you should be able to avoid the error and proceed with your predictions.
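As a rough illustration of the first and third options, the sketch below checks which labels in a test frame were not seen during training and then drops the target column before inference. The helper names unseen_labels and without_targets, the toy df_test, and the known_classes values are all assumptions made for this example; they are not part of pytorch_tabular.

```python
import pandas as pd


def unseen_labels(labels: pd.Series, known_classes) -> list:
    """Return the distinct values in `labels` that are not in `known_classes`."""
    known = set(known_classes)
    values = labels.dropna().unique().tolist()
    return sorted(v for v in values if v not in known)


def without_targets(df: pd.DataFrame, targets) -> pd.DataFrame:
    """Return a copy of `df` without any of the target columns it contains."""
    return df.drop(columns=[c for c in targets if c in df.columns])


# Toy stand-in for df_test (assumption): both target columns are present.
df_test = pd.DataFrame(
    {
        "feature": [0.1, 0.2, 0.3],
        "label": [0.0, 1.0, 2.0],
        "binay_label": [0, 1, 1],
    }
)

# Hypothetical classes seen when the multi-class model was trained.
known_classes = [0, 2, 3]

# Diagnose: which test labels would the fitted encoder reject?
print(unseen_labels(df_test["label"], known_classes))

# Workaround: remove the target column before calling predict.
df_inference = without_targets(df_test, ["label"])
print(list(df_inference.columns))
```

With the real model, known_classes would presumably come from tabular_multi_cls_model.datamodule.label_encoder[0].classes_ (attribute names taken from the traceback), and the workaround becomes tabular_multi_cls_model.predict(without_targets(df_test, ["label"])).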
This does look like a minor bug. Can you share a reproducible, self-contained example? I'll check it out.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.