Error running demo/guide-python/categorical.py: ValueError: Invalid classes inferred from unique values of `y`. Expected: [0 1 2], got [ 0. 1. nan]
xgboost 1.6.0+
Run this file: https://github.com/dmlc/xgboost/blob/master/demo/guide-python/categorical.py
Output:
train data set has got 143246 rows and 25 columns
train data set has got 143246 rows and 24 columns
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-1-761739d4be69> in <cell line: 115>()
118 with TemporaryDirectory() as tmpdir:
119 start = time()
--> 120 categorical_model(X, y, tmpdir)
121 end = time()
122 print("Duration:categorical", end - start)
/usr/local/lib/python3.10/dist-packages/xgboost/sklearn.py in fit(self, X, y, sample_weight, base_margin, eval_set, eval_metric, early_stopping_rounds, verbose, xgb_model, sample_weight_eval_set, base_margin_eval_set, feature_weights, callbacks)
1355 or not (self.classes_ == expected_classes).all()
1356 ):
-> 1357 raise ValueError(
1358 f"Invalid classes inferred from unique values of `y`. "
1359 f"Expected: {expected_classes}, got {self.classes_}"
ValueError: Invalid classes inferred from unique values of `y`. Expected: [0 1 2], got [ 0. 1. nan]
That's weird; the labels generated by the example script contain NaN. I don't know the cause. Could you please share your numpy/pandas versions?
XGBoost version: 2.0.0
np version: 1.23.5
pd version: 1.5.3
Unfortunately, I can't reproduce the issue. We are running it on the CI as well: https://github.com/dmlc/xgboost/blob/be20df8c23c063f9b5ff242e66c29ebd66578ca6/tests/python/test_demos.py#L26 .
Feel free to reopen if there is more information or a reproducible environment.
Hi guys, I'm facing the same issue, and it seems I found a reproducible example:
import string

import numpy as np
import pandas as pd
import xgboost
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

X = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=['col1', 'col2', 'col3', 'col4'])
y = np.random.choice(a=list(string.ascii_uppercase), size=X.shape[0], replace=True)
encoder = LabelEncoder()
y = encoder.fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = xgboost.XGBClassifier()
clf.fit(X_train, y_train)
That terminates with the exact same error @dvmorris mentioned. If you run a simple value_counts() on the y_train pd.Series, you will see that some of the encoded labels are missing. However, I don't know whether this is the cause.
xgboost.__version__: 1.7.5, sklearn.__version__: 1.2.2
Yes, if there are missing values in y, this would be the error.
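A quick way to see where the `[ 0. 1. nan]` in the error message comes from (a sketch using plain NumPy, not XGBoost itself):

```python
import numpy as np

# A label array with a single missing value: the float NaN forces the whole
# array to float dtype, and np.unique keeps NaN as one of the "classes",
# which is exactly the [ 0.  1. nan] shown in the ValueError.
y = np.array([0, 1, np.nan, 1, 0])
print(np.unique(y))  # [ 0.  1. nan]
```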
Sorry, maybe I wasn't using the right expression: there are no missing values in y, but some of the encoded labels are not present in the training set. Or is that a different issue? 🤔
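If that's the situation, one workaround is to re-encode the labels after the split, so the training fold again covers 0..k-1. A sketch with hypothetical labels (class 2 never appears in y_train, leaving a gap):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical training labels where class 2 is absent. XGBClassifier infers
# the expected classes from y, so a gap like [0, 1, 3] raises the ValueError.
y_train = np.array([0, 1, 3, 3, 1, 0])

# Re-encoding the observed labels maps them back onto consecutive 0..k-1.
enc = LabelEncoder()
y_fit = enc.fit_transform(y_train)
print(y_fit)  # [0 1 2 2 1 0]
```

Note that predictions then come back in the re-encoded space, so keep the encoder around and map them back with `enc.inverse_transform`.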
I've run into the same problem. What I did was roll back to xgboost==1.5.0, and everything seems to be working. I'm now getting this warning:
UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1]. warnings.warn(label_encoder_deprecation_msg, UserWarning)
I will try to upgrade to xgboost==2.0.3 again and figure out the warning message there.
Edit: the error was due to my dataset labels starting from 1 instead of 0. I added the code below to make them start from 0; it seems to be working fine now.
# Map each distinct label in the last column to a consecutive integer starting at 0.
unique_classes = df.iloc[:, -1].unique()
class_mapping = {class_label: idx for idx, class_label in enumerate(unique_classes)}
df.iloc[:, -1] = df.iloc[:, -1].map(class_mapping)
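`pd.factorize` does the same remapping in one step. A sketch on a toy frame, assuming (like the snippet above) that the last column holds the labels:

```python
import pandas as pd

# Toy frame whose labels start at 1, mirroring the situation described above.
df = pd.DataFrame({"f0": [0.1, 0.2, 0.3], "label": [1, 2, 3]})

# factorize assigns consecutive integer codes starting at 0 and also
# returns the original values, should you need to invert the mapping later.
codes, uniques = pd.factorize(df.iloc[:, -1])
df.iloc[:, -1] = codes
print(df["label"].tolist())  # [0, 1, 2]
```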
This is crazy; I'm not able to use xgboost because of this problem.