Error running demo/guide-python/categorical.py: ValueError: Invalid classes inferred from unique values of `y`. Expected: [0 1 2], got [ 0. 1. nan]
xgboost 1.6.0+
Run this file: https://github.com/dmlc/xgboost/blob/master/demo/guide-python/categorical.py
Output:
train data set has got 143246 rows and 25 columns
train data set has got 143246 rows and 24 columns
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-1-761739d4be69> in <cell line: 115>()
118 with TemporaryDirectory() as tmpdir:
119 start = time()
--> 120 categorical_model(X, y, tmpdir)
121 end = time()
122 print("Duration:categorical", end - start)
/usr/local/lib/python3.10/dist-packages/xgboost/sklearn.py in fit(self, X, y, sample_weight, base_margin, eval_set, eval_metric, early_stopping_rounds, verbose, xgb_model, sample_weight_eval_set, base_margin_eval_set, feature_weights, callbacks)
1355 or not (self.classes_ == expected_classes).all()
1356 ):
-> 1357 raise ValueError(
1358 f"Invalid classes inferred from unique values of `y`. "
1359 f"Expected: {expected_classes}, got {self.classes_}"
ValueError: Invalid classes inferred from unique values of `y`. Expected: [0 1 2], got [ 0. 1. nan]
That's weird; the labels generated by the example script contain NaN. I don't know the cause. Could you please share your numpy/pandas versions?
XGBoost version: 2.0.0
np version: 1.23.5
pd version: 1.5.3
Unfortunately, I can't reproduce the issue. We are running it on the CI as well: https://github.com/dmlc/xgboost/blob/be20df8c23c063f9b5ff242e66c29ebd66578ca6/tests/python/test_demos.py#L26 .
Feel free to reopen if there is more information or a reproducible environment.
Hi guys, I'm facing the same issue, and it seems I found a reproducible example:
import string

import numpy as np
import pandas as pd
import xgboost
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

X = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=['col1', 'col2', 'col3', 'col4'])
y = np.random.choice(a=list(string.ascii_uppercase), size=X.shape[0], replace=True)
encoder = LabelEncoder()
y = encoder.fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = xgboost.XGBClassifier()
clf.fit(X_train, y_train)
That terminates with the exact same error @dvmorris mentioned. If you run a simple value_counts() on the y_train pd.Series, you will see that some of the encoded labels are missing. However, I don't know whether this is the cause.
xgboost.__version__: 1.7.5, sklearn.__version__: 1.2.2
Yes, if there are missing values in y, this would be the error.
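A quick way to see where the `[ 0. 1. nan]` in the error message comes from (a sketch using plain NumPy, not XGBoost itself):

```python
import numpy as np

# A label array with a single missing value: the float NaN forces the whole
# array to float dtype, and np.unique keeps NaN as one of the "classes",
# which is exactly the [ 0.  1. nan] shown in the ValueError.
y = np.array([0, 1, np.nan, 1, 0])
print(np.unique(y))  # [ 0.  1. nan]
```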
Sorry, maybe I wasn't using the right expression: there are no missing values in y, but some of the encoded labels are not present in the training set. Or is that a different issue? 🤔
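If that's the situation, one workaround is to re-encode the labels after the split, so the training fold again covers 0..k-1. A sketch with hypothetical labels (class 2 never appears in y_train, leaving a gap):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical training labels where class 2 is absent. XGBClassifier infers
# the expected classes from y, so a gap like [0, 1, 3] raises the ValueError.
y_train = np.array([0, 1, 3, 3, 1, 0])

# Re-encoding the observed labels maps them back onto consecutive 0..k-1.
enc = LabelEncoder()
y_fit = enc.fit_transform(y_train)
print(y_fit)  # [0 1 2 2 1 0]
```

Note that predictions then come back in the re-encoded space, so keep the encoder around and map them back with `enc.inverse_transform`.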
I've run into the same problem. What I did was roll back to xgboost==1.5.0, and everything seems to be working. I'm now getting this warning:
UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1]. warnings.warn(label_encoder_deprecation_msg, UserWarning)
I will try to upgrade to xgboost==2.0.3 again and figure out the warning message there.
Edit: the error was due to my dataset labels starting from 1 instead of 0. I added the code below to make them start from 0; it seems to be working fine now.
# Map each distinct label in the last column to a consecutive integer starting at 0.
unique_classes = df.iloc[:, -1].unique()
class_mapping = {class_label: idx for idx, class_label in enumerate(unique_classes)}
df.iloc[:, -1] = df.iloc[:, -1].map(class_mapping)
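`pd.factorize` does the same remapping in one step. A sketch on a toy frame, assuming (like the snippet above) that the last column holds the labels:

```python
import pandas as pd

# Toy frame whose labels start at 1, mirroring the situation described above.
df = pd.DataFrame({"f0": [0.1, 0.2, 0.3], "label": [1, 2, 3]})

# factorize assigns consecutive integer codes starting at 0 and also
# returns the original values, should you need to invert the mapping later.
codes, uniques = pd.factorize(df.iloc[:, -1])
df.iloc[:, -1] = codes
print(df["label"].tolist())  # [0, 1, 2]
```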
This is crazy; I'm not able to use xgboost because of this problem.