
Invalid classes inferred from unique values of `y`.

Open balintbiro opened this issue 2 years ago • 4 comments

Hi All,

I am facing a problem with the mixture of LabelEncoder and XGBClassifier. Below is the reproducible example that causes the problem.

import string
import numpy as np
import pandas as pd
import xgboost
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

X=pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=['col1','col2','col3','col4'])
y=np.random.choice(a=list(string.ascii_uppercase),size=X.shape[0],replace=True)
encoder=LabelEncoder()
y=encoder.fit_transform(y)

X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=0)

clf=xgboost.XGBClassifier()
clf.fit(X_train,y_train)

Any idea why this training fails? The issue looks a bit similar to https://github.com/dmlc/xgboost/issues/9747, but there are no NaN values in y. In my opinion this is related to XGBoost, since other classifiers train on the same data without a problem. Thanks in advance!

balintbiro avatar Feb 27 '24 07:02 balintbiro

I set np.random.seed(0) and reproduced the error. XGBoost requires encoded labels: the label values must start at 0 and end at n_classes - 1. In your example, np.unique(y_train) gives:

[ 0  1  2  3  4  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]

As one can see, the labels are not contiguous (5 is missing). The fix is to fit the label encoder on the training data instead. A second suggestion: since the labels are discrete classes, consider train_test_split(X, y, stratify=y) so the classes are distributed proportionally across the splits.

trivialfis avatar Feb 27 '24 17:02 trivialfis

While this is a very reasonable thing to require from users, accepting arbitrary label values seems to be a requirement for full scikit-learn compatibility according to their docs and tests: https://scikit-learn.org/stable/developers/develop.html#specific-models

david-cortes avatar Feb 28 '24 19:02 david-cortes

I see what you mean. Thank y'all for the answers!

balintbiro avatar Mar 01 '24 08:03 balintbiro

I also agree this is too restrictive. All other sklearn models handle this fine, and the case can come up when running cross_val_score with XGBoost: even with stratified splitting, a fold can still miss one or more classes.
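The cross-validation failure mode described here is easy to reproduce without XGBoost: with stratified K-fold splitting, any class rarer than the number of splits is still absent from some training folds. A small illustrative sketch:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Five samples each of classes 0 and 1, but only one sample of class 2.
y = np.array([0] * 5 + [1] * 5 + [2])
X = np.zeros((len(y), 1))

missing = []
for train_idx, _ in StratifiedKFold(n_splits=2).split(X, y):
    fold_classes = set(np.unique(y[train_idx]))
    missing.append(2 not in fold_classes)

# The single class-2 sample lands in exactly one fold, so the other fold's
# training portion never sees class 2; encoding y globally would then hand
# XGBoost a non-contiguous label set for that fold.
print(missing)
```

scikit-learn emits a "least populated class" warning here rather than an error, which is why the problem only surfaces once XGBoost rejects the non-contiguous labels.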

fpt-ian avatar Apr 12 '24 05:04 fpt-ian