Invalid classes inferred from unique values of `y`.
Hi All,
I am facing a problem with the combination of LabelEncoder and XGBClassifier. Below is a minimal reproducible example.
import string

import numpy as np
import pandas as pd
import xgboost
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

X = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)),
                 columns=['col1', 'col2', 'col3', 'col4'])
y = np.random.choice(a=list(string.ascii_uppercase), size=X.shape[0], replace=True)

encoder = LabelEncoder()
y = encoder.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = xgboost.XGBClassifier()
clf.fit(X_train, y_train)
Any ideas why training fails here? This issue looks a bit similar to https://github.com/dmlc/xgboost/issues/9747, but there are no NaN values in y. I believe this is specific to XGBoost, since other classifiers train on the same data without a problem. Thanks in advance!
I set np.random.seed(0) and reproduced the error. XGBoost requires encoded labels, meaning the labels must start at 0 and end at n_classes - 1. In your example, np.unique(y_train) gives:
[ 0 1 2 3 4 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]
As one can see, the labels are not contiguous (5 and 25 are missing, because those classes ended up entirely in the test split). The fix is to fit the label encoder on the training data instead. A second point: since the labels are discrete classes, you might consider train_test_split(X, y, stratify=y) so the classes are distributed proportionally across the splits.
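To make the fix concrete, here is a minimal sketch (the label values are made up for illustration) showing why fitting the encoder on the training split itself guarantees contiguous codes:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Suppose the training split happened to contain only these letters.
# An encoder fitted on the *full* alphabet would map them to
# non-contiguous codes, which is exactly what XGBoost rejects.
y_train = np.array(["A", "B", "D", "E", "B", "A", "D"])

# Fitting on the training split itself re-maps the classes that are
# actually present to 0 .. n_classes - 1 with no gaps.
encoder = LabelEncoder()
y_enc = encoder.fit_transform(y_train)

assert sorted(set(y_enc)) == list(range(len(encoder.classes_)))
```

The trade-off is that the encoder then only knows the classes seen in training, so unseen test labels would raise in `transform` — but for XGBoost's contiguity requirement, this is the right place to fit it.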
While this is a reasonable thing to require from users, accepting arbitrary labels does seem to be a requirement for full scikit-learn compatibility, according to their docs and tests: https://scikit-learn.org/stable/developers/develop.html#specific-models
I see what you mean. Thank y'all for the answers!
I also agree this is too restrictive. All other sklearn models handle this fine, and the case can occur when running cross_val_score with XGBoost: even with stratified splits, a fold can still miss one or more classes.
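For the cross_val_score case, one possible workaround is a thin wrapper that re-fits a LabelEncoder inside every fit call, so each fold's labels are re-mapped to 0 .. k-1 before they reach the underlying estimator. This is only a sketch (the class name is hypothetical), demonstrated with a scikit-learn tree so it runs without XGBoost installed; substituting xgboost.XGBClassifier() as the wrapped estimator is the intended use:

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

class LabelReencodingClassifier(BaseEstimator, ClassifierMixin):
    """Hypothetical wrapper: re-encodes labels to 0..k-1 inside each
    fit, so per-fold label gaps never reach the wrapped estimator."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        self.encoder_ = LabelEncoder()
        y_enc = self.encoder_.fit_transform(y)  # contiguous per fold
        self.estimator_ = clone(self.estimator)
        self.estimator_.fit(X, y_enc)
        self.classes_ = self.encoder_.classes_
        return self

    def predict(self, X):
        # Map encoded predictions back to the original labels.
        return self.encoder_.inverse_transform(self.estimator_.predict(X))

rng = np.random.default_rng(0)
X = rng.integers(0, 100, size=(100, 4)).astype(float)
y = rng.choice(list("ABCDE"), size=100)

clf = LabelReencodingClassifier(DecisionTreeClassifier(random_state=0))
scores = cross_val_score(clf, X, y, cv=5)
```

Because the encoder is fitted per fold inside fit, each fold's training labels are contiguous regardless of which classes the split happened to drop.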