Verify dask_ml XGBClassifier results against the original xgboost
Hello, I think I have found a bug in the XGBClassifier in dask_ml.xgboost.
from sklearn.datasets import load_iris
import dask.dataframe as dd
import pandas as pd
dataset = load_iris()
train = dataset.data
target = dataset.target
pdf = pd.DataFrame(data=train, columns=["1", "2", "3", "4"])
pdf_y = pd.Series(target)
# Turn the multi-class problem into a binary one so the bug is easy to show.
pdf_y.replace(2, 1, inplace=True)
from xgboost import XGBClassifier
est = XGBClassifier(n_estimators=30, max_depth=7, verbosity=0, learning_rate=0.1)
est.fit(pdf, pdf_y)
est.score(pdf, pdf_y)
With the original xgboost, we easily get 100% accuracy.
from dask_ml.xgboost import XGBClassifier
from distributed import Client
client = Client()
est = XGBClassifier(n_estimators=30, max_depth=7, verbosity=1, learning_rate=0.1)
df = dd.from_pandas(pdf, chunksize=640000)
df_y = dd.from_pandas(pdf_y, chunksize=640000).astype(int)
est.fit(df, df_y)
est.score(df, df_y)
With the same parameters and the same data, we only get 66% accuracy. The problem is that the estimator's predict() returns 1 every time, so the 66% is meaningless: it is simply the share of rows labelled 1 (100 of 150).
This is a simple example to show the bug. I have also tested it on my own project with the Titanic dataset, and it has the same problem.
est.predict(df).compute()
This returns 1 for every row of df.
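To make the 66% figure concrete, here is a minimal sketch of the accuracy that any classifier gets by always answering 1 on these labels. It does not use xgboost or dask at all. It assumes the standard Iris layout of 50 samples per class and rebuilds the labels by hand; the `accuracy` helper and the variable names are hypothetical, introduced only for this illustration.

```python
# Iris has 50 samples for each of the classes 0, 1 and 2 (assumed here and
# rebuilt by hand so the check needs no third-party packages).
labels = [0] * 50 + [1] * 50 + [2] * 50

# Same relabelling as pdf_y.replace(2, 1): classes 1 and 2 are merged.
binary_labels = [1 if y == 2 else y for y in labels]

# A degenerate classifier that answers 1 for every row.
constant_predictions = [1] * len(binary_labels)


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the label."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


constant_accuracy = accuracy(binary_labels, constant_predictions)
print(f"accuracy of always predicting 1: {constant_accuracy:.4f}")
```

If est.score(df, df_y) matches this value, that is consistent with the model having learned nothing and only emitting the majority class.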
Does the same issue affect distributed XGBoost without dask (e.g. https://xgboost.readthedocs.io/en/release_0.72/tutorials/aws_yarn.html)?
I haven't tried it; maybe I will try next Monday. But I noticed that https://xgboost.readthedocs.io/en/release_0.72/tutorials/aws_yarn.html does not exist in the latest version of the docs: https://xgboost.readthedocs.io/en/latest/tutorials/aws_yarn.html That's really interesting.