xgboost-distribution
Categorical values
xgboost supports categorical values directly, without going through a one-hot sparse array. In the example below, I specify the a series to be a categorical int type. Then I pass the enable_categorical argument to xgboost's DMatrix data wrapper, and xgboost fits accordingly.
With xgboost-distribution, this doesn't work, since one passes the series/array directly to the sklearn-style .fit method. I think that under the hood the series/array is passed on to DMatrix, but enable_categorical isn't set there.
I can see two solutions:
- Create an API that allows .train, in the same vein as xgboost
- Add enable_categorical as a kwarg to either XGBDistribution() or .fit(), which is passed on under the hood.
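The second option can be sketched roughly as follows. This is only an illustration of forwarding a constructor flag to the DMatrix factory: FakeDMatrix and SketchEstimator are hypothetical stand-ins (so the snippet runs without xgboost installed), not the real xgboost.DMatrix or XGBDistribution, and the actual implementation may look quite different.

```python
# Hypothetical stand-in for xgboost.DMatrix, so the sketch runs on its own.
class FakeDMatrix:
    def __init__(self, data, label=None, nthread=None, enable_categorical=False):
        self.data = data
        self.label = label
        self.nthread = nthread
        self.enable_categorical = enable_categorical


class SketchEstimator:
    """Hypothetical sklearn-style estimator; not the real XGBDistribution."""

    def __init__(self, n_jobs=None, enable_categorical=False):
        self.n_jobs = n_jobs
        self.enable_categorical = enable_categorical

    def _create_dmatrix(self, **kwargs):
        # Forward the flag stored on the estimator alongside the usual kwargs.
        return FakeDMatrix(
            nthread=self.n_jobs,
            enable_categorical=self.enable_categorical,
            **kwargs,
        )

    def fit(self, X, y):
        self.train_dmatrix_ = self._create_dmatrix(data=X, label=y)
        return self


model = SketchEstimator(enable_categorical=True).fit([[1, 0.5]], [0.1])
print(model.train_dmatrix_.enable_categorical)  # True
```

Storing the flag on the estimator rather than passing it to .fit() keeps the .fit() signature unchanged, at the cost of fixing the choice at construction time.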
Example of it failing, with error, below:
import numpy as np
import pandas as pd
import xgboost as xgb
from xgboost_distribution import XGBDistribution

df = pd.DataFrame({"a": np.random.randint(1, 5, size=20), "b": np.random.random(size=20), "y": np.random.random(size=20)})
df["a"] = df["a"].astype("category")  # mark column a as categorical
X = df.drop(columns="y")
y = df["y"]

dtrain = xgb.DMatrix(X, label=y, enable_categorical=True)
bst = xgb.train({}, dtrain) # works
# bst.predict(dtrain)

# Same data through the sklearn-style API
model = XGBDistribution()
model.fit(X, y) # fails
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_68/367534232.py in <module>
15
16 model = XGBDistribution()
---> 17 model.fit(X, y) # fails
/usr/local/lib/python3.9/site-packages/xgboost_distribution/model.py in fit(self, X, y, sample_weight, eval_set, early_stopping_rounds, verbose, xgb_model, sample_weight_eval_set, feature_weights, callbacks)
117 base_margin_eval_set = None
118
--> 119 train_dmatrix, evals = _wrap_evaluation_matrices(
120 missing=self.missing,
121 X=X,
/usr/local/lib/python3.9/site-packages/xgboost/sklearn.py in _wrap_evaluation_matrices(missing, X, y, group, qid, sample_weight, base_margin, feature_weights, eval_set, sample_weight_eval_set, base_margin_eval_set, eval_group, eval_qid, create_dmatrix, label_transform)
234
235 """
--> 236 train_dmatrix = create_dmatrix(
237 data=X,
238 label=label_transform(y),
/usr/local/lib/python3.9/site-packages/xgboost_distribution/model.py in <lambda>(**kwargs)
131 eval_group=None,
132 eval_qid=None,
--> 133 create_dmatrix=lambda **kwargs: DMatrix(nthread=self.n_jobs, **kwargs),
134 label_transform=lambda x: x,
135 )
/usr/local/lib/python3.9/site-packages/xgboost/core.py in inner_f(*args, **kwargs)
434 for k, arg in zip(sig.parameters, args):
435 kwargs[k] = arg
--> 436 return f(**kwargs)
437
438 return inner_f
/usr/local/lib/python3.9/site-packages/xgboost/core.py in __init__(self, data, label, weight, base_margin, missing, silent, feature_names, feature_types, nthread, group, qid, label_lower_bound, label_upper_bound, feature_weights, enable_categorical)
539 from .data import dispatch_data_backend
540
--> 541 handle, feature_names, feature_types = dispatch_data_backend(
542 data,
543 missing=self.missing,
/usr/local/lib/python3.9/site-packages/xgboost/data.py in dispatch_data_backend(data, missing, threads, feature_names, feature_types, enable_categorical)
571 return _from_tuple(data, missing, feature_names, feature_types)
572 if _is_pandas_df(data):
--> 573 return _from_pandas_df(data, enable_categorical, missing, threads,
574 feature_names, feature_types)
575 if _is_pandas_series(data):
/usr/local/lib/python3.9/site-packages/xgboost/data.py in _from_pandas_df(data, enable_categorical, missing, nthread, feature_names, feature_types)
256 def _from_pandas_df(data, enable_categorical, missing, nthread,
257 feature_names, feature_types):
--> 258 data, feature_names, feature_types = _transform_pandas_df(
259 data, enable_categorical, feature_names, feature_types)
260 return _from_numpy_array(data, missing, nthread, feature_names,
/usr/local/lib/python3.9/site-packages/xgboost/data.py in _transform_pandas_df(data, enable_categorical, feature_names, feature_types, meta, meta_type)
221 categorical type is supplied, DMatrix parameter
222 `enable_categorical` must be set to `True`."""
--> 223 raise ValueError(msg + ', '.join(bad_fields))
224
225 if feature_names is None and meta is None:
ValueError: DataFrame.dtypes for data must be int, float, bool or categorical. When
categorical type is supplied, DMatrix parameter
`enable_categorical` must be set to `True`.a
This looks like it will be fixed automatically in the next version of xgboost. Current master takes an enable_categorical: bool argument to XGBModel, which is inherited by XGBDistribution. So in the future one can do, e.g.:
model = XGBDistribution(enable_categorical=True)
model.fit(...)
For now, I've fixed it locally in this manner. (I'm in a container environment and just needed a quick fix)
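Where patching the installed package is not an option, one fallback is to avoid categorical dtypes altogether by one-hot encoding before calling .fit. The sketch below uses pandas.get_dummies on the same kind of data as the example above. Note that this is exactly the one-hot detour that native categorical support is meant to avoid, and the final .fit call is left as a comment because it is an untested assumption that it behaves as expected here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.integers(1, 5, size=20),
    "b": rng.random(size=20),
    "y": rng.random(size=20),
})
df["a"] = df["a"].astype("category")

X = df.drop(columns="y")
y = df["y"]

# Expand the categorical column into float indicator columns (a_1, a_2, ...).
X_encoded = pd.get_dummies(X, columns=["a"], dtype=float)

# Check that no categorical dtypes remain in the feature frame.
has_categorical = any(
    isinstance(dtype, pd.CategoricalDtype) for dtype in X_encoded.dtypes
)
print(has_categorical)  # False

# Assumption: a purely numeric frame can then be passed to the estimator, e.g.
# model = XGBDistribution()
# model.fit(X_encoded, y)
```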
Hi @thomasaarholt, thanks for flagging, this looks like a nice feature! Indeed, it appears that this will be included in the next version of xgboost, so I'll update then. For now, just a word of caution about this feature, from the xgboost docs of the current release:
Experimental support of specializing for categorical features. Do not set to True unless you are interested in development. Currently it’s only available for gpu_hist tree method with 1 vs rest (one hot) categorical split.
Since XGBDistribution uses custom objectives for estimating distributions, the gpu_hist tree method might not work as expected (the objective functions would have to be re-implemented in xgboost source code to work with GPUs). However, I haven't yet tried setting enable_categorical=True, so if you do get sensible results with the above, potentially it works regardless.