
Categorical values

thomasaarholt opened this issue 3 years ago · 2 comments

XGBoost supports categorical values directly, without going through a one-hot sparse array. In the example below, I specify the "a" series as a categorical int type, then pass the enable_categorical argument to xgboost's DMatrix data wrapper, and xgboost fits accordingly.

With xgboost-distribution this doesn't work, since one passes the series/array directly to the sklearn-style .fit method. I think that under the hood the series/array is passed to a DMatrix constructor, but enable_categorical isn't set there.

I can see two solutions:

  • Create an API that allows .train in the same vein as xgboost
  • Add enable_categorical as a kwarg to either XGBDistribution() or .fit(), and pass it through under the hood

Example of it failing, with error, below:

import pandas as pd
import numpy as np

df = pd.DataFrame({"a":np.random.randint(1,5, size=20), "b":np.random.random(size=20), "y":np.random.random(size=20)})
df.a = df.a.astype("category")

X = df.drop(columns="y")
y = df.y

import xgboost as xgb
dtrain = xgb.DMatrix(X, label=y, enable_categorical=True)

bst = xgb.train({}, dtrain) # works
# bst.predict(dtrain)

from xgboost_distribution import XGBDistribution

model = XGBDistribution()
model.fit(X, y)  # fails

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_68/367534232.py in <module>
     15 
     16 model = XGBDistribution()
---> 17 model.fit(X, y) # fails

/usr/local/lib/python3.9/site-packages/xgboost_distribution/model.py in fit(self, X, y, sample_weight, eval_set, early_stopping_rounds, verbose, xgb_model, sample_weight_eval_set, feature_weights, callbacks)
    117             base_margin_eval_set = None
    118 
--> 119         train_dmatrix, evals = _wrap_evaluation_matrices(
    120             missing=self.missing,
    121             X=X,

/usr/local/lib/python3.9/site-packages/xgboost/sklearn.py in _wrap_evaluation_matrices(missing, X, y, group, qid, sample_weight, base_margin, feature_weights, eval_set, sample_weight_eval_set, base_margin_eval_set, eval_group, eval_qid, create_dmatrix, label_transform)
    234 
    235     """
--> 236     train_dmatrix = create_dmatrix(
    237         data=X,
    238         label=label_transform(y),

/usr/local/lib/python3.9/site-packages/xgboost_distribution/model.py in <lambda>(**kwargs)
    131             eval_group=None,
    132             eval_qid=None,
--> 133             create_dmatrix=lambda **kwargs: DMatrix(nthread=self.n_jobs, **kwargs),
    134             label_transform=lambda x: x,
    135         )

/usr/local/lib/python3.9/site-packages/xgboost/core.py in inner_f(*args, **kwargs)
    434         for k, arg in zip(sig.parameters, args):
    435             kwargs[k] = arg
--> 436         return f(**kwargs)
    437 
    438     return inner_f

/usr/local/lib/python3.9/site-packages/xgboost/core.py in __init__(self, data, label, weight, base_margin, missing, silent, feature_names, feature_types, nthread, group, qid, label_lower_bound, label_upper_bound, feature_weights, enable_categorical)
    539         from .data import dispatch_data_backend
    540 
--> 541         handle, feature_names, feature_types = dispatch_data_backend(
    542             data,
    543             missing=self.missing,

/usr/local/lib/python3.9/site-packages/xgboost/data.py in dispatch_data_backend(data, missing, threads, feature_names, feature_types, enable_categorical)
    571         return _from_tuple(data, missing, feature_names, feature_types)
    572     if _is_pandas_df(data):
--> 573         return _from_pandas_df(data, enable_categorical, missing, threads,
    574                                feature_names, feature_types)
    575     if _is_pandas_series(data):

/usr/local/lib/python3.9/site-packages/xgboost/data.py in _from_pandas_df(data, enable_categorical, missing, nthread, feature_names, feature_types)
    256 def _from_pandas_df(data, enable_categorical, missing, nthread,
    257                     feature_names, feature_types):
--> 258     data, feature_names, feature_types = _transform_pandas_df(
    259         data, enable_categorical, feature_names, feature_types)
    260     return _from_numpy_array(data, missing, nthread, feature_names,

/usr/local/lib/python3.9/site-packages/xgboost/data.py in _transform_pandas_df(data, enable_categorical, feature_names, feature_types, meta, meta_type)
    221                 categorical type is supplied, DMatrix parameter
    222                 `enable_categorical` must be set to `True`."""
--> 223         raise ValueError(msg + ', '.join(bad_fields))
    224 
    225     if feature_names is None and meta is None:

ValueError: DataFrame.dtypes for data must be int, float, bool or categorical.  When
                categorical type is supplied, DMatrix parameter
                `enable_categorical` must be set to `True`.a

thomasaarholt · Sep 15 '21 14:09

This looks like it will be fixed automatically in the next version of xgboost: current master adds an enable_categorical: bool argument to XGBModel, which is inherited by XGBDistribution. So in the future one will be able to do, e.g.:

model = XGBDistribution(enable_categorical=True)
model.fit(...)

For now, I've fixed it locally in this manner (I'm in a container environment and just needed a quick fix).
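For reference, a minimal sketch of one way to do such a quick fix, assuming (as the traceback above suggests) that xgboost_distribution.model imports DMatrix at module level, is to monkey-patch that name so enable_categorical is always set:

import functools

import xgboost as xgb
import xgboost_distribution.model as xgbd_model
from xgboost_distribution import XGBDistribution

# Sketch of a quick local workaround, not an official API: every DMatrix that
# XGBDistribution builds internally is now created with enable_categorical=True.
xgbd_model.DMatrix = functools.partial(xgb.DMatrix, enable_categorical=True)

model = XGBDistribution()
model.fit(X, y)  # X, y as in the example above, with X containing a categorical column

Editing the create_dmatrix lambda in the installed model.py (visible in the traceback above) to pass enable_categorical=True would have the same effect.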

thomasaarholt · Sep 22 '21 08:09

Hi @thomasaarholt, thanks for flagging, this looks like a nice feature! Indeed, it appears that this will be included in the next version of xgboost, so I'll update then. For now, just a word of caution about this feature, from the xgboost docs of the current release:

Experimental support of specializing for categorical features. Do not set to True unless you are interested in development. Currently it’s only available for gpu_hist tree method with 1 vs rest (one hot) categorical split.

Since XGBDistribution uses custom objectives for estimating distributions, the gpu_hist tree method might not work as expected (the objective functions would have to be re-implemented in the xgboost source code to work with GPUs). However, I haven't yet tried setting enable_categorical=True, so if you do get sensible results with the above, it potentially works regardless.
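If you do want to experiment with it, the combination described above would look roughly like this (a sketch, assuming the upcoming xgboost release where the sklearn-style estimators accept enable_categorical, plus a GPU-enabled xgboost build for gpu_hist):

from xgboost_distribution import XGBDistribution

# Experimental sketch: enable_categorical requires the upcoming xgboost release,
# and per the docs quoted above, categorical splits are currently gpu_hist-only.
model = XGBDistribution(enable_categorical=True, tree_method="gpu_hist")
model.fit(X, y)           # X may contain pandas categorical columns
preds = model.predict(X)  # predicted distribution parameters

Whether the custom objectives behave correctly with gpu_hist is exactly the open question here, so any results from this would need sanity-checking.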

CDonnerer · Oct 10 '21 16:10