Dask support
I love the xgboost-distribution package and what it enables. However, when dealing with datasets or trees that do not fit into memory, one needs to scale the task with a distributed framework like Dask.
XGBoost already supports Dask natively with an sklearn-style API (xgb.dask), and since xgboost-distribution relies on the original xgboost, I thought it would be quite easy to swap the underlying booster for a distributed one, since the API would be almost identical:
from distributed import Client, LocalCluster
import xgboost as xgb

def main(client: Client) -> None:
    # load_data() is a placeholder; it should return dask collections (dask.array / dask.dataframe)
    X, y = load_data()
    regr = xgb.dask.DaskXGBRegressor(n_estimators=100, tree_method="gpu_hist")
    regr.client = client  # assign the client
    regr.fit(X, y, eval_set=[(X, y)])
    preds = regr.predict(X)

if __name__ == "__main__":
    with LocalCluster() as cluster, Client(cluster) as client:
        main(client)
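For comparison, this is roughly what the equivalent single-machine fit looks like with xgboost-distribution today (a minimal sketch, assuming the XGBDistribution estimator with the default "normal" distribution, where predict returns the fitted distribution parameters):

from xgboost_distribution import XGBDistribution

# Single-machine version: X and y must fit in memory (e.g. numpy arrays / pandas)
model = XGBDistribution(distribution="normal", n_estimators=100)
model.fit(X, y, eval_set=[(X, y)])

# predict returns the distribution parameters, e.g. loc/scale for "normal"
preds = model.predict(X)
mean, std = preds.loc, preds.scale

The estimator interface is essentially the same as the sklearn-style regressor above, which is why swapping in a Dask-backed booster seems natural.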
This problem also pops up when you want to use federated learning, in which case one would like to use a federated booster.
So my question is: would it be possible to swap the underlying XGBoost booster in xgboost-distribution for the aforementioned xgb.dask.DaskXGBRegressor?
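To make the ask concrete, this is what I would imagine the usage looking like; DaskXGBDistribution is purely hypothetical and does not exist in the package today, I'm only sketching the interface I have in mind:

from distributed import Client, LocalCluster

# Hypothetical sketch only: a "DaskXGBDistribution" estimator does not exist yet.
# The idea is that it mirrors xgb.dask.DaskXGBRegressor but emits distribution parameters.
with LocalCluster() as cluster, Client(cluster) as client:
    model = DaskXGBDistribution(distribution="normal", n_estimators=100)  # hypothetical class
    model.client = client
    model.fit(X, y)           # X, y would be dask collections
    preds = model.predict(X)  # distribution parameters, computed across the cluster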
Hi, thanks for raising this. Just to understand the use case: you would like to train xgboost-distribution on datasets that do not fit in memory?
I'll take a look into how feasible this is.
yes, exactly!