
Dask support

Open · hugocool opened this issue · 2 comments

I love the xgboost-distribution package and what it enables. However, when dealing with datasets or trees that do not fit into memory, one needs to scale training with a distributed framework like Dask.

Dask already supports xgboost natively through the scikit-learn API, and since xgboost-distribution builds on the original xgboost, I thought it would be quite easy to swap the underlying booster for a distributed one, since the API would be almost identical.

from distributed import LocalCluster, Client
import xgboost as xgb


def main(client: Client) -> None:
    X, y = load_data()  # load_data() is a placeholder for your dask arrays/dataframes
    regr = xgb.dask.DaskXGBRegressor(n_estimators=100, tree_method="gpu_hist")
    regr.client = client  # assign the client
    regr.fit(X, y, eval_set=[(X, y)])
    preds = regr.predict(X)


if __name__ == "__main__":
    with LocalCluster() as cluster, Client(cluster) as client:
        main(client)

This problem also pops up when you want to use federated learning, in which case one would like to use a federated booster.

So my question is, would it be possible to swap the underlying xgboost booster in xgboost-distribution for the aforementioned xgb.dask.DaskXGBRegressor?
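For what it's worth, vanilla xgboost's Dask API already accepts a custom objective, which is the main hook such a swap would need. Below is a minimal sketch (not xgboost-distribution code; the plain squared-error objective and the synthetic data are just stand-ins for the distribution-specific gradients and hessians that xgboost-distribution computes) of what training through xgb.dask.train with a custom objective looks like:

# Minimal sketch: distributed training with a custom objective via xgboost's
# Dask API. The squared-error objective is only a stand-in; a distributional
# booster would plug its own gradient/hessian computation in here.
import numpy as np
import dask.array as da
import xgboost as xgb
from distributed import Client, LocalCluster


def squared_error_obj(preds: np.ndarray, dtrain: xgb.DMatrix):
    # gradient and hessian of 0.5 * (pred - label)^2, evaluated on each worker's shard
    labels = dtrain.get_label()
    return preds - labels, np.ones_like(preds)


def main(client: Client) -> None:
    # synthetic data for illustration; real use would load dask arrays/dataframes
    X = da.random.random((100_000, 20), chunks=(10_000, 20))
    y = da.random.random(100_000, chunks=10_000)

    dtrain = xgb.dask.DaskDMatrix(client, X, y)
    output = xgb.dask.train(
        client,
        {"tree_method": "hist"},
        dtrain,
        num_boost_round=100,
        obj=squared_error_obj,
    )
    preds = xgb.dask.predict(client, output, X)
    print(preds[:5].compute())


if __name__ == "__main__":
    with LocalCluster(n_workers=2) as cluster, Client(cluster) as client:
        main(client)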

hugocool commented on Dec 28 '22

Hi, thanks for raising this. Just to understand the use case: you would like to train xgboost-distribution on datasets that do not fit in memory?

I'll take a look into how feasible this is.

CDonnerer commented on Jan 21 '23

Yes, exactly!

hugocool commented on Feb 28 '23