Dask support
I love the xgboost-distribution package and what it enables. However, when dealing with datasets or trees that do not fit into memory, one needs to scale the task with a distributed framework like Dask.
XGBoost already supports Dask natively with an sklearn-style API (xgb.dask), and since xgboost-distribution relies on the original xgboost, I thought it would be quite easy to swap the underlying booster for a distributed one, since the API would be almost identical:
from distributed import Client, LocalCluster
import xgboost as xgb

def main(client: Client) -> None:
    # load_data() is a placeholder; it should return dask collections (dask.array / dask.dataframe)
    X, y = load_data()
    regr = xgb.dask.DaskXGBRegressor(n_estimators=100, tree_method="gpu_hist")
    regr.client = client  # assign the client
    regr.fit(X, y, eval_set=[(X, y)])
    preds = regr.predict(X)

if __name__ == "__main__":
    with LocalCluster() as cluster, Client(cluster) as client:
        main(client)
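For comparison, this is roughly what the equivalent single-machine fit looks like with xgboost-distribution today (a minimal sketch, assuming the XGBDistribution estimator with the default "normal" distribution, where predict returns the fitted distribution parameters):

from xgboost_distribution import XGBDistribution

# Single-machine version: X and y must fit in memory (e.g. numpy arrays / pandas)
model = XGBDistribution(distribution="normal", n_estimators=100)
model.fit(X, y, eval_set=[(X, y)])

# predict returns the distribution parameters, e.g. loc/scale for "normal"
preds = model.predict(X)
mean, std = preds.loc, preds.scale

The estimator interface is essentially the same as the sklearn-style regressor above, which is why swapping in a Dask-backed booster seems natural.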
This problem also pops up when you want to use federated learning, in which case one would like to use a federated booster.
So my question is: would it be possible to swap the underlying XGBoost booster in xgboost-distribution for the aforementioned xgb.dask.DaskXGBRegressor?
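To make the ask concrete, this is what I would imagine the usage looking like; DaskXGBDistribution is purely hypothetical and does not exist in the package today, I'm only sketching the interface I have in mind:

from distributed import Client, LocalCluster

# Hypothetical sketch only: a "DaskXGBDistribution" estimator does not exist yet.
# The idea is that it mirrors xgb.dask.DaskXGBRegressor but emits distribution parameters.
with LocalCluster() as cluster, Client(cluster) as client:
    model = DaskXGBDistribution(distribution="normal", n_estimators=100)  # hypothetical class
    model.client = client
    model.fit(X, y)           # X, y would be dask collections
    preds = model.predict(X)  # distribution parameters, computed across the cluster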
Hi, thanks for raising this. Just to understand the use case: you would like to train xgboost-distribution on datasets that do not fit in memory?
I'll take a look into how feasible this is.
yes, exactly!