dask-xgboost
Get dask-gateway scheduler address
When connecting through dask-gateway, client.scheduler_address is a proxy address:

>>> client.scheduler.address
'gateway://dask.training.anaconda.com:8786/4fd53916f0214703934701aa7a7eaf85'
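For anyone reproducing this, here is a minimal sketch of how to see the difference (it assumes dask_gateway is installed and a gateway is reachable; the gateway URL is illustrative):

# Minimal sketch; connect via dask-gateway and compare the two addresses.
from dask_gateway import Gateway

gateway = Gateway("https://dask.training.anaconda.com")  # illustrative address
cluster = gateway.new_cluster()
client = cluster.get_client()

print(client.scheduler.address)            # gateway://... (proxy address)
print(client.scheduler_info()["address"])  # tcp://... (actual scheduler address)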
I was able to solve this with the following change in core::_train, replacing client.scheduler.address with client.scheduler_info()['address']:
# Start the XGBoost tracker on the Dask scheduler
host, port = parse_host_port(client.scheduler_info()['address'])
env = yield client._run_on_scheduler(
    start_tracker, host.strip("/:"), len(worker_map)
)
However, I get the following warning.
>>> from dask_xgboost import XGBRegressor
>>> xgb = XGBRegressor()
>>> xgb.fit(X, y)
/Users/adefusco/Applications/miniconda3/envs/xgb/lib/python3.7/site-packages/distributed/client.py:3299: RuntimeWarning: coroutine 'Client._update_scheduler_info' was never awaited
self.sync(self._update_scheduler_info)
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
I have verified that this update works correctly on a 9M-row training set and scales linearly from 4 to 8 workers (2 cores/worker). Is this the correct approach to get the actual scheduler address?
I'll look into this soon. I'm planning a refactor to move this logic into distributed itself.
Hi, is there any update on this issue? I'm using the Dask implementation in XGBoost itself, rather than this library, so my feeling is this may be a bug with Dask rather than XGBoost?
I'm using LocalCluster with Dask 2.28.0.
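For reference, the client setup looks roughly like this (a minimal sketch; the worker counts are illustrative, and X, y are dask collections prepared elsewhere):

import xgboost as xgb
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=4, threads_per_worker=1)  # illustrative sizing
client = Client(cluster)
# X, y are dask collections (dask.array / dask.dataframe) built elsewhere.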
dtrain = xgb.dask.DaskDMatrix(client, X, y)
output = xgb.dask.train(
    client,
    {
        'verbosity': 2,
        'tree_method': 'hist',
        'objective': 'binary:logistic'
    },
    dtrain,
    num_boost_round=4,
    evals=[(dtrain, 'train')]
)
/root/anaconda3/lib/python3.7/site-packages/distributed/client.py:3530: RuntimeWarning: coroutine 'Client._update_scheduler_info' was never awaited
self.sync(self._update_scheduler_info)
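If the warning is noisy in logs, one stopgap (it only hides the message, it does not fix the unawaited coroutine) is a standard warnings filter, sketched here:

import warnings

# Suppress only this specific RuntimeWarning; purely cosmetic, not a fix.
warnings.filterwarnings(
    "ignore",
    message="coroutine 'Client._update_scheduler_info' was never awaited",
    category=RuntimeWarning,
)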