Dask XGBoost Hanging

Open dawilliams-nvidia opened this issue 2 years ago • 0 comments

I am attempting to use the Bayesian Optimization package to run many iterations of Dask XGBoost. Dask workers disconnect during the optimization process, usually within the first 10 iterations. The worker drops from the cluster and is then re-added, but the Dask computation is not restarted, and the Python process hangs indefinitely without making progress. This only occurs when the Dask cluster is launched with the "dask-scheduler" and "dask-cuda-worker" commands, not when using LocalCUDACluster.
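To make the two launch modes concrete, here is a minimal sketch of how a client could be obtained in each case. The helper name `connect`, the mode strings, and the scheduler address are hypothetical; 8786 is only the default dask-scheduler port, and the tarball's notebook may connect differently.

```python
def connect(mode, scheduler_address="tcp://127.0.0.1:8786"):
    """Return a dask.distributed.Client for one of the two setups.

    mode="local":    in-process LocalCUDACluster (no hang reported)
    mode="external": cluster started separately with dask-scheduler and
                     dask-cuda-worker (the setup where the hang appears)
    """
    if mode not in ("local", "external"):
        raise ValueError(f"unknown mode: {mode!r}")

    # Imported lazily so the helper can be defined without Dask installed.
    from dask.distributed import Client

    if mode == "local":
        from dask_cuda import LocalCUDACluster
        return Client(LocalCUDACluster())

    # Assumed address; replace with the one printed by dask-scheduler.
    return Client(scheduler_address)
```

Only the `"external"` path corresponds to the failing configuration described above.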

Steps to reproduce: The attached tarball contains the needed code and full instructions in the README. Please download the tarball, extract it, and follow the instructions there.

For readability in this issue, here is a general summary of the code operation:

RAPIDS NGC image (tested most recently with 22.04); the only change is installing bayesian-optimization in the Jupyter notebook
Launch a container on a DGX V100-32 and run dask-scheduler from the container
Launch a second container on the same node and run dask-cuda-worker
The included notebook loads dataframes (HIGGS dataset), builds a dask-xgboost model, and calls the optimizer
Within the first 10 iterations (~2 min each), a worker drops with "Worker process XX was killed by signal X" (see worker_error.txt) and is then re-added to the cluster (see scheduler_error.txt). The Jupyter cell hangs until manually killed
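For readers without the tarball, the notebook's optimization loop might look roughly like the sketch below. This is not the attached code: the tuned hyperparameters, their bounds, the round count, and the helper names (`PBOUNDS`, `to_xgb_params`, `make_objective`, `run_search`) are assumptions for illustration, and `"gpu_hist"` assumes an XGBoost release from around the time of this report. `client` is an already connected `dask.distributed.Client`, and the `X_*`/`y_*` arguments are Dask collections that have already been loaded.

```python
# Hypothetical search space; keys must match the objective's argument names.
PBOUNDS = {
    "max_depth": (3, 12),
    "learning_rate": (0.01, 0.3),
    "subsample": (0.5, 1.0),
}


def to_xgb_params(max_depth, learning_rate, subsample):
    """Map the optimizer's continuous suggestions onto XGBoost parameters."""
    return {
        "objective": "binary:logistic",
        "eval_metric": "auc",
        "tree_method": "gpu_hist",  # assumption: GPU training on an older XGBoost
        "max_depth": int(round(max_depth)),  # the optimizer proposes floats
        "learning_rate": float(learning_rate),
        "subsample": float(subsample),
    }


def make_objective(client, dtrain, dvalid, num_boost_round=100):
    """Build the function that bayesian-optimization will maximize."""
    from xgboost import dask as dxgb  # lazy import: needs xgboost with Dask support

    def objective(max_depth, learning_rate, subsample):
        params = to_xgb_params(max_depth, learning_rate, subsample)
        # Each optimizer iteration runs one full distributed training job.
        out = dxgb.train(
            client,
            params,
            dtrain,
            num_boost_round=num_boost_round,
            evals=[(dvalid, "valid")],
        )
        # Validation AUC after the final boosting round.
        return out["history"]["valid"]["auc"][-1]

    return objective


def run_search(client, X_train, y_train, X_valid, y_valid, n_iter=10):
    """Run the Bayesian search over repeated Dask XGBoost training runs."""
    from bayes_opt import BayesianOptimization
    from xgboost import dask as dxgb

    dtrain = dxgb.DaskDMatrix(client, X_train, y_train)
    dvalid = dxgb.DaskDMatrix(client, X_valid, y_valid)
    optimizer = BayesianOptimization(
        f=make_objective(client, dtrain, dvalid),
        pbounds=PBOUNDS,
        random_state=0,
    )
    optimizer.maximize(init_points=2, n_iter=n_iter)
    return optimizer.max
```

In this shape, every call to `objective` blocks until the distributed training returns, so if the training call never returns after a worker is killed, the whole optimizer loop and the Jupyter cell would stall in the way described.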

hpo_hanging_replicator.tar.gz

Jun 16 '22 02:06 dawilliams-nvidia
