
Dask XGBoost Hanging

Open dawilliams-nvidia opened this issue 2 years ago • 0 comments

I am attempting to use the Bayesian Optimization package (bayesian-optimization) to run many iterations of Dask XGBoost. Dask workers disconnect during the optimization process, usually within the first 10 iterations. The worker drops from the cluster and is then re-added, but the Dask computation is not restarted, and the Python process hangs indefinitely without making progress. This only occurs when the Dask cluster is launched with the "dask-scheduler" and "dask-cuda-worker" commands, not when using LocalCUDACluster.
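To make the two cluster setups concrete, here is a minimal sketch of how a client might be obtained in each case. The `connect` and `cluster_mode` helpers and the example address are assumptions for illustration and are not taken from the attached reproducer; 8786 is the default `dask-scheduler` port.

```python
def cluster_mode(scheduler_address=None):
    """Name the setup: 'external' for dask-scheduler + dask-cuda-worker, 'local' otherwise."""
    return "local" if scheduler_address is None else "external"


def connect(scheduler_address=None):
    """Return a Dask client for either setup (hypothetical helper).

    Heavy imports are deferred so the module loads without a GPU stack.
    """
    from dask.distributed import Client

    if cluster_mode(scheduler_address) == "local":
        # In-process cluster: this setup did not show the hang.
        from dask_cuda import LocalCUDACluster

        return Client(LocalCUDACluster())

    # Separately launched scheduler and workers: this setup shows the hang.
    return Client(scheduler_address)


# Example (address is an assumption; adjust to where dask-scheduler is running):
# client = connect("tcp://127.0.0.1:8786")
```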

Steps to reproduce: The attached tarball contains the needed code and full instructions in its README. Please download the tarball, extract it, and follow the instructions there.

For readability in this issue, here is a general summary of the code operation:

  1. Start from the RAPIDS NGC image (most recently tested with 22.04); the only change is installing bayesian-optimization in the Jupyter notebook
  2. Launch a container on a DGX V100-32 and run dask-scheduler from that container
  3. Launch a second container on the same node and run dask-cuda-worker
  4. The included notebook loads the dataframes (HIGGS dataset), builds the Dask XGBoost model, and calls the optimizer
  5. Within the first 10 iterations (~2 min each), a worker drops with "Worker process XX was killed by signal X" (see worker_error.txt) and is then re-added to the cluster (see scheduler_error.txt). The Jupyter cell hangs until it is manually killed
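For readers without the tarball, step 4 could look roughly like the sketch below. This is not the attached notebook: the function names, the searched parameters and their bounds, and the choice to score on the training set are all assumptions made to keep the example short. It assumes `client` is already connected to the scheduler and that `X` and `y` are Dask collections holding the HIGGS features and labels.

```python
def to_xgb_params(max_depth, learning_rate, subsample):
    """Map the optimizer's continuous suggestions to XGBoost parameters (hypothetical helper)."""
    return {
        "tree_method": "gpu_hist",  # GPU training, as used with RAPIDS 22.04-era XGBoost
        "objective": "binary:logistic",  # HIGGS is a binary classification task
        "eval_metric": "auc",
        "max_depth": int(round(max_depth)),  # the optimizer proposes floats
        "learning_rate": float(learning_rate),
        "subsample": float(subsample),
    }


def make_objective(client, dtrain, num_boost_round=100):
    """Build the function the optimizer maximizes: one Dask XGBoost run per call."""
    import xgboost as xgb

    def objective(max_depth, learning_rate, subsample):
        output = xgb.dask.train(
            client,
            to_xgb_params(max_depth, learning_rate, subsample),
            dtrain,
            num_boost_round=num_boost_round,
            evals=[(dtrain, "train")],  # assumption: scored on training data for brevity
        )
        # The last recorded AUC is the score for this iteration.
        return output["history"]["train"]["auc"][-1]

    return objective


def run_search(client, X, y, n_iter=10):
    """Run the Bayesian search; in the report, the hang appears within these iterations."""
    import xgboost as xgb
    from bayes_opt import BayesianOptimization

    dtrain = xgb.dask.DaskDMatrix(client, X, y)
    optimizer = BayesianOptimization(
        f=make_objective(client, dtrain),
        pbounds={  # assumed search ranges
            "max_depth": (3, 10),
            "learning_rate": (0.01, 0.3),
            "subsample": (0.5, 1.0),
        },
        random_state=0,
    )
    optimizer.maximize(init_points=2, n_iter=n_iter)
    return optimizer.max
```

Each call to `objective` is a full distributed training run, so a worker that is killed and re-added mid-run leaves that call, and therefore `optimizer.maximize`, waiting.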

hpo_hanging_replicator.tar.gz

dawilliams-nvidia, Jun 16 '22 02:06