xgboost dask with specified address/port tries to re-bind to same port in some situations

Open ntabris opened this issue 2 years ago • 2 comments

The default behavior of xgboost with dask is to bind a listener to an ephemeral port on the scheduler.

We had been trying to deploy dask in a way where we only opened specified ports on the scheduler (specifically, running the dask scheduler in a container using bridge networking mode). We tried using xgboost.scheduler_address to set the address/port to one that we explicitly opened.

The problem is that in certain circumstances xgboost appears to try to re-bind using xgboost.scheduler_address, which fails with an "Address already in use" error. This happens when running training a second time, and might happen even when running training once on a larger dataset (not sure about this though).

More details here: https://github.com/coiled/coiled-runtime/issues/150

I'm guessing it should either release or re-use the listener it originally bound, and it's not doing that.
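
To illustrate that guess outside of xgboost, here is a minimal sketch using only Python's standard socket module. It is not xgboost's tracker code, and the helper names bind_listener and try_rebind are made up for this example. It assumes the failure is an ordinary double bind: a second bind to the same address/port fails while the first listener is still open, and succeeds once the first one has been closed.

```python
import errno
import socket


def bind_listener(host, port):
    """Bind and listen on (host, port); hypothetical stand-in for a tracker listener."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen()
    except OSError:
        sock.close()
        raise
    return sock


def try_rebind(host, port):
    """Return the errno of a failed bind, or None if the bind succeeded."""
    try:
        second = bind_listener(host, port)
    except OSError as exc:
        return exc.errno
    second.close()
    return None


# "First training run": let the OS pick a free port, then treat it as the fixed port.
first = bind_listener("127.0.0.1", 0)
port = first.getsockname()[1]

# "Second training run" while the first listener has not been released.
while_open = try_rebind("127.0.0.1", port)

# Release the first listener, then try again.
first.close()
after_close = try_rebind("127.0.0.1", port)

print("listener still open:", errno.errorcode.get(while_open, while_open))
print("listener closed:", "bind ok" if after_close is None else after_close)
```

If the same thing is happening here, either closing the original listener after training or keeping it around and re-using it for the next run would avoid the error.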

(In case it matters, our solution is probably going to be not using a networking mode that requires opening explicit ports.)

ntabris avatar May 31 '22 18:05 ntabris

It should release the port once the training is finished. I'm not sure what's happening. Will try to reproduce it on my end.

trivialfis avatar Jun 01 '22 03:06 trivialfis

Perhaps the tracker is not deleted: https://github.com/dmlc/xgboost/blob/545fd4548e303931dafd98d6606454fcdc2b8f2f/python-package/xgboost/tracker.py#L205

trivialfis avatar Jun 01 '22 04:06 trivialfis