dask-kubernetes

Re-create the socket object in case of a connection failure

Open danieldanciu opened this issue 1 year ago • 0 comments

When connecting to my dask cluster using

from dask_kubernetes import HelmCluster
cluster = HelmCluster(release_name='foo')

I kept getting the following error: ConnectionError: kubectl port forward failed. After debugging a bit, it turned out that the initial connection on the socket failed because the port forwarding was not quite ready yet, and then all 99 subsequent connect_ex() calls failed because the socket object was left in a broken state after the first failed attempt. Re-creating the socket object at each retry fixed the issue. Sure, there is a small cost to this, but given that we have a sleep(2) in that loop, performance is hardly a concern.

Also, do we really need 100 retry attempts? At 2 seconds per attempt that means 200 seconds of retries, so more than 3 minutes until users get an answer back. I would say that even on the slowest machines the port forwarding should be ready after a few seconds. The current implementation just gives the impression that everything hangs.
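To illustrate the fix described above, here is a minimal sketch of a retry loop that creates a fresh socket on every attempt. This is not the actual dask-kubernetes code: the function name wait_for_port and its parameters (retries, delay, timeout) are made up for this example, and the defaults are only a suggestion for a shorter retry budget.

```python
import socket
import time


def wait_for_port(host, port, retries=10, delay=2.0, timeout=1.0):
    """Return True once a TCP connection to (host, port) succeeds.

    Hypothetical helper for illustration; not part of dask-kubernetes.
    """
    for attempt in range(retries):
        # Create a new socket for every attempt. A socket whose connect
        # attempt failed may be left in an unusable state, so reusing it
        # can make every later connect_ex() call fail as well.
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            # connect_ex() returns 0 on success and an errno value otherwise.
            if sock.connect_ex((host, port)) == 0:
                return True
        # Sleep between attempts, but not after the last one.
        if attempt < retries - 1:
            time.sleep(delay)
    return False
```

With the assumed defaults (10 attempts, 2 seconds apart), a caller would get a failure back in well under a minute instead of after more than 3 minutes.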

danieldanciu, Jul 18 '22 19:07