dask-gke

periodic failures when creating a new cluster

Open bmabey opened this issue 8 years ago • 3 comments

I have run into this a few times:

$ dask create foo cluster.yml
....
replicationcontroller "jupyter-notebook" created
replicationcontroller "dask-scheduler" created
replicationcontroller "dask-worker" created
INFO: Waiting for kubernetes... (^C to stop)
INFO: Services are up
INFO: Services are up
The connection to the server x.x.x.x was refused - did you specify the right host or port?
CRITICAL: Traceback (most recent call last):
  File "/Users/bmabey/anaconda/envs/drugdiscovery/lib/python3.6/site-packages/dask_kubernetes-0.0.1-py3.6.egg/dask_kubernetes/cli/main.py", line 26, in start

Is there anything I can do to have the cluster setup continue after this? Eventually dask info foo returned information, but I was unable to connect to any of the services.

bmabey avatar Nov 02 '17 20:11 bmabey

Do you have any idea what is actually going on? We can certainly put more try/excepts around trying to connect to the cluster, but I'm not sure how that will help when we don't understand the cause.

Could you possibly debug where in the code the exception is happening? Any idea how the message "Services are up" can have appeared twice?

martindurant avatar Nov 02 '17 20:11 martindurant

The double INFO: Services are up was a copy/paste error.

In general, when this happens, I am able to connect once the info command comes back. The one time I couldn't connect, I had tried messing with the pods, so that is probably what broke it.

So I think what needs to happen is for this subprocess call to be retried with some back-offs and an eventual timeout:

subprocess.CalledProcessError: Command 'kubectl --output=json --context gke_foo_us-east1-b_cluster get services' returned non-zero exit status 1.
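
A minimal sketch of what such a retry could look like, assuming the failing call goes through subprocess.check_output. The helper name run_with_retries and its parameters (max_wait, first_delay, max_delay, and the injectable run, sleep, and clock) are hypothetical and not part of dask-kubernetes:

```python
import subprocess
import time


def run_with_retries(cmd, max_wait=120.0, first_delay=1.0, max_delay=15.0,
                     run=subprocess.check_output, sleep=time.sleep,
                     clock=time.monotonic):
    """Run cmd, retrying on a non-zero exit status with exponential back-off.

    Hypothetical helper for illustration; not part of dask-kubernetes.
    """
    deadline = clock() + max_wait
    delay = first_delay
    while True:
        try:
            return run(cmd)
        except subprocess.CalledProcessError:
            # Give up once the next sleep would cross the deadline, and
            # surface the last error to the caller.
            if clock() + delay > deadline:
                raise
            sleep(delay)
            # Double the delay each time, capped at max_delay.
            delay = min(delay * 2, max_delay)
```

With the command from the error above, the call would be run_with_retries(["kubectl", "--output=json", "--context", "gke_foo_us-east1-b_cluster", "get", "services"]). Only CalledProcessError is retried, so a missing kubectl binary or a Ctrl-C still fails immediately.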

bmabey avatar Nov 02 '17 20:11 bmabey

Probably it would be reasonable to put

try:
    ...
except Exception:  # a bare except would also swallow ^C
    continue

around the calls to get_pods and services_in_context within wait_until_ready (but not in the functions themselves: if they are called directly, they should raise, I think).
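
As a rough sketch of that loop: the real signatures of get_pods, services_in_context, and wait_until_ready are not shown in this thread, so the version below takes them as plain callables, and is_ready, poll_interval, and timeout are hypothetical names. It catches only subprocess.CalledProcessError so that ^C still stops the wait:

```python
import subprocess
import time


def wait_until_ready_sketch(get_pods, services_in_context, is_ready,
                            poll_interval=2.0, timeout=300.0,
                            sleep=time.sleep, clock=time.monotonic):
    """Poll until is_ready(pods, services) is true, tolerating kubectl errors.

    Illustration only: the callables stand in for the real dask-kubernetes
    functions, whose arguments are an assumption here.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        try:
            pods = get_pods()
            services = services_in_context()
        except subprocess.CalledProcessError:
            # The API server may refuse connections briefly; try again.
            sleep(poll_interval)
            continue
        if is_ready(pods, services):
            return pods, services
        sleep(poll_interval)
    raise TimeoutError("cluster not ready within %s seconds" % timeout)
```

Because the try/except lives in the loop and not in the helpers, calling get_pods or services_in_context directly still raises on failure.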

Would you like to contribute this?

martindurant avatar Nov 02 '17 20:11 martindurant