periodic failures when creating a new cluster
I have run into this a few times:
```
$ dask create foo cluster.yml
....
replicationcontroller "jupyter-notebook" created
replicationcontroller "dask-scheduler" created
replicationcontroller "dask-worker" created
INFO: Waiting for kubernetes... (^C to stop)
INFO: Services are up
INFO: Services are up
The connection to the server x.x.x.x was refused - did you specify the right host or port?
CRITICAL: Traceback (most recent call last):
  File "/Users/bmabey/anaconda/envs/drugdiscovery/lib/python3.6/site-packages/dask_kubernetes-0.0.1-py3.6.egg/dask_kubernetes/cli/main.py", line 26, in start
```
Is there anything I can do to have the cluster continue to be set up after this? Eventually `dask info foo` returned information, but I was unable to connect to any of the services.
Do you have any idea what is actually going on? We can certainly put more try/excepts around trying to connect to the cluster, but I'm not sure how that will help when we don't understand the cause.
Could you possibly debug where in the code the exception is happening? Any idea how the message "Services are up" could have appeared twice?
The double `INFO: Services are up` was a copy/paste error.
In general when this happens, once the info command comes back I am able to connect. The one time that I couldn't connect, I had been messing with the pods, so that is probably what broke it.
So I think what needs to happen is for this subprocess call to be retried with some back-offs and an eventual timeout:

```
subprocess.CalledProcessError: Command 'kubectl --output=json --context gke_foo_us-east1-b_cluster get services' returned non-zero exit status 1.
```
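Something like this retry wrapper is what I have in mind (just a sketch: `run_kubectl_with_retries` is a made-up helper, and the retry budget and back-off factor are arbitrary, not anything that exists in dask-kubernetes today):

```python
import subprocess
import time


def run_kubectl_with_retries(args, attempts=5, initial_delay=1.0, backoff=2.0):
    """Run a kubectl command, retrying with exponential back-off.

    If every attempt fails, the last CalledProcessError is re-raised so the
    caller still sees a hard failure once the retry budget is exhausted.
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return subprocess.check_output(args)
        except subprocess.CalledProcessError:
            if attempt == attempts:
                raise  # out of retries: surface the original error
            time.sleep(delay)
            delay *= backoff


# The call that failed in the traceback above would then become:
# run_kubectl_with_retries(["kubectl", "--output=json", "--context",
#                           "gke_foo_us-east1-b_cluster", "get", "services"])
```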
Probably it would be reasonable to put

```python
try:
    ...
except Exception:
    continue
```

around the calls to get_pods and services_in_context within wait_until_ready (but not inside those functions themselves - if they are called directly, I think they should raise), roughly as sketched below.
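As a rough sketch of what I mean (I'm guessing at the import location and at the signatures of `get_pods` and `services_in_context`, so treat this as illustrative rather than a drop-in patch):

```python
import time

# Assumed import location; the real helpers live somewhere in dask_kubernetes.
from dask_kubernetes.cli.main import get_pods, services_in_context


def wait_until_ready(context, timeout=300, poll_interval=5):
    """Poll the cluster until pods and services respond.

    Transient kubectl failures (like the "connection refused" above) are
    swallowed here and retried; get_pods / services_in_context themselves
    still raise when called directly.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            pods = get_pods(context)                 # assumed signature
            services = services_in_context(context)  # assumed signature
        except Exception:
            # kubectl refused the connection or similar; back off and retry.
            time.sleep(poll_interval)
            continue
        if pods and services:
            return
        time.sleep(poll_interval)
    raise TimeoutError("cluster did not become ready within %ss" % timeout)
```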
Would you like to contribute this?