
kubecluster pods won't launch on ocean.pangeo.io

Open rabernat opened this issue 4 years ago • 18 comments

I think the recent changes have broken dask_kubernetes, which most ocean.pangeo.io users still use.

The pod dask-0000-0002-1606-6982-c2c57f03-27jmw5, for example:

https://console.cloud.google.com/kubernetes/pod/us-central1-b/dev-pangeo-io-cluster/ocean-prod/dask-0000-0002-1606-6982-c2c57f03-27jmw5/details?project=pangeo-181919

Cannot schedule pods: node(s) had taints that the pod didn't tolerate.

rabernat avatar Jun 10 '20 19:06 rabernat
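(A hedged diagnostic sketch, not part of the original report: the taints and scheduling events behind a message like this can be inspected with the kubernetes Python client, assuming whatever context runs it is allowed to list nodes and events in ocean-prod.)

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run inside a pod
v1 = client.CoreV1Api()

# Which nodes carry taints that an untolerated pod cannot schedule onto?
for node in v1.list_node().items:
    if node.spec.taints:
        print(node.metadata.name, [(t.key, t.value, t.effect) for t in node.spec.taints])

# What scheduling / scale-up events were recorded for the stuck worker pod?
pod_name = "dask-0000-0002-1606-6982-c2c57f03-27jmw5"
events = v1.list_namespaced_event(
    "ocean-prod", field_selector="involvedObject.name=" + pod_name
)
for ev in events.items:
    print(ev.reason, "-", ev.message)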

cc @chiaral

rabernat avatar Jun 10 '20 19:06 rabernat

Dask gateway works fine.

I'm fine with no longer supporting dask_kubernetes, but we need at least an announcement to warn users. And the dask jupyterlab widget is still configured for dask_kubernetes.

rabernat avatar Jun 10 '20 19:06 rabernat

@rabernat that link is broken now that the pod is gone. Was that message followed by something like

 pod triggered scale-up: [{https://content.googleapis.com/compute/v1/projects/pangeo-181919/zones/us-central1-b/instanceGroups/gke-dev-pangeo-io-cluster-dask-pool-2-8b0638c4-grp 55->58 (max: 300)}] 

I'm able to at least get workers with a KubeCluster. Unsure about auto-scaling the node pool, checking now.

TomAugspurger avatar Jun 10 '20 19:06 TomAugspurger

cluster = KubeCluster()
cluster.scale(20)

Eventually gave me 20 workers, some of which I think were on nodes that autoscaled up:

$ kubectl -n ocean-prod get nodes
NAME                                                  STATUS   ROLES    AGE     VERSION
...
gke-dev-pangeo-io-cluster-dask-pool-2-8b0638c4-9khj   Ready    <none>   4m8s    v1.15.9-gke.24
...

TomAugspurger avatar Jun 10 '20 20:06 TomAugspurger

I only tried to use the widget; I hadn't tried the command line for the KubeCluster. (I did try the command line for dask gateway.) What do I import to call KubeCluster?

chiaral avatar Jun 10 '20 20:06 chiaral

OK, I also tried with the widget and got workers to show up eventually. Could you try again and wait a while to see if the workers show up?

KubeCluster comes from "from dask_kubernetes import KubeCluster".

TomAugspurger avatar Jun 10 '20 21:06 TomAugspurger
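(For reference, a minimal end-to-end sketch of the command-line path being described here; the empty KubeCluster() call assumes the hub's dask configuration supplies the worker pod template, as it does on this deployment.)

from dask.distributed import Client
from dask_kubernetes import KubeCluster

cluster = KubeCluster()   # worker pod template comes from the deployment's dask config
cluster.scale(20)         # request 20 workers; new nodes may take a few minutes to autoscale
client = Client(cluster)  # attach a client so computations run on those workers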

Ok a note:

I tried to use gateway and my jobs always crashed. So I looked into it a bit more, and there is a substantial difference between the setup available through KubeCluster and the one available through Gateway.

import dask_gateway
gateway = dask_gateway.Gateway()
options = gateway.cluster_options() 

options.worker_memory is set to 4 GB. The memory of each worker through KubeCluster is 11.5 GB.

When I tried to set options.worker_memory to 11.5, I got an error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-10-aaadd3d6e83f> in <module>
----> 1 options.worker_memory=11.5

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/options.py in __setattr__(self, key, value)
    105 
    106     def __setattr__(self, key, value):
--> 107         return self._set(key, value, AttributeError)
    108 
    109     def __getitem__(self, key):

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/options.py in _set(self, key, value, exc_cls)
     97     def _set(self, key, value, exc_cls):
     98         try:
---> 99             self._fields[key].set(value)
    100         except KeyError:
    101             raise exc_cls("No option %r available" % key) from None

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/options.py in set(self, value)
    164 
    165     def set(self, value):
--> 166         self.value = self.validate(value)
    167         # Update all linked widgets
    168         for w in self._widgets:

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/options.py in validate(self, x)
    275         if not isinstance(x, float):
    276             raise TypeError("%s must be a float, got %r" % (self.field, x))
--> 277         return super().validate(x)
    278 
    279     def _widget(self):

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/options.py in validate(self, x)
    240             raise ValueError("%s must be >= %f, got %s" % (self.field, self.min, x))
    241         if self.max is not None and x > self.max:
--> 242             raise ValueError("%s must be <= %f, got %s" % (self.field, self.max, x))
    243         return x
    244 

ValueError: worker_memory must be <= 8.000000, got 11.5

This might actually not work for me, but I think it would be a substantial difference for a lot of people, because per-worker memory is cut to roughly a third.

chiaral avatar Jun 10 '20 21:06 chiaral
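(A hedged sketch of the gateway path once the server-side limit is raised: worker_memory is the option shown above, while worker_cores is an assumption about what this deployment exposes, so it is left commented out.)

import dask_gateway

gateway = dask_gateway.Gateway()
options = gateway.cluster_options()
options.worker_memory = 8.0    # GB; values above the deployment's maximum raise ValueError
# options.worker_cores = 2     # assumed option name; check gateway.cluster_options() for what exists
cluster = gateway.new_cluster(options)
cluster.scale(20)
client = cluster.get_client()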

OK, I also tried with the widget and got workers to show up eventually. Could you try again and wait a while to see if the workers show up? KubeCluster is from from dask_kubernetes import KubeCluster.

Sorry I just saw this after I posted my note. I can definitely try and use the widget, but the command line works great! I can try the widget again later.

chiaral avatar Jun 10 '20 21:06 chiaral

Good to know, thanks.

We can easily bump that up at https://github.com/pangeo-data/helm-chart/blob/b1a230a88d2587713eb440c9de402770f6ae32e6/pangeo/values.yaml#L104 .

TomAugspurger avatar Jun 10 '20 21:06 TomAugspurger

That would be great - you could raise the maximum to at least match the previous setup, so those relying on more memory per worker can keep similar configurations. Thanks a lot.

chiaral avatar Jun 10 '20 21:06 chiaral

Thanks @TomAugspurger for checking on this. We had the mistaken impression that KubeCluster was broken, but I guess that's not the case.

I know a lot of things are in flux right now, so I thought it best to report something.

Thanks for your continued work on all this! Really appreciate it.

rabernat avatar Jun 11 '20 10:06 rabernat

Definitely good to report, since this is easily something the changes to the service account / permissions could have broken.

Opened https://github.com/pangeo-data/helm-chart/pull/131 for upping the limits on worker cores / memory. We'll need to redeploy with a bumped version of the pangeo helm chart once that's in.

TomAugspurger avatar Jun 11 '20 20:06 TomAugspurger

Thanks to everyone for working on this.

I am not convinced the original problem is resolved. I am unable to start any KubeClusters on ocean.pangeo.io.

I am doing this:

from dask_kubernetes import KubeCluster
cluster = KubeCluster()
cluster.scale(4)

I am seeing this error message in my notebook:

Task exception was never retrieved
future: <Task finished coro=<_wrap_awaitable() done, defined at /srv/conda/envs/notebook/lib/python3.7/asyncio/tasks.py:596> exception=AssertionError()>
Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.7/asyncio/tasks.py", line 603, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/deploy/spec.py", line 51, in _
    assert self.status == "running"
AssertionError

The same traceback appears once for each requested worker (four times here).

There is no trace of the pods in kubernetes:

$ kubectl -n ocean-prod get pods
NAME                                                 READY     STATUS    RESTARTS   AGE
api-ocean-prod-dask-gateway-5d544764dd-rrg6k         1/1       Running   0          4d
autohttps-8679569997-lszzw                           2/2       Running   0          14d
continuous-image-puller-6pz7z                        1/1       Running   0          5d
continuous-image-puller-prttd                        1/1       Running   0          4d
controller-ocean-prod-dask-gateway-5b57d44cd-52d89   1/1       Running   0          13d
dask-0000-0002-3606-2575-24a3d70a-2lxzxz             1/1       Running   0          3d
dask-0000-0002-3606-2575-32802620-9jchts             1/1       Running   0          3d
dask-0000-0002-3606-2575-cac84f11-e9wpj8             1/1       Running   0          2d
dask-0000-0002-3606-2575-d8ee7cfd-evl87p             1/1       Running   0          2d
dask-0000-0002-3606-2575-e3e68daa-72mzbd             1/1       Running   0          2d
dask-gateway-2b89k                                   1/1       Running   0          34m
dask-gateway-8193be291e2c48ff84a9628779d2b732        1/1       Running   0          1d
dask-gateway-bjm6q                                   1/1       Running   0          1d
dask-gateway-dcd6324a022349e38466b11a32aea02f        1/1       Running   0          34m
dask-gateway-zv5zf                                   1/1       Running   0          1d
hub-744b5b544c-p5xdb                                 1/1       Running   0          4d
jupyter-0000-2d0001-2d5154-2d5009                    1/1       Running   0          3d
jupyter-0000-2d0001-2d5234-2d177x                    1/1       Running   0          2h
jupyter-0000-2d0001-2d5999-2d4917                    1/1       Running   0          3m
jupyter-0000-2d0001-2d8991-2d5378                    1/1       Running   0          57m
jupyter-0000-2d0002-2d1606-2d6982                    1/1       Running   0          10m
jupyter-0000-2d0002-2d3254-2d1210                    1/1       Running   0          4d
jupyter-0000-2d0002-2d3606-2d2575                    1/1       Running   0          4d
jupyter-0000-2d0002-2d4313-2d2033                    1/1       Running   0          3h
jupyter-0000-2d0002-2d8654-2d6009                    1/1       Running   0          7h
jupyter-0000-2d0003-2d1094-2d0306                    1/1       Running   0          2h
proxy-5c85bc5574-h57j5                               1/1       Running   0          14d
traefik-ocean-prod-dask-gateway-59f4dfd9bd-vlgdd     1/1       Running   0          13d
user-scheduler-fb5f9548f-2lw5x                       1/1       Running   0          13d

There are no pending or starting-up pods. Workers never show up.

(Gateway is working.)

rabernat avatar Jun 15 '20 18:06 rabernat

This seems to be identical to https://github.com/PrefectHQ/prefect/issues/1841

This is due to an issue in your worker pod yaml or the proper RBAC isn't set up to create pods in the scheduler's current namespace

rabernat avatar Jun 15 '20 19:06 rabernat

Confirmed that I can see that error too. Dunno why I didn't see it earlier (though I think we had a staging -> prod merge in the meantime).

Will take a quick look.

TomAugspurger avatar Jun 15 '20 19:06 TomAugspurger

So I think to fix this, we would need to redeploy with the value daskkubernetes in https://github.com/pangeo-data/helm-chart/blob/b1a230a88d2587713eb440c9de402770f6ae32e6/pangeo/values.yaml#L25 changed to pangeo.

This gives the notebook Pods access to the kubernetes API so that they can launch other pods. @scottyhq would like to avoid enabling this for the icesat cluster. We can do that by changing pangeo.rbac.enabled in https://github.com/pangeo-data/pangeo-cloud-federation/blob/d78fe03712e416e77ee83d13c44a01bc883ae09f/pangeo-deploy/values.yaml#L8-L9 to false and enabling it just for ocean. Is that worth doing?

TomAugspurger avatar Jun 15 '20 20:06 TomAugspurger
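(A hedged sketch of how a user could confirm this from inside a notebook pod, assuming the kubernetes Python client is installed in the image: a SelfSubjectAccessReview asks the API server whether the pod's own service account may create pods in its namespace, and needs no special permissions itself.)

from kubernetes import client, config

config.load_incluster_config()  # authenticate as the notebook pod's mounted service account

# The pod's namespace is mounted next to its token.
with open("/var/run/secrets/kubernetes.io/serviceaccount/namespace") as f:
    namespace = f.read().strip()

review = client.V1SelfSubjectAccessReview(
    spec=client.V1SelfSubjectAccessReviewSpec(
        resource_attributes=client.V1ResourceAttributes(
            namespace=namespace, verb="create", resource="pods"
        )
    )
)
result = client.AuthorizationV1Api().create_self_subject_access_review(review)
print("can create pods:", result.status.allowed)  # False would explain the AssertionError above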

Thanks for digging. I'm trying to understand what changed. It looks like none of those config lines has changed for over a year...but nevertheless this KubeCluster problem just appeared last week. Can you help me understand what else changed recently in our config that would cause this to break?

I don't think it's worth a lot of work to maintain support for KubeCluster. We should focus on transitioning to Dask Gateway. We have some downtime planned next week which we can use for that purpose. But it would be nice to get some kind of quick fix in for this, since the cluster is currently in a semi-broken state.

rabernat avatar Jun 15 '20 20:06 rabernat

Thanks for digging. I'm trying to understand what changed.

The relevant line is setting the serviceAccountName at https://github.com/pangeo-data/pangeo-cloud-federation/blob/d78fe03712e416e77ee83d13c44a01bc883ae09f/pangeo-deploy/values.yaml#L33. Previously that was daskkubernetes, which had permissions to create pods.

This shows up in kubectl -n ocean-prod get pod <...> -o yaml under spec.serviceAccount.

TomAugspurger avatar Jun 15 '20 20:06 TomAugspurger
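(As a possible complement, an editor's sketch rather than something from the thread: from inside the notebook pod the active service account can also be read off the mounted token, without kubectl, because the token's "sub" claim is system:serviceaccount:<namespace>:<name>.)

import base64
import json

# The mounted service-account token is a JWT; decode its payload to see which account is in use.
with open("/var/run/secrets/kubernetes.io/serviceaccount/token") as f:
    token = f.read().strip()

payload = token.split(".")[1]
payload += "=" * (-len(payload) % 4)                  # restore stripped base64 padding
claims = json.loads(base64.urlsafe_b64decode(payload))
print(claims["sub"])                                  # e.g. system:serviceaccount:ocean-prod:<service account>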