pangeo-cloud-federation
kubecluster pods won't launch on ocean.pangeo.io
I think the recent changes have broken dask_kubernetes, which most ocean.pangeo.io users still use.
For example, the pod dask-0000-0002-1606-6982-c2c57f03-27jmw5:
https://console.cloud.google.com/kubernetes/pod/us-central1-b/dev-pangeo-io-cluster/ocean-prod/dask-0000-0002-1606-6982-c2c57f03-27jmw5/details?project=pangeo-181919
Cannot schedule pods: node(s) had taints that the pod didn't tolerate.
cc @chiaral
Dask gateway works fine.
I'm fine with no longer supporting dask_kubernetes, but we need at least an announcement to warn users. And the dask jupyterlab widget is still configured for dask_kubernetes.
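For concreteness, the jupyterlab widget picks its cluster class from dask's labextension config. A quick sketch for inspecting what it currently points at (key names taken from dask-labextension's default config, so worth double-checking):
# Sketch: inspect which cluster class the dask jupyterlab widget is configured to launch.
import dask.config
print(dask.config.get("labextension.factory.module", default=None))  # e.g. "dask_kubernetes"
print(dask.config.get("labextension.factory.class", default=None))   # e.g. "KubeCluster"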
@rabernat that link is broken now that the pod is gone. Was that message followed by something like
pod triggered scale-up: [{https://content.googleapis.com/compute/v1/projects/pangeo-181919/zones/us-central1-b/instanceGroups/gke-dev-pangeo-io-cluster-dask-pool-2-8b0638c4-grp 55->58 (max: 300)}]
I'm able to at least get workers with a KubeCluster. Unsure about auto-scaling the node pool, checking now.
from dask_kubernetes import KubeCluster
cluster = KubeCluster()
cluster.scale(20)
Eventually gave me 20 workers, some of which I think were on nodes that autoscaled up
$ kubectl -n ocean-prod get nodes
NAME STATUS ROLES AGE VERSION
...
gke-dev-pangeo-io-cluster-dask-pool-2-8b0638c4-9khj Ready <none> 4m8s v1.15.9-gke.24
...
I only tried to use the widget; I hadn't tried the command line for the KubeCluster. (I did try the command line for Dask Gateway.) What do I import to call KubeCluster?
OK, I also tried with the widget and got workers to show up eventually. Could you try again and wait a while to see if the workers show up?
KubeCluster comes from: from dask_kubernetes import KubeCluster.
OK, a note:
I tried to use Gateway and my jobs always crashed. So I looked into it a bit more, and there is a substantial difference between the setup available through KubeCluster and the one available through Gateway.
import dask_gateway
gateway = dask_gateway.Gateway()
options = gateway.cluster_options()
options.worker_memory is set to 4 GB. The memory of each worker through KubeCluster is 11.5 GB. When I tried to set options.worker_memory to 11.5, I got an error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-10-aaadd3d6e83f> in <module>
----> 1 options.worker_memory=11.5
/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/options.py in __setattr__(self, key, value)
105
106 def __setattr__(self, key, value):
--> 107 return self._set(key, value, AttributeError)
108
109 def __getitem__(self, key):
/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/options.py in _set(self, key, value, exc_cls)
97 def _set(self, key, value, exc_cls):
98 try:
---> 99 self._fields[key].set(value)
100 except KeyError:
101 raise exc_cls("No option %r available" % key) from None
/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/options.py in set(self, value)
164
165 def set(self, value):
--> 166 self.value = self.validate(value)
167 # Update all linked widgets
168 for w in self._widgets:
/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/options.py in validate(self, x)
275 if not isinstance(x, float):
276 raise TypeError("%s must be a float, got %r" % (self.field, x))
--> 277 return super().validate(x)
278
279 def _widget(self):
/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/options.py in validate(self, x)
240 raise ValueError("%s must be >= %f, got %s" % (self.field, self.min, x))
241 if self.max is not None and x > self.max:
--> 242 raise ValueError("%s must be <= %f, got %s" % (self.field, self.max, x))
243 return x
244
ValueError: worker_memory must be <= 8.000000, got 11.5
This might actually not work for me, and I think it would be a substantial difference for a lot of people, because the memory per worker is cut to almost a third.
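For reference, a minimal sketch of requesting the largest currently allowed workers through Gateway (assuming the 8 GB cap reported in the error above stays in place for now):
# Sketch: create a Gateway cluster with the per-worker memory set to the current maximum.
import dask_gateway
gateway = dask_gateway.Gateway()
options = gateway.cluster_options()
options.worker_memory = 8.0  # current per-worker maximum on this deployment (GB)
cluster = gateway.new_cluster(options)  # pass the options when creating the cluster
cluster.scale(20)
client = cluster.get_client()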
Sorry, I only saw your reply after I posted my note. I can definitely try to use the widget, but the command line works great! I can try the widget again later.
Good to know, thanks.
We can easily bump that up at https://github.com/pangeo-data/helm-chart/blob/b1a230a88d2587713eb440c9de402770f6ae32e6/pangeo/values.yaml#L104.
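For context on what bumping that up involves: dask-gateway defines those limits with its server-side Options API, roughly like the sketch below (field names and defaults are illustrative, not the exact contents of that values.yaml; `c` is the config object available in the gateway's configuration file):
# Sketch of dask-gateway's server-side cluster options (illustrative only).
from dask_gateway_server.options import Options, Integer, Float
def options_handler(options):
    # Map the user-facing options onto the worker resources dask-gateway requests.
    return {
        "worker_cores": options.worker_cores,
        "worker_memory": "%fG" % options.worker_memory,
    }
c.Backend.cluster_options = Options(
    Integer("worker_cores", default=1, min=1, max=4, label="Worker Cores"),
    # Raising `max` here is what lifts the "worker_memory must be <= 8" limit.
    Float("worker_memory", default=4, min=1, max=8, label="Worker Memory (GB)"),
    handler=options_handler,
)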
That would be great. You could raise the maximum to at least match the previous setup, so those relying on more memory per worker can have similar setups. Thanks a lot.
Thanks @TomAugspurger for checking on this. We had the mistaken impression that KubeCluster was broken, but I guess that's not the case.
I know a lot of things are in flux right now, so I thought it best to report something.
Thanks for your continued work on all this! Really appreciate it.
Definitely good to report, since this is easily something the changes to the service account / permissions could have broken.
Opened https://github.com/pangeo-data/helm-chart/pull/131 for upping the limits on worker cores / memory. We'll need to redeploy with a bumped version of the pangeo helm chart once that's in.
Thanks to everyone for working on this.
I am not convinced the original problem is resolved. I am unable to start any KubeClusters on ocean.pangeo.io.
I am doing this:
from dask_kubernetes import KubeCluster
cluster = KubeCluster()
cluster.scale(4)
I am seeing this error message in my notebook.
Task exception was never retrieved
future: <Task finished coro=<_wrap_awaitable() done, defined at /srv/conda/envs/notebook/lib/python3.7/asyncio/tasks.py:596> exception=AssertionError()>
Traceback (most recent call last):
File "/srv/conda/envs/notebook/lib/python3.7/asyncio/tasks.py", line 603, in _wrap_awaitable
return (yield from awaitable.__await__())
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/deploy/spec.py", line 51, in _
assert self.status == "running"
AssertionError
Task exception was never retrieved
future: <Task finished coro=<_wrap_awaitable() done, defined at /srv/conda/envs/notebook/lib/python3.7/asyncio/tasks.py:596> exception=AssertionError()>
Traceback (most recent call last):
File "/srv/conda/envs/notebook/lib/python3.7/asyncio/tasks.py", line 603, in _wrap_awaitable
return (yield from awaitable.__await__())
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/deploy/spec.py", line 51, in _
assert self.status == "running"
AssertionError
Task exception was never retrieved
future: <Task finished coro=<_wrap_awaitable() done, defined at /srv/conda/envs/notebook/lib/python3.7/asyncio/tasks.py:596> exception=AssertionError()>
Traceback (most recent call last):
File "/srv/conda/envs/notebook/lib/python3.7/asyncio/tasks.py", line 603, in _wrap_awaitable
return (yield from awaitable.__await__())
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/deploy/spec.py", line 51, in _
assert self.status == "running"
AssertionError
Task exception was never retrieved
future: <Task finished coro=<_wrap_awaitable() done, defined at /srv/conda/envs/notebook/lib/python3.7/asyncio/tasks.py:596> exception=AssertionError()>
Traceback (most recent call last):
File "/srv/conda/envs/notebook/lib/python3.7/asyncio/tasks.py", line 603, in _wrap_awaitable
return (yield from awaitable.__await__())
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/deploy/spec.py", line 51, in _
assert self.status == "running"
AssertionError
There seems to be one message for each requested worker.
There is no trace of the pods in Kubernetes:
$ kubectl -n ocean-prod get pods
NAME READY STATUS RESTARTS AGE
api-ocean-prod-dask-gateway-5d544764dd-rrg6k 1/1 Running 0 4d
autohttps-8679569997-lszzw 2/2 Running 0 14d
continuous-image-puller-6pz7z 1/1 Running 0 5d
continuous-image-puller-prttd 1/1 Running 0 4d
controller-ocean-prod-dask-gateway-5b57d44cd-52d89 1/1 Running 0 13d
dask-0000-0002-3606-2575-24a3d70a-2lxzxz 1/1 Running 0 3d
dask-0000-0002-3606-2575-32802620-9jchts 1/1 Running 0 3d
dask-0000-0002-3606-2575-cac84f11-e9wpj8 1/1 Running 0 2d
dask-0000-0002-3606-2575-d8ee7cfd-evl87p 1/1 Running 0 2d
dask-0000-0002-3606-2575-e3e68daa-72mzbd 1/1 Running 0 2d
dask-gateway-2b89k 1/1 Running 0 34m
dask-gateway-8193be291e2c48ff84a9628779d2b732 1/1 Running 0 1d
dask-gateway-bjm6q 1/1 Running 0 1d
dask-gateway-dcd6324a022349e38466b11a32aea02f 1/1 Running 0 34m
dask-gateway-zv5zf 1/1 Running 0 1d
hub-744b5b544c-p5xdb 1/1 Running 0 4d
jupyter-0000-2d0001-2d5154-2d5009 1/1 Running 0 3d
jupyter-0000-2d0001-2d5234-2d177x 1/1 Running 0 2h
jupyter-0000-2d0001-2d5999-2d4917 1/1 Running 0 3m
jupyter-0000-2d0001-2d8991-2d5378 1/1 Running 0 57m
jupyter-0000-2d0002-2d1606-2d6982 1/1 Running 0 10m
jupyter-0000-2d0002-2d3254-2d1210 1/1 Running 0 4d
jupyter-0000-2d0002-2d3606-2d2575 1/1 Running 0 4d
jupyter-0000-2d0002-2d4313-2d2033 1/1 Running 0 3h
jupyter-0000-2d0002-2d8654-2d6009 1/1 Running 0 7h
jupyter-0000-2d0003-2d1094-2d0306 1/1 Running 0 2h
proxy-5c85bc5574-h57j5 1/1 Running 0 14d
traefik-ocean-prod-dask-gateway-59f4dfd9bd-vlgdd 1/1 Running 0 13d
user-scheduler-fb5f9548f-2lw5x 1/1 Running 0 13d
There are no pending or starting-up pods. Workers never show up.
(Gateway is working.)
This seems to be identical to https://github.com/PrefectHQ/prefect/issues/1841, which says:
"This is due to an issue in your worker pod yaml or the proper RBAC isn't set up to create pods in the scheduler's current namespace."
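One way to check that hypothesis from inside a notebook pod is a self-subject access review via the kubernetes Python client (a sketch; assumes the client library is installed in the image):
# Sketch: ask the Kubernetes API whether the pod's service account may create pods
# in the ocean-prod namespace. If this prints False, dask_kubernetes cannot launch workers.
from kubernetes import client, config
config.load_incluster_config()  # use the notebook pod's own service account token
review = client.V1SelfSubjectAccessReview(
    spec=client.V1SelfSubjectAccessReviewSpec(
        resource_attributes=client.V1ResourceAttributes(
            namespace="ocean-prod", verb="create", resource="pods"
        )
    )
)
resp = client.AuthorizationV1Api().create_self_subject_access_review(review)
print(resp.status.allowed)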
Confirmed that I can see that error too. Dunno why I didn't see it earlier (though I think we had a staging -> prod merge in the meantime).
Will take a quick look.
So I think to fix this, we would need to redeploy with the value daskkubernetes in https://github.com/pangeo-data/helm-chart/blob/b1a230a88d2587713eb440c9de402770f6ae32e6/pangeo/values.yaml#L25 changed to pangeo.
This gives the notebook Pods access to the kubernetes API so that they can launch other pods. @scottyhq would like to avoid enabling this for the icesat cluster. We can do this by changing pangeo.rbac.enabled in https://github.com/pangeo-data/pangeo-cloud-federation/blob/d78fe03712e416e77ee83d13c44a01bc883ae09f/pangeo-deploy/values.yaml#L8-L9 to be false, and enable it just for ocean. Is that worth doing?
Thanks for digging. I'm trying to understand what changed. It looks like none of those config lines have changed for over a year... but nevertheless this KubeCluster problem just appeared last week. Can you help me understand what else changed recently in our config that would cause this to break?
I don't think it's worth a lot of work to maintain support for KubeCluster. We should focus on transitioning to Dask Gateway. We have some downtime planned next week which we can use for that purpose. But it would be nice to get some kind of quick fix in for this, since the cluster is currently in a semi-broken state.
Thanks for digging. I'm trying to understand what changed.
The relevant line is setting the serviceAccountName at https://github.com/pangeo-data/pangeo-cloud-federation/blob/d78fe03712e416e77ee83d13c44a01bc883ae09f/pangeo-deploy/values.yaml#L33. Previously that was daskkubernetes, which had permissions to create pods.
This shows up in kubectl -n ocean-prod get pod <...> -o yaml under spec.serviceAccount.
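The same check can be scripted with the kubernetes Python client (a sketch; assumes a kubeconfig with read access to the namespace, and the pod name is just an example taken from the listing earlier in the thread):
# Sketch: read which service account a notebook pod runs under, equivalent to the kubectl command above.
from kubernetes import client, config
config.load_kube_config()  # admin kubeconfig outside the cluster
pod = client.CoreV1Api().read_namespaced_pod(
    "jupyter-0000-2d0002-2d1606-2d6982", "ocean-prod"
)
print(pod.spec.service_account_name)  # previously "daskkubernetes", which had pod-create permissions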