dask-gateway
k8s controller: DaskCluster's replicas lowered, worker pods not deleted
Issue summary
I ended up with worker pods that were running but unused: the scheduler could not use them, they blocked creation of new pods when a DaskCluster needed to scale up, and they were not deleted when a DaskCluster reduced its replica count.
My understanding
- `my_cluster.get_client()` returns a `dask.distributed.Client` instance.
- The client instance is connected to the associated cluster's scheduler and gets its information about workers from the scheduler.
- The scheduler is quick to adapt and scale down, and the DaskCluster resource often quickly follows.
- The DaskCluster resource will be able to adapt upwards again, but if a worker had previously been scaled away, it may end up unused and unregistered by the scheduler.
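For concreteness, a minimal sketch of that client path (assuming an existing GatewayCluster named `my_cluster`):

# Minimal sketch; `my_cluster` is assumed to be an existing GatewayCluster.
client = my_cluster.get_client()  # a dask.distributed.Client

# The client's worker count comes from the scheduler, not from Kubernetes,
# so pods can keep running even when the scheduler reports 0 workers.
print(len(client.scheduler_info()["workers"]), "workers known to the scheduler")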
Original report extracted from #246
GatewayCluster widget inconsistent with actual pods
After my images had been pulled and pods could start quickly, I reset my state by deleting clusters etc. Then:
- I created a new cluster and chose to adapt 0-5 workers in it.
- I ran a job with a client to the cluster that finished in 20 seconds; according to the client I had 5 workers for a while, and then 0 again.
- I observed that I still had 5 pods, and my DaskCluster had 5 replicas.
- In my mind the controller is thereby doing its job: it ensures 5 replicas of the workers.
- There is something wrong though, because the scheduler knew to delete a worker, but that didn't lead to a change in the DaskCluster resource, so the controller didn't remove any pods.
- I decided to try running my job again: would it add five new pods? It turns out no; it instead errored with a Timeout, as if my workers had failed to start fast enough.
- I tried adapting to 6/6; that added one worker, and the client could observe 1 worker while there were 6 replicas according to the DaskCluster resource and I saw 6 pods.
- I ran my workload, and ended up doing work using a single worker.
- I tried adapting to 0-3 workers, but kept seeing 6 pods; these were the logs of the controller, while the DaskCluster resource was updated to 1 replica:
[D 2020-04-14 17:33:11.857 KubeController] Event - MODIFIED cluster dask-gateway.c047a173d45247dd81f232ab60d692ca
[I 2020-04-14 17:33:11.857 KubeController] Reconciling cluster dask-gateway.c047a173d45247dd81f232ab60d692ca
[I 2020-04-14 17:33:11.862 KubeController] Finished reconciling cluster dask-gateway.c047a173d45247dd81f232ab60d692ca
- I tried running my workload, and it seems that my client never registered more than one worker at best.
This may be two separate issues. Hmm... or not?
- The controller fails to align correctly with the DaskCluster resource and never successfully deletes the surplus pods that the DaskCluster resource indicates it doesn't need.
- Workers that the scheduler has used once aren't reused later if adapted away, except perhaps a single worker.
Cluster adapt 0-5 can trigger as if it were 1-5
If I have a fresh cluster and press adapt 0-5, it doesn't create a worker for me, but I have ended up in a state where just going from 0-0 to 0-5 would add back a replica in the DaskCluster resource. I think the scheduler ended up thinking it kept needing one. This is a state I failed to reproduce quickly with a new cluster.
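For reference, the widget's "adapt 0-5" corresponds to this client-side call (a sketch; `cluster` is the GatewayCluster in question):

# Equivalent to setting min 0 / max 5 in the adaptive-scaling widget.
cluster.adapt(minimum=0, maximum=5)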
I'm having a bit of a hard time parsing the issue as described above. I'm going to describe how things are supposed to work below and hope that provides some understanding. In general I haven't seen any scaling issues when running dask-gateway on k8s, so I'm curious to know what's causing issues on your end (latency, different configuration, something else?).
When a user calls e.g. `cluster.scale(5)`, the following happens:
- The `GatewayCluster` sends an api request to the gateway api server indicating the cluster should have 5 workers.
- The api server validates the request, and forwards it to the corresponding scheduler for that cluster.
- The scheduler stores the new desired number of workers (call this `worker_count`).
- The scheduler computes `delta = worker_count - n_active_workers`. This happens periodically, as well as whenever a new worker joins or leaves the cluster.
  - If `delta > 0`, it sends a `heartbeat` api request to the gateway api server indicating it would like to have `worker_count` total workers.
  - If `delta < 0`, it sends a `heartbeat` api request to the gateway api server indicating it would like to have `worker_count` total workers. It then actively shuts down the surplus workers. When these workers exit, the controller should notice and delete them. This should work regardless of the `dask-worker` being the only container in the pod (multi-container pods won't move to `Succeeded`/`Failed` if only one container stops, but the controller should still figure this out).
- When the api server gets a `heartbeat` api request, it updates the corresponding `DaskCluster` `replicas` field to the new desired `worker_count`.
- The controller sees the updated `replicas` value and makes a decision based on the known set of pending & running pods for the corresponding cluster:
  - If `replicas > pending + running`, then new worker pods are submitted to scale up the cluster.
  - If `replicas < pending + running`, then nothing happens. The controller relies on worker pods shutting themselves down when the scheduler tells them to.
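To make the heartbeat step concrete, here is an illustrative sketch of the scheduler-side bookkeeping described above (hypothetical names, not the actual dask-gateway source):

# Illustrative pseudocode only; `send_heartbeat` and `retire_workers` are
# hypothetical stand-ins for the scheduler plugin's real calls.
def reconcile(worker_count, active_workers, send_heartbeat, retire_workers):
    delta = worker_count - len(active_workers)
    # In both directions the scheduler reports the same desired total...
    send_heartbeat(count=worker_count)
    if delta < 0:
        # ...but on scale-down it also actively retires the surplus workers;
        # the controller deletes their pods only once they have exited.
        retire_workers(active_workers[delta:])  # the last -delta workers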
The above process is the same as if it were running with adaptive scaling, except that in that case the initial `scale` request comes from the scheduler alone.
The biggest potential failure mode I can think of is if the scheduler asks a worker to shut itself down, and it fails to do so. In this case the pod won't be noticed by the controller (since it's still running). I see this as a bug in `distributed`; dask workers should shut themselves down cleanly.
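If that's what's happening here, lingering pods should show up as still `Running` after the scheduler has removed their workers, e.g. (a hedged check; the label key is taken from the pod spec quoted later in this thread):

kubectl get pods -n dask-gateway \
  -l gateway.dask.org/cluster=<cluster-id> \
  -o custom-columns=NAME:.metadata.name,PHASE:.status.phase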
Does the above make sense? If you're able to reproduce, the following would be helpful:
- Logs of the controller in `DEBUG` mode
- Logs from the api server
- Logs from the corresponding scheduler with lingering workers
- Logs from the lingering workers
- The client-side code that led to this behavior (e.g. `cluster.scale(5)`).
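For the controller logs, something like the following helm values should raise the verbosity (hedged; double-check the exact keys against the chart's values.yaml for v0.7.1):

controller:
  loglevel: DEBUG
gateway:
  loglevel: DEBUG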
Sorry for the messy report, there was a lot going on and I had not isolated an issue yet. Hopefully this can make up for it.
Reproduction
Summary
In the gif below, you will see me starting out from a blank slate with no DaskCluster resource, no Dask scheduler, and no Dask workers for any scheduler. Then I create a DaskCluster with `new_cluster`, set adaptive scaling to min 1 / max 3, and run some work which adds two more workers, up to the maximum of three.
Soon after the job finishes, everything I see reports back that there is only 1 worker, except for the fact that I still have three pods. They are currently alive, 24 minutes later, without doing anything else.
GIF recording
Versions
Dask Gateway Helm chart v0.7.1
$ pip list | grep dask
dask 2.15.0
dask-gateway 0.7.1
dask-labextension 2.0.2
My JupyterLab environment uses the same image as the worker pods, so all share the same versions of dask etc.
Logs
Controller, api server, scheduler, and three worker pods - all from the event in the GIF animation. These logs run all the way to 30 minutes after the job delegated to dask finished, except for the controller's, where I trimmed away a repeating pattern after some iterations.
controller
[D 2020-05-04 19:42:45.874 KubeController] Event - MODIFIED cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 19:42:45.875 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 19:42:45.876 KubeController] Cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb scaled to 1 - creating 1 workers, deleting 0 stopped workers
[D 2020-05-04 19:42:45.903 KubeController] Created pod dask-gateway/dask-gateway-hfjr8
[D 2020-05-04 19:42:45.904 KubeController] Event - ADDED pod dask-gateway.dask-gateway-hfjr8
[I 2020-05-04 19:42:45.904 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 19:42:45.909 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-hfjr8
[D 2020-05-04 19:42:45.928 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-hfjr8
[I 2020-05-04 19:42:45.929 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 19:42:45.929 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 19:42:46.609 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-hfjr8
[I 2020-05-04 19:42:46.609 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 19:42:46.609 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 19:42:47.191 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-hfjr8
[I 2020-05-04 19:42:47.192 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 19:42:47.192 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 19:42:54.873 KubeController] Event - MODIFIED cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 19:42:54.874 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 19:42:54.874 KubeController] Cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb scaled to 3 - creating 2 workers, deleting 0 stopped workers
[D 2020-05-04 19:42:54.893 KubeController] Created pod dask-gateway/dask-gateway-qv5rv
[D 2020-05-04 19:42:54.894 KubeController] Event - ADDED pod dask-gateway.dask-gateway-qv5rv
[D 2020-05-04 19:42:54.897 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-qv5rv
[D 2020-05-04 19:42:54.910 KubeController] Event - ADDED pod dask-gateway.dask-gateway-g5447
[D 2020-05-04 19:42:54.912 KubeController] Created pod dask-gateway/dask-gateway-g5447
[I 2020-05-04 19:42:54.912 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 19:42:54.914 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-qv5rv
[D 2020-05-04 19:42:54.916 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-g5447
[D 2020-05-04 19:42:54.930 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-g5447
[I 2020-05-04 19:42:54.931 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 19:42:54.931 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 19:42:55.616 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-qv5rv
[I 2020-05-04 19:42:55.616 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 19:42:55.617 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 19:42:55.628 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-g5447
[I 2020-05-04 19:42:55.629 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 19:42:55.630 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 19:42:56.272 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-g5447
[I 2020-05-04 19:42:56.272 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 19:42:56.273 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 19:42:56.273 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-qv5rv
[I 2020-05-04 19:42:56.273 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 19:42:56.273 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 19:43:21.886 KubeController] Event - MODIFIED cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 19:43:21.887 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 19:43:21.888 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 19:45:20.958 KubeController] Too old resourceVersion in endpoints informer, relisting
[D 2020-05-04 19:45:20.971 KubeController] Relisted endpoints informer - 0 deletes, 1 updates
[I 2020-05-04 19:45:20.972 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 19:45:20.973 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 19:50:31.071 KubeController] Too old resourceVersion in endpoints informer, relisting
[D 2020-05-04 19:50:31.084 KubeController] Relisted endpoints informer - 0 deletes, 1 updates
[I 2020-05-04 19:50:31.085 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 19:50:31.086 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 19:55:41.187 KubeController] Too old resourceVersion in endpoints informer, relisting
[D 2020-05-04 19:55:41.204 KubeController] Relisted endpoints informer - 0 deletes, 1 updates
[I 2020-05-04 19:55:41.205 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 19:55:41.206 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 20:00:51.329 KubeController] Too old resourceVersion in endpoints informer, relisting
[D 2020-05-04 20:00:51.341 KubeController] Relisted endpoints informer - 0 deletes, 1 updates
[I 2020-05-04 20:00:51.343 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 20:00:51.343 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 20:06:01.435 KubeController] Too old resourceVersion in endpoints informer, relisting
[D 2020-05-04 20:06:01.446 KubeController] Relisted endpoints informer - 0 deletes, 1 updates
[I 2020-05-04 20:06:01.447 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 20:06:01.447 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 20:11:11.565 KubeController] Too old resourceVersion in endpoints informer, relisting
[D 2020-05-04 20:11:11.576 KubeController] Relisted endpoints informer - 0 deletes, 1 updates
[I 2020-05-04 20:11:11.577 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 20:11:11.577 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 20:16:21.678 KubeController] Too old resourceVersion in endpoints informer, relisting
[D 2020-05-04 20:16:21.692 KubeController] Relisted endpoints informer - 0 deletes, 1 updates
[I 2020-05-04 20:16:21.693 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 20:16:21.693 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 20:21:31.812 KubeController] Too old resourceVersion in endpoints informer, relisting
[D 2020-05-04 20:21:31.825 KubeController] Relisted endpoints informer - 0 deletes, 1 updates
[I 2020-05-04 20:21:31.826 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 20:21:31.827 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 20:25:29.199 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-g5447
[I 2020-05-04 20:25:29.199 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 20:25:29.199 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 20:25:29.205 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-hfjr8
[I 2020-05-04 20:25:29.205 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 20:25:29.205 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 20:25:29.209 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-qv5rv
[I 2020-05-04 20:25:29.209 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 20:25:29.210 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 20:26:20.917 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-hfjr8
[I 2020-05-04 20:26:20.918 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 20:26:20.918 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 20:26:21.316 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-g5447
[I 2020-05-04 20:26:21.317 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 20:26:21.317 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 20:26:21.715 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-qv5rv
[I 2020-05-04 20:26:21.716 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 20:26:21.716 KubeController] Cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb scaled to 1 - deleting 1 pending and 0 stopped workers
[D 2020-05-04 20:26:21.733 KubeController] Deleted pod dask-gateway/dask-gateway-qv5rv
[I 2020-05-04 20:26:21.734 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 20:26:21.734 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-qv5rv
[D 2020-05-04 20:26:22.916 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-qv5rv
[D 2020-05-04 20:26:23.117 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-qv5rv
[D 2020-05-04 20:26:23.121 KubeController] Event - DELETED pod dask-gateway.dask-gateway-qv5rv
[I 2020-05-04 20:26:23.121 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 20:26:23.122 KubeController] Cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb scaled to 1 - deleting 1 pending and 0 stopped workers
[D 2020-05-04 20:26:23.139 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-g5447
[D 2020-05-04 20:26:23.140 KubeController] Deleted pod dask-gateway/dask-gateway-g5447
[I 2020-05-04 20:26:23.141 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 20:26:23.918 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-g5447
[D 2020-05-04 20:26:24.522 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-g5447
[D 2020-05-04 20:26:24.527 KubeController] Event - DELETED pod dask-gateway.dask-gateway-g5447
[I 2020-05-04 20:26:24.529 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 20:26:24.529 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 20:26:30.777 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-hfjr8
[I 2020-05-04 20:26:30.777 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 20:26:30.778 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 20:26:41.928 KubeController] Too old resourceVersion in endpoints informer, relisting
[D 2020-05-04 20:26:41.938 KubeController] Relisted endpoints informer - 0 deletes, 1 updates
[I 2020-05-04 20:26:41.939 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 20:26:41.939 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
api
[I 2020-05-04 19:42:45.854 DaskGateway] 200 POST /api/v1/clusters/dask-gateway.a18bf505dd3d40f0a3178526278c86bb/adapt 9.012ms
[I 2020-05-04 19:42:45.859 DaskGateway] Cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb heartbeat [count: 1, n_active: 0, n_closing: 0, n_closed: 0]
[D 2020-05-04 19:42:45.874 DaskGateway] Event - MODIFIED cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 19:42:45.875 DaskGateway] Updating dask-gateway.a18bf505dd3d40f0a3178526278c86bb in cache
[I 2020-05-04 19:42:45.876 DaskGateway] 200 POST /api/v1/clusters/dask-gateway.a18bf505dd3d40f0a3178526278c86bb/heartbeat 17.362ms
[I 2020-05-04 19:42:54.863 DaskGateway] Cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb heartbeat [count: 3, n_active: 1, n_closing: 0, n_closed: 0]
[I 2020-05-04 19:42:54.873 DaskGateway] 200 POST /api/v1/clusters/dask-gateway.a18bf505dd3d40f0a3178526278c86bb/heartbeat 10.893ms
[D 2020-05-04 19:42:54.874 DaskGateway] Event - MODIFIED cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 19:42:54.875 DaskGateway] Updating dask-gateway.a18bf505dd3d40f0a3178526278c86bb in cache
[I 2020-05-04 19:43:21.873 DaskGateway] Cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb heartbeat [count: 1, n_active: 1, n_closing: 2, n_closed: 0]
[D 2020-05-04 19:43:21.887 DaskGateway] Event - MODIFIED cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 19:43:21.888 DaskGateway] Updating dask-gateway.a18bf505dd3d40f0a3178526278c86bb in cache
[I 2020-05-04 19:43:21.888 DaskGateway] 200 POST /api/v1/clusters/dask-gateway.a18bf505dd3d40f0a3178526278c86bb/heartbeat 15.711ms
scheduler
Events
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 59m jupyterhub-user-scheduler Successfully assigned dask-gateway/dask-gateway-a18bf505dd3d40f0a3178526278c86bb to gke-ds-platform-users-small-32f6a547-5t79
Normal Pulled 59m kubelet, gke-ds-platform-users-small-32f6a547-5t79 Container image "eu.gcr.io/ds-platform/dsp-user-environment:d44defb" already present on machine
Normal Created 59m kubelet, gke-ds-platform-users-small-32f6a547-5t79 Created container dask-scheduler
Normal Started 59m kubelet, gke-ds-platform-users-small-32f6a547-5t79 Started container dask-scheduler
Initial logs
kubectl logs dask-gateway-a18bf505dd3d40f0a3178526278c86bb
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Scheduler at: tls://10.72.5.174:8786
distributed.scheduler - INFO - dashboard at: :8787
distributed.scheduler - INFO - gateway at: :8788
dask_gateway.dask_cli - INFO - Enabling adaptive scaling, minimum=1, maximum=3
dask_gateway.dask_cli - INFO - Requesting scale to 1 workers from 0
distributed.scheduler - INFO - Register worker <Worker 'tls://10.72.7.5:37253', name: dask-gateway-hfjr8, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tls://10.72.7.5:37253
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Receive client connection: Client-716b6394-8e3f-11ea-82c6-ce62591aacb0
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Receive client connection: Client-worker-726e185a-8e3f-11ea-800a-224f203b151c
distributed.core - INFO - Starting established connection
dask_gateway.dask_cli - INFO - Requesting scale to 3 workers from 1
distributed.scheduler - INFO - Register worker <Worker 'tls://10.72.7.7:37055', name: dask-gateway-g5447, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tls://10.72.7.7:37055
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Register worker <Worker 'tls://10.72.7.6:35359', name: dask-gateway-qv5rv, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tls://10.72.7.6:35359
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Receive client connection: Client-worker-74608546-8e3f-11ea-800a-fe33354d3c80
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Receive client connection: Client-worker-7461187a-8e3f-11ea-800b-2e1b9958b987
distributed.core - INFO - Starting established connection
dask_gateway.dask_cli - INFO - Requesting scale to 1 workers from 3
distributed.scheduler - INFO - Retire worker names ['dask-gateway-g5447', 'dask-gateway-qv5rv']
distributed.scheduler - INFO - Retire workers {<Worker 'tls://10.72.7.7:37055', name: dask-gateway-g5447, memory: 0, processing: 0>, <Worker 'tls://10.72.7.6:35359', name: dask-gateway-qv5rv, memory: 0, processing: 0>}
distributed.scheduler - INFO - Closing worker tls://10.72.7.7:37055
distributed.scheduler - INFO - Remove worker <Worker 'tls://10.72.7.7:37055', name: dask-gateway-g5447, memory: 0, processing: 0>
distributed.core - INFO - Removing comms to tls://10.72.7.7:37055
distributed.scheduler - INFO - Closing worker tls://10.72.7.6:35359
distributed.scheduler - INFO - Remove worker <Worker 'tls://10.72.7.6:35359', name: dask-gateway-qv5rv, memory: 0, processing: 0>
distributed.core - INFO - Removing comms to tls://10.72.7.6:35359
Appendix: log lines that arrived on scaledown
distributed.scheduler - INFO - Remove client Client-worker-726e185a-8e3f-11ea-800a-224f203b151c
distributed.scheduler - INFO - Remove client Client-worker-74608546-8e3f-11ea-800a-fe33354d3c80
distributed.scheduler - INFO - Remove worker <Worker 'tls://10.72.7.5:37253', name: dask-gateway-hfjr8, memory: 0, processing: 0>
distributed.core - INFO - Removing comms to tls://10.72.7.5:37253
distributed.scheduler - INFO - Lost all workers
distributed.scheduler - INFO - Remove client Client-worker-7461187a-8e3f-11ea-800b-2e1b9958b987
distributed.scheduler - INFO - Close client connection: Client-worker-726e185a-8e3f-11ea-800a-224f203b151c
distributed.scheduler - INFO - Close client connection: Client-worker-74608546-8e3f-11ea-800a-fe33354d3c80
distributed.scheduler - INFO - Close client connection: Client-worker-7461187a-8e3f-11ea-800b-2e1b9958b987
distributed.scheduler - INFO - Register worker <Worker 'tls://10.72.7.2:46851', name: dask-gateway-hfjr8, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tls://10.72.7.2:46851
distributed.core - INFO - Starting established connection
traefik
These are not all the logs, but this pattern repeats over and over.
time="2020-05-04T20:32:09Z" level=debug msg="Skipping Kubernetes event kind *v1.Endpoints" providerName=kubernetescrd time="2020-05-04T20:32:10Z" level=debug msg="Dropping event kind *v1.Endpoints due to throttling" providerName=kubernetescrd time="2020-05-04T20:32:10Z" level=debug msg="Dropping event kind *v1.Endpoints due to throttling" providerName=kubernetescrd time="2020-05-04T20:32:10Z" level=debug msg="Dropping event kind *v1.Endpoints due to throttling" providerName=kubernetescrd time="2020-05-04T20:32:10Z" level=debug msg="Dropping event kind *v1.Endpoints due to throttling" providerName=kubernetescrd time="2020-05-04T20:32:11Z" level=debug msg="Dropping event kind *v1.Endpoints due to throttling" providerName=kubernetescrd time="2020-05-04T20:32:11Z" level=debug msg="Skipping Kubernetes event kind *v1.Endpoints" providerName=kubernetescrd time="2020-05-04T20:32:11Z" level=debug msg="Dropping event kind *v1.Endpoints due to throttling" providerName=kubernetescrd time="2020-05-04T20:32:12Z" level=debug msg="Dropping event kind *v1.Endpoints due to throttling" providerName=kubernetescrd time="2020-05-04T20:32:12Z" level=debug msg="Dropping event kind *v1.Endpoints due to throttling" providerName=kubernetescrd time="2020-05-04T20:32:12Z" level=debug msg="Dropping event kind *v1.Endpoints due to throttling" providerName=kubernetescrd time="2020-05-04T20:32:12Z" level=debug msg="Dropping event kind *v1.Endpoints due to throttling" providerName=kubernetescrd time="2020-05-04T20:32:13Z" level=debug msg="Dropping event kind *v1.Endpoints due to throttling" providerName=kubernetescrd time="2020-05-04T20:32:13Z" level=debug msg="Skipping Kubernetes event kind *v1.Endpoints" providerName=kubernetescrd
Worker logs
dask-gateway-g5447
distributed.nanny - INFO - Start Nanny at: 'tls://10.72.7.7:35855'
distributed.worker - INFO - Start worker at: tls://10.72.7.7:37055
distributed.worker - INFO - Listening to: tls://10.72.7.7:37055
distributed.worker - INFO - dashboard at: 10.72.7.7:46369
distributed.worker - INFO - Waiting to connect to: tls://dask-gateway-a18bf505dd3d40f0a3178526278c86bb.dask-gateway:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 1
distributed.worker - INFO - Memory: 3.44 GB
distributed.worker - INFO - Local Directory: /home/jovyan/worker-oak09i88
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tls://dask-gateway-a18bf505dd3d40f0a3178526278c86bb.dask-gateway:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO - Stopping worker at tls://10.72.7.7:37055
distributed.nanny - INFO - Worker closed
[CV] tol=0.1, gamma=10000000.0, class_weight=balanced, C=1000.0 ......
[CV] tol=0.1, gamma=10000000.0, class_weight=balanced, C=1000.0, score=0.102, total= 0.4s
[CV] tol=0.1, gamma=0.001, class_weight=balanced, C=1.0 ..............
[CV] tol=0.1, gamma=0.001, class_weight=balanced, C=1.0, score=0.975, total= 0.1s
[CV] tol=0.0001, gamma=0.1, class_weight=balanced, C=1000.0 ..........
[CV] tol=0.0001, gamma=0.1, class_weight=balanced, C=1000.0, score=0.102, total= 0.5s
[CV] tol=0.001, gamma=1e-08, class_weight=balanced, C=1000000.0 ......
[CV] tol=0.001, gamma=1e-08, class_weight=balanced, C=1000000.0, score=0.937, total= 0.1s
[CV] tol=0.1, gamma=10000000.0, class_weight=balanced, C=0.01 ........
[CV] tol=0.1, gamma=10000000.0, class_weight=balanced, C=0.01, score=0.098, total= 0.4s
[CV] tol=0.01, gamma=0.001, class_weight=balanced, C=1e-06 ...........
[CV] tol=0.01, gamma=0.001, class_weight=balanced, C=1e-06, score=0.098, total= 0.5s
[CV] tol=0.01, gamma=0.1, class_weight=balanced, C=10000.0 ...........
[CV] tol=0.01, gamma=0.1, class_weight=balanced, C=10000.0, score=0.102, total= 0.5s
[CV] tol=0.1, gamma=0.001, class_weight=balanced, C=100000.0 .........
[CV] tol=0.1, gamma=0.001, class_weight=balanced, C=100000.0, score=0.977, total= 0.1s
[CV] tol=0.001, gamma=10000.0, class_weight=None, C=100.0 ............
[CV] tol=0.001, gamma=10000.0, class_weight=None, C=100.0, score=0.102, total= 0.4s
[CV] tol=0.0001, gamma=0.001, class_weight=balanced, C=10.0 ..........
[CV] tol=0.0001, gamma=0.001, class_weight=balanced, C=10.0, score=0.982, total= 0.2s
[CV] tol=0.01, gamma=0.001, class_weight=balanced, C=1e-06 ...........
[CV] tol=0.01, gamma=0.001, class_weight=balanced, C=1e-06, score=0.199, total= 0.5s
[CV] tol=0.001, gamma=10000.0, class_weight=None, C=0.01 .............
[CV] tol=0.001, gamma=10000.0, class_weight=None, C=0.01, score=0.102, total= 0.4s
[CV] tol=0.1, gamma=0.001, class_weight=balanced, C=100000.0 .........
[CV] tol=0.1, gamma=0.001, class_weight=balanced, C=100000.0, score=0.970, total= 0.1s
[CV] tol=0.1, gamma=10.0, class_weight=None, C=0.001 .................
[CV] tol=0.1, gamma=10.0, class_weight=None, C=0.001, score=0.100, total= 0.4s
[CV] tol=0.001, gamma=1e-06, class_weight=balanced, C=100.0 ..........
[CV] tol=0.001, gamma=1e-06, class_weight=balanced, C=100.0, score=0.927, total= 0.2s
[CV] tol=0.001, gamma=1e-06, class_weight=balanced, C=10.0 ...........
[CV] tol=0.001, gamma=1e-06, class_weight=balanced, C=10.0, score=0.885, total= 0.4s
[CV] tol=0.0001, gamma=0.0001, class_weight=balanced, C=1000.0 .......
[CV] tol=0.0001, gamma=0.0001, class_weight=balanced, C=1000.0, score=0.952, total= 0.1s
[CV] tol=0.1, gamma=10000000.0, class_weight=None, C=10000.0 .........
[CV] tol=0.1, gamma=10000000.0, class_weight=None, C=10000.0, score=0.102, total= 0.4s
[CV] tol=0.001, gamma=100000.0, class_weight=None, C=10.0 ............
[CV] tol=0.001, gamma=100000.0, class_weight=None, C=10.0, score=0.102, total= 0.4s
[CV] tol=0.1, gamma=1e-05, class_weight=None, C=0.0001 ...............
[CV] tol=0.1, gamma=1e-05, class_weight=None, C=0.0001, score=0.102, total= 0.5s
[CV] tol=0.0001, gamma=0.0001, class_weight=balanced, C=1000.0 .......
[CV] tol=0.0001, gamma=0.0001, class_weight=balanced, C=1000.0, score=0.952, total= 0.1s
[CV] tol=0.0001, gamma=10.0, class_weight=balanced, C=100.0 ..........
[CV] tol=0.0001, gamma=10.0, class_weight=balanced, C=100.0, score=0.102, total= 0.4s
[CV] tol=0.01, gamma=10000000.0, class_weight=None, C=100000.0 .......
[CV] tol=0.01, gamma=10000000.0, class_weight=None, C=100000.0, score=0.102, total= 0.4s
[CV] tol=0.1, gamma=1e-05, class_weight=None, C=0.0001 ...............
[CV] tol=0.1, gamma=1e-05, class_weight=None, C=0.0001, score=0.292, total= 0.5s
[CV] tol=0.0001, gamma=0.01, class_weight=balanced, C=1e-05 ..........
[CV] tol=0.0001, gamma=0.01, class_weight=balanced, C=1e-05, score=0.127, total= 0.5s
[CV] tol=0.01, gamma=1.0, class_weight=balanced, C=1e-06 .............
[CV] tol=0.01, gamma=1.0, class_weight=balanced, C=1e-06, score=0.098, total= 0.4s
[CV] tol=0.0001, gamma=0.01, class_weight=balanced, C=1e-05 ..........
[CV] tol=0.0001, gamma=0.01, class_weight=balanced, C=1e-05, score=0.145, total= 0.5s
[CV] tol=0.01, gamma=1.0, class_weight=balanced, C=1e-06 .............
[CV] tol=0.01, gamma=1.0, class_weight=balanced, C=1e-06, score=0.098, total= 0.4s
[CV] tol=0.01, gamma=1e-06, class_weight=balanced, C=0.0001 ..........
[CV] tol=0.01, gamma=1e-06, class_weight=balanced, C=0.0001, score=0.199, total= 0.5s
[CV] tol=0.0001, gamma=1e-05, class_weight=None, C=0.001 .............
[CV] tol=0.0001, gamma=1e-05, class_weight=None, C=0.001, score=0.102, total= 0.5s
[CV] tol=0.0001, gamma=1000000.0, class_weight=None, C=100000.0 ......
[CV] tol=0.0001, gamma=1000000.0, class_weight=None, C=100000.0, score=0.102, total= 0.4s
[CV] tol=0.0001, gamma=1e-05, class_weight=None, C=0.001 .............
[CV] tol=0.0001, gamma=1e-05, class_weight=None, C=0.001, score=0.292, total= 0.5s
[CV] tol=0.0001, gamma=1000000.0, class_weight=None, C=100000.0 ......
[CV] tol=0.0001, gamma=1000000.0, class_weight=None, C=100000.0, score=0.100, total= 0.4s
[CV] tol=0.01, gamma=10.0, class_weight=balanced, C=10.0 .............
[CV] tol=0.01, gamma=10.0, class_weight=balanced, C=10.0, score=0.102, total= 0.4s
[CV] tol=0.001, gamma=0.1, class_weight=balanced, C=1000.0 ...........
[CV] tol=0.001, gamma=0.1, class_weight=balanced, C=1000.0, score=0.102, total= 0.5s
[CV] tol=0.1, gamma=1e-06, class_weight=balanced, C=1000.0 ...........
[CV] tol=0.1, gamma=1e-06, class_weight=balanced, C=1000.0, score=0.945, total= 0.1s
[CV] tol=0.001, gamma=100000.0, class_weight=None, C=1e-06 ...........
[CV] tol=0.001, gamma=100000.0, class_weight=None, C=1e-06, score=0.100, total= 0.4s
[CV] tol=0.0001, gamma=1e-07, class_weight=None, C=1e-06 .............
[CV] tol=0.0001, gamma=1e-07, class_weight=None, C=1e-06, score=0.102, total= 0.5s
[CV] tol=0.01, gamma=10.0, class_weight=balanced, C=10.0 .............
[CV] tol=0.01, gamma=10.0, class_weight=balanced, C=10.0, score=0.100, total= 0.4s
[CV] tol=0.001, gamma=0.1, class_weight=balanced, C=1000.0 ...........
[CV] tol=0.001, gamma=0.1, class_weight=balanced, C=1000.0, score=0.102, total= 0.5s
[CV] tol=0.1, gamma=1e-06, class_weight=balanced, C=1000.0 ...........
[CV] tol=0.1, gamma=1e-06, class_weight=balanced, C=1000.0, score=0.935, total= 0.1s
[CV] tol=0.1, gamma=10.0, class_weight=None, C=0.001 .................
[CV] tol=0.1, gamma=10.0, class_weight=None, C=0.001, score=0.102, total= 0.4s
[CV] tol=0.0001, gamma=0.1, class_weight=balanced, C=0.01 ............
[CV] tol=0.0001, gamma=0.1, class_weight=balanced, C=0.01, score=0.098, total= 0.5s
[CV] tol=0.1, gamma=10000000.0, class_weight=balanced, C=0.01 ........
[CV] tol=0.1, gamma=10000000.0, class_weight=balanced, C=0.01, score=0.098, total= 0.4s
[CV] tol=0.001, gamma=100.0, class_weight=None, C=100.0 ..............
[CV] tol=0.001, gamma=100.0, class_weight=None, C=100.0, score=0.102, total= 0.4s
dask-gateway-hfjr8
This was the surviving pod; after 43 minutes the others were killed. It had some events associated with it.
Events
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 52m default-scheduler Successfully assigned dask-gateway/dask-gateway-hfjr8 to gke-ds-platform-dask-workers-501ae6d4-nf6w
Normal Pulled 52m kubelet, gke-ds-platform-dask-workers-501ae6d4-nf6w Container image "eu.gcr.io/ds-platform/dsp-user-environment:d44defb" already present on machine
Normal Created 52m kubelet, gke-ds-platform-dask-workers-501ae6d4-nf6w Created container dask-worker
Normal Started 52m kubelet, gke-ds-platform-dask-workers-501ae6d4-nf6w Started container dask-worker
Warning NetworkNotReady 8m55s (x6 over 9m5s) kubelet, gke-ds-platform-dask-workers-501ae6d4-nf6w network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Normal Pulling 8m53s kubelet, gke-ds-platform-dask-workers-501ae6d4-nf6w Pulling image "eu.gcr.io/ds-platform/dsp-user-environment:d44defb"
Normal Pulled 4m54s kubelet, gke-ds-platform-dask-workers-501ae6d4-nf6w Successfully pulled image "eu.gcr.io/ds-platform/dsp-user-environment:d44defb"
Normal Created 4m32s kubelet, gke-ds-platform-dask-workers-501ae6d4-nf6w Created container dask-worker
Normal Started 4m32s kubelet, gke-ds-platform-dask-workers-501ae6d4-nf6w Started container dask-worker
New logs
distributed.nanny - INFO - Start Nanny at: 'tls://10.72.7.2:41213'
distributed.worker - INFO - Start worker at: tls://10.72.7.2:46851
distributed.worker - INFO - Listening to: tls://10.72.7.2:46851
distributed.worker - INFO - dashboard at: 10.72.7.2:40351
distributed.worker - INFO - Waiting to connect to: tls://dask-gateway-a18bf505dd3d40f0a3178526278c86bb.dask-gateway:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 1
distributed.worker - INFO - Memory: 3.44 GB
distributed.worker - INFO - Local Directory: /home/jovyan/worker-mtc6ezkk
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tls://dask-gateway-a18bf505dd3d40f0a3178526278c86bb.dask-gateway:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
Initial logs
distributed.nanny - INFO - Start Nanny at: 'tls://10.72.7.5:45977'
distributed.worker - INFO - Start worker at: tls://10.72.7.5:37253
distributed.worker - INFO - Listening to: tls://10.72.7.5:37253
distributed.worker - INFO - dashboard at: 10.72.7.5:34891
distributed.worker - INFO - Waiting to connect to: tls://dask-gateway-a18bf505dd3d40f0a3178526278c86bb.dask-gateway:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 1
distributed.worker - INFO - Memory: 3.44 GB
distributed.worker - INFO - Local Directory: /home/jovyan/worker-3dgxrsmt
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tls://dask-gateway-a18bf505dd3d40f0a3178526278c86bb.dask-gateway:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
[CV] tol=0.1, gamma=0.001, class_weight=balanced, C=1.0 ..............
[CV] tol=0.1, gamma=0.001, class_weight=balanced, C=1.0, score=0.968, total= 0.1s
[CV] tol=0.001, gamma=1e-06, class_weight=balanced, C=100.0 ..........
[CV] tol=0.001, gamma=1e-06, class_weight=balanced, C=100.0, score=0.952, total= 0.2s
[CV] tol=0.001, gamma=1e-06, class_weight=balanced, C=100.0 ..........
[CV] tol=0.001, gamma=1e-06, class_weight=balanced, C=100.0, score=0.943, total= 0.2s
[CV] tol=0.01, gamma=10000000.0, class_weight=None, C=1000000.0 ......
[CV] tol=0.01, gamma=10000000.0, class_weight=None, C=1000000.0, score=0.102, total= 0.4s
[CV] tol=0.01, gamma=10000000.0, class_weight=None, C=1000000.0 ......
[CV] tol=0.01, gamma=10000000.0, class_weight=None, C=1000000.0, score=0.102, total= 0.5s
[CV] tol=0.01, gamma=10000000.0, class_weight=None, C=1000000.0 ......
[CV] tol=0.01, gamma=10000000.0, class_weight=None, C=1000000.0, score=0.100, total= 0.4s
[CV] tol=0.001, gamma=0.0001, class_weight=balanced, C=1000.0 ........
[CV] tol=0.001, gamma=0.0001, class_weight=balanced, C=1000.0, score=0.952, total= 0.1s
[CV] tol=0.001, gamma=0.0001, class_weight=balanced, C=1000.0 ........
[CV] tol=0.001, gamma=0.0001, class_weight=balanced, C=1000.0, score=0.963, total= 0.1s
[CV] tol=0.001, gamma=0.0001, class_weight=balanced, C=1000.0 ........
[CV] tol=0.001, gamma=0.0001, class_weight=balanced, C=1000.0, score=0.952, total= 0.1s
[CV] tol=0.1, gamma=1e-08, class_weight=None, C=1000000.0 ............
[CV] tol=0.1, gamma=1e-08, class_weight=None, C=1000000.0, score=0.937, total= 0.1s
[CV] tol=0.1, gamma=1e-08, class_weight=None, C=1000000.0 ............
[CV] tol=0.1, gamma=1e-08, class_weight=None, C=1000000.0, score=0.957, total= 0.1s
[CV] tol=0.1, gamma=1e-08, class_weight=None, C=1000000.0 ............
[CV] tol=0.1, gamma=1e-08, class_weight=None, C=1000000.0, score=0.938, total= 0.1s
[CV] tol=0.01, gamma=0.0001, class_weight=None, C=1000000.0 ..........
[CV] tol=0.01, gamma=0.0001, class_weight=None, C=1000000.0, score=0.952, total= 0.1s
[CV] tol=0.01, gamma=0.0001, class_weight=None, C=1000000.0 ..........
[CV] tol=0.01, gamma=0.0001, class_weight=None, C=1000000.0, score=0.963, total= 0.1s
[CV] tol=0.01, gamma=0.0001, class_weight=None, C=1000000.0 ..........
[CV] tol=0.01, gamma=0.0001, class_weight=None, C=1000000.0, score=0.952, total= 0.1s
[CV] tol=0.1, gamma=10000000.0, class_weight=balanced, C=1000.0 ......
[CV] tol=0.1, gamma=10000000.0, class_weight=balanced, C=1000.0, score=0.102, total= 0.4s
[CV] tol=0.01, gamma=1e-06, class_weight=None, C=100000.0 ............
[CV] tol=0.01, gamma=1e-06, class_weight=None, C=100000.0, score=0.938, total= 0.1s
[CV] tol=0.01, gamma=1e-06, class_weight=None, C=100000.0 ............
[CV] tol=0.01, gamma=1e-06, class_weight=None, C=100000.0, score=0.957, total= 0.1s
[CV] tol=0.01, gamma=1e-06, class_weight=None, C=100000.0 ............
[CV] tol=0.01, gamma=1e-06, class_weight=None, C=100000.0, score=0.937, total= 0.1s
[CV] tol=0.0001, gamma=0.01, class_weight=None, C=0.001 ..............
[CV] tol=0.0001, gamma=0.01, class_weight=None, C=0.001, score=0.102, total= 0.5s
[CV] tol=0.0001, gamma=0.01, class_weight=None, C=0.001 ..............
[CV] tol=0.0001, gamma=0.01, class_weight=None, C=0.001, score=0.102, total= 0.5s
[CV] tol=0.0001, gamma=0.01, class_weight=None, C=0.001 ..............
[CV] tol=0.0001, gamma=0.01, class_weight=None, C=0.001, score=0.164, total= 0.5s
[CV] tol=0.0001, gamma=1e-06, class_weight=None, C=0.1 ...............
[CV] tol=0.0001, gamma=1e-06, class_weight=None, C=0.1, score=0.102, total= 0.5s
[CV] tol=0.0001, gamma=1e-06, class_weight=None, C=0.1 ...............
[CV] tol=0.0001, gamma=1e-06, class_weight=None, C=0.1, score=0.102, total= 0.5s
[CV] tol=0.0001, gamma=1e-06, class_weight=None, C=0.1 ...............
[CV] tol=0.0001, gamma=1e-06, class_weight=None, C=0.1, score=0.292, total= 0.5s
[CV] tol=0.0001, gamma=0.1, class_weight=balanced, C=1000.0 ..........
[CV] tol=0.0001, gamma=0.1, class_weight=balanced, C=1000.0, score=0.102, total= 0.5s
[CV] tol=0.01, gamma=1e-08, class_weight=balanced, C=0.1 .............
[CV] tol=0.01, gamma=1e-08, class_weight=balanced, C=0.1, score=0.098, total= 0.5s
[CV] tol=0.01, gamma=1e-08, class_weight=balanced, C=0.1 .............
[CV] tol=0.01, gamma=1e-08, class_weight=balanced, C=0.1, score=0.197, total= 0.5s
[CV] tol=0.01, gamma=10000.0, class_weight=None, C=0.1 ...............
[CV] tol=0.01, gamma=10000.0, class_weight=None, C=0.1, score=0.102, total= 0.4s
[CV] tol=0.01, gamma=10000.0, class_weight=None, C=0.1 ...............
[CV] tol=0.01, gamma=10000.0, class_weight=None, C=0.1, score=0.102, total= 0.4s
[CV] tol=0.01, gamma=10000.0, class_weight=None, C=0.1 ...............
[CV] tol=0.01, gamma=10000.0, class_weight=None, C=0.1, score=0.100, total= 0.4s
[CV] tol=0.0001, gamma=10.0, class_weight=balanced, C=1000.0 .........
[CV] tol=0.0001, gamma=10.0, class_weight=balanced, C=1000.0, score=0.102, total= 0.4s
[CV] tol=0.0001, gamma=10.0, class_weight=balanced, C=1000.0 .........
[CV] tol=0.0001, gamma=10.0, class_weight=balanced, C=1000.0, score=0.102, total= 0.4s
[CV] tol=0.0001, gamma=10.0, class_weight=balanced, C=1000.0 .........
[CV] tol=0.0001, gamma=10.0, class_weight=balanced, C=1000.0, score=0.100, total= 0.4s
[CV] tol=0.01, gamma=1000000.0, class_weight=balanced, C=0.0001 ......
[CV] tol=0.01, gamma=1000000.0, class_weight=balanced, C=0.0001, score=0.100, total= 0.4s
[CV] tol=0.01, gamma=1000000.0, class_weight=balanced, C=0.0001 ......
[CV] tol=0.01, gamma=1000000.0, class_weight=balanced, C=0.0001, score=0.100, total= 0.4s
[CV] tol=0.01, gamma=100000.0, class_weight=None, C=100.0 ............
[CV] tol=0.01, gamma=100000.0, class_weight=None, C=100.0, score=0.102, total= 0.4s
[CV] tol=0.01, gamma=100000.0, class_weight=None, C=100.0 ............
[CV] tol=0.01, gamma=100000.0, class_weight=None, C=100.0, score=0.100, total= 0.4s
[CV] tol=0.001, gamma=1000000.0, class_weight=balanced, C=0.01 .......
[CV] tol=0.001, gamma=1000000.0, class_weight=balanced, C=0.01, score=0.098, total= 0.4s
[CV] tol=0.001, gamma=1000000.0, class_weight=balanced, C=0.01 .......
[CV] tol=0.001, gamma=1000000.0, class_weight=balanced, C=0.01, score=0.098, total= 0.4s
[CV] tol=0.001, gamma=1000000.0, class_weight=balanced, C=0.01 .......
[CV] tol=0.001, gamma=1000000.0, class_weight=balanced, C=0.01, score=0.098, total= 0.4s
[CV] tol=0.001, gamma=1e-06, class_weight=balanced, C=1.0 ............
[CV] tol=0.001, gamma=1e-06, class_weight=balanced, C=1.0, score=0.098, total= 0.5s
[CV] tol=0.001, gamma=1e-06, class_weight=balanced, C=1.0 ............
[CV] tol=0.001, gamma=1e-06, class_weight=balanced, C=1.0, score=0.098, total= 0.5s
[CV] tol=0.001, gamma=1e-06, class_weight=balanced, C=1.0 ............
[CV] tol=0.001, gamma=1e-06, class_weight=balanced, C=1.0, score=0.197, total= 0.5s
[CV] tol=0.001, gamma=1e-08, class_weight=balanced, C=1000000.0 ......
[CV] tol=0.001, gamma=1e-08, class_weight=balanced, C=1000000.0, score=0.937, total= 0.1s
[CV] tol=0.001, gamma=1e-08, class_weight=balanced, C=1000000.0 ......
[CV] tol=0.001, gamma=1e-08, class_weight=balanced, C=1000000.0, score=0.960, total= 0.1s
[CV] tol=0.001, gamma=100000.0, class_weight=None, C=0.001 ...........
[CV] tol=0.001, gamma=100000.0, class_weight=None, C=0.001, score=0.102, total= 0.4s
[CV] tol=0.001, gamma=100000.0, class_weight=None, C=0.001 ...........
[CV] tol=0.001, gamma=100000.0, class_weight=None, C=0.001, score=0.102, total= 0.4s
[CV] tol=0.001, gamma=100000.0, class_weight=None, C=0.001 ...........
[CV] tol=0.001, gamma=100000.0, class_weight=None, C=0.001, score=0.100, total= 0.4s
[CV] tol=0.01, gamma=1e-07, class_weight=balanced, C=10000.0 .........
[CV] tol=0.01, gamma=1e-07, class_weight=balanced, C=10000.0, score=0.943, total= 0.1s
[CV] tol=0.01, gamma=1e-07, class_weight=balanced, C=10000.0 .........
[CV] tol=0.01, gamma=1e-07, class_weight=balanced, C=10000.0, score=0.967, total= 0.1s
dask-gateway-qv5rv
distributed.nanny - INFO - Start Nanny at: 'tls://10.72.7.6:41247'
distributed.worker - INFO - Start worker at: tls://10.72.7.6:35359
distributed.worker - INFO - Listening to: tls://10.72.7.6:35359
distributed.worker - INFO - dashboard at: 10.72.7.6:37121
distributed.worker - INFO - Waiting to connect to: tls://dask-gateway-a18bf505dd3d40f0a3178526278c86bb.dask-gateway:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 1
distributed.worker - INFO - Memory: 3.44 GB
distributed.worker - INFO - Local Directory: /home/jovyan/worker-fgp5axq_
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tls://dask-gateway-a18bf505dd3d40f0a3178526278c86bb.dask-gateway:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO - Stopping worker at tls://10.72.7.6:35359
distributed.nanny - INFO - Worker closed
[CV] tol=0.1, gamma=10000000.0, class_weight=balanced, C=1000.0 ......
[CV] tol=0.1, gamma=10000000.0, class_weight=balanced, C=1000.0, score=0.100, total= 0.4s
[CV] tol=0.1, gamma=0.001, class_weight=balanced, C=1.0 ..............
[CV] tol=0.1, gamma=0.001, class_weight=balanced, C=1.0, score=0.982, total= 0.1s
[CV] tol=0.1, gamma=10000000.0, class_weight=balanced, C=0.01 ........
[CV] tol=0.1, gamma=10000000.0, class_weight=balanced, C=0.01, score=0.098, total= 0.4s
[CV] tol=0.01, gamma=1000000.0, class_weight=balanced, C=0.0001 ......
[CV] tol=0.01, gamma=1000000.0, class_weight=balanced, C=0.0001, score=0.098, total= 0.4s
[CV] tol=0.01, gamma=0.1, class_weight=balanced, C=10000.0 ...........
[CV] tol=0.01, gamma=0.1, class_weight=balanced, C=10000.0, score=0.102, total= 0.5s
[CV] tol=0.0001, gamma=0.1, class_weight=balanced, C=1000.0 ..........
[CV] tol=0.0001, gamma=0.1, class_weight=balanced, C=1000.0, score=0.102, total= 0.5s
[CV] tol=0.001, gamma=10000.0, class_weight=None, C=100.0 ............
[CV] tol=0.001, gamma=10000.0, class_weight=None, C=100.0, score=0.102, total= 0.4s
[CV] tol=0.0001, gamma=0.001, class_weight=balanced, C=10.0 ..........
[CV] tol=0.0001, gamma=0.001, class_weight=balanced, C=10.0, score=0.975, total= 0.2s
[CV] tol=0.01, gamma=0.001, class_weight=balanced, C=1e-06 ...........
[CV] tol=0.01, gamma=0.001, class_weight=balanced, C=1e-06, score=0.098, total= 0.5s
[CV] tol=0.001, gamma=10000.0, class_weight=None, C=0.01 .............
[CV] tol=0.001, gamma=10000.0, class_weight=None, C=0.01, score=0.102, total= 0.4s
[CV] tol=0.1, gamma=0.001, class_weight=balanced, C=100000.0 .........
[CV] tol=0.1, gamma=0.001, class_weight=balanced, C=100000.0, score=0.982, total= 0.1s
[CV] tol=0.001, gamma=10000.0, class_weight=None, C=100.0 ............
[CV] tol=0.001, gamma=10000.0, class_weight=None, C=100.0, score=0.100, total= 0.4s
[CV] tol=0.0001, gamma=0.001, class_weight=balanced, C=10.0 ..........
[CV] tol=0.0001, gamma=0.001, class_weight=balanced, C=10.0, score=0.972, total= 0.2s
[CV] tol=0.001, gamma=1e-06, class_weight=balanced, C=10.0 ...........
[CV] tol=0.001, gamma=1e-06, class_weight=balanced, C=10.0, score=0.873, total= 0.4s
[CV] tol=0.001, gamma=10000.0, class_weight=None, C=0.01 .............
[CV] tol=0.001, gamma=10000.0, class_weight=None, C=0.01, score=0.100, total= 0.4s
[CV] tol=0.1, gamma=10000000.0, class_weight=None, C=10000.0 .........
[CV] tol=0.1, gamma=10000000.0, class_weight=None, C=10000.0, score=0.102, total= 0.4s
[CV] tol=0.001, gamma=100000.0, class_weight=None, C=10.0 ............
[CV] tol=0.001, gamma=100000.0, class_weight=None, C=10.0, score=0.102, total= 0.4s
[CV] tol=0.001, gamma=1e-06, class_weight=balanced, C=10.0 ...........
[CV] tol=0.001, gamma=1e-06, class_weight=balanced, C=10.0, score=0.898, total= 0.4s
[CV] tol=0.0001, gamma=0.0001, class_weight=balanced, C=1000.0 .......
[CV] tol=0.0001, gamma=0.0001, class_weight=balanced, C=1000.0, score=0.963, total= 0.1s
[CV] tol=0.1, gamma=10000000.0, class_weight=None, C=10000.0 .........
[CV] tol=0.1, gamma=10000000.0, class_weight=None, C=10000.0, score=0.100, total= 0.4s
[CV] tol=0.001, gamma=100000.0, class_weight=None, C=10.0 ............
[CV] tol=0.001, gamma=100000.0, class_weight=None, C=10.0, score=0.100, total= 0.4s
[CV] tol=0.1, gamma=1e-05, class_weight=None, C=0.0001 ...............
[CV] tol=0.1, gamma=1e-05, class_weight=None, C=0.0001, score=0.102, total= 0.5s
[CV] tol=0.0001, gamma=0.01, class_weight=balanced, C=1e-05 ..........
[CV] tol=0.0001, gamma=0.01, class_weight=balanced, C=1e-05, score=0.132, total= 0.5s
[CV] tol=0.0001, gamma=10.0, class_weight=balanced, C=100.0 ..........
[CV] tol=0.0001, gamma=10.0, class_weight=balanced, C=100.0, score=0.102, total= 0.4s
[CV] tol=0.01, gamma=10000000.0, class_weight=None, C=100000.0 .......
[CV] tol=0.01, gamma=10000000.0, class_weight=None, C=100000.0, score=0.102, total= 0.4s
[CV] tol=0.0001, gamma=10.0, class_weight=balanced, C=100.0 ..........
[CV] tol=0.0001, gamma=10.0, class_weight=balanced, C=100.0, score=0.100, total= 0.4s
[CV] tol=0.01, gamma=10000000.0, class_weight=None, C=100000.0 .......
[CV] tol=0.01, gamma=10000000.0, class_weight=None, C=100000.0, score=0.100, total= 0.4s
[CV] tol=0.0001, gamma=1e-05, class_weight=None, C=0.001 .............
[CV] tol=0.0001, gamma=1e-05, class_weight=None, C=0.001, score=0.102, total= 0.5s
[CV] tol=0.0001, gamma=1000000.0, class_weight=None, C=100000.0 ......
[CV] tol=0.0001, gamma=1000000.0, class_weight=None, C=100000.0, score=0.102, total= 0.5s
[CV] tol=0.01, gamma=1.0, class_weight=balanced, C=1e-06 .............
[CV] tol=0.01, gamma=1.0, class_weight=balanced, C=1e-06, score=0.199, total= 0.4s
[CV] tol=0.01, gamma=1e-06, class_weight=balanced, C=0.0001 ..........
[CV] tol=0.01, gamma=1e-06, class_weight=balanced, C=0.0001, score=0.199, total= 0.5s
[CV] tol=0.001, gamma=100000.0, class_weight=None, C=1e-06 ...........
[CV] tol=0.001, gamma=100000.0, class_weight=None, C=1e-06, score=0.102, total= 0.4s
[CV] tol=0.01, gamma=1e-06, class_weight=balanced, C=0.0001 ..........
[CV] tol=0.01, gamma=1e-06, class_weight=balanced, C=0.0001, score=0.197, total= 0.5s
[CV] tol=0.001, gamma=100000.0, class_weight=None, C=1e-06 ...........
[CV] tol=0.001, gamma=100000.0, class_weight=None, C=1e-06, score=0.102, total= 0.4s
[CV] tol=0.0001, gamma=1e-07, class_weight=None, C=1e-06 .............
[CV] tol=0.0001, gamma=1e-07, class_weight=None, C=1e-06, score=0.102, total= 0.5s
[CV] tol=0.01, gamma=10.0, class_weight=balanced, C=10.0 .............
[CV] tol=0.01, gamma=10.0, class_weight=balanced, C=10.0, score=0.102, total= 0.4s
[CV] tol=0.001, gamma=0.1, class_weight=balanced, C=1000.0 ...........
[CV] tol=0.001, gamma=0.1, class_weight=balanced, C=1000.0, score=0.102, total= 0.5s
[CV] tol=0.1, gamma=1e-06, class_weight=balanced, C=1000.0 ...........
[CV] tol=0.1, gamma=1e-06, class_weight=balanced, C=1000.0, score=0.965, total= 0.1s
[CV] tol=0.1, gamma=10.0, class_weight=None, C=0.001 .................
[CV] tol=0.1, gamma=10.0, class_weight=None, C=0.001, score=0.102, total= 0.4s
[CV] tol=0.0001, gamma=1e-07, class_weight=None, C=1e-06 .............
[CV] tol=0.0001, gamma=1e-07, class_weight=None, C=1e-06, score=0.292, total= 0.5s
[CV] tol=0.01, gamma=1e-08, class_weight=balanced, C=0.1 .............
[CV] tol=0.01, gamma=1e-08, class_weight=balanced, C=0.1, score=0.098, total= 0.5s
[CV] tol=0.01, gamma=100000.0, class_weight=None, C=100.0 ............
[CV] tol=0.01, gamma=100000.0, class_weight=None, C=100.0, score=0.102, total= 0.4s
Client side code
from dask_gateway import Gateway

gateway = Gateway()

# optional
options = gateway.cluster_options()
options  # displays the cluster options widget in the notebook

cluster = gateway.new_cluster(options)
cluster  # displays the cluster widget
# WIDGET:
# I set adaptive scaling from min 1 to max 3

client = cluster.get_client()
import numpy as np
import joblib
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC
digits = load_digits()
param_space = {
'C': np.logspace(-6, 6, 13),
'gamma': np.logspace(-8, 8, 17),
'tol': np.logspace(-4, -1, 4),
'class_weight': [None, 'balanced'],
}
model = SVC(kernel='rbf')
search = RandomizedSearchCV(model, param_space, cv=3, n_iter=50, verbose=10)
%%time
# with Dask workers
with joblib.parallel_backend('dask'):
    search.fit(digits.data, digits.target)
Hmmm, well that's not right. We have tests for this that pass locally, but perhaps something odd is going on. Thanks for the logs and gif - it looks like the scheduler thinks the workers have left, but the workers didn't actually shut down like they should have (a bug in `distributed`). Did you add any sidecar containers to the worker pods? Or are they running with only the standard `dask-worker` container?
@jcrist aha okay, they run with a custom large image with versions reported above. It is the same image for my jupyterhub user environments.
I'm going to update the logs to contain the latest event: after ~43 minutes of inactivity, three pods went away and one new one came up.
aha okay, they run with a custom large image with versions reported above. It is the same image for my jupyterhub user environments.
Sorry, by "sidecar container" I meant did you add any extra container definitions to the pod? Pods running multiple containers are managed a bit differently by the controller than pods running with only the standard single container. Both should work, but how I'd debug would be determined by that.
> I'm going to update the logs to contain the latest event: after ~43 minutes of inactivity, three pods went away and one new one came up.
Interesting. Looks like the worker pods restarted (but maybe not right away? Your logs aren't 100% clear to me).
If the workers are shutting down, but not successfully, I wonder if changing the restartPolicy for worker pods to Never would fix your issue. The following config should do this (untested):
gateway:
  backend:
    worker:
      extraPodConfig:
        restartPolicy: Never
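To verify the policy actually landed on the worker pods, something like this should work (an untested sketch, assuming the kubernetes Python client is installed and configured, and relying on the labels dask-gateway puts on worker pods):

from kubernetes import client as k8s_client, config as k8s_config

# Load credentials from the local kubeconfig
k8s_config.load_kube_config()
v1 = k8s_client.CoreV1Api()
pods = v1.list_namespaced_pod(
    "dask-gateway",
    label_selector="app.kubernetes.io/component=dask-worker",
)
for pod in pods.items:
    # Expect "Never" once extraPodConfig has been applied
    print(pod.metadata.name, pod.spec.restart_policy)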
@jcrist no sidecar containers for anything!
While I can add a restartPolicy: Never, wouldn't an event about a container restarting have shown up in the events of the worker pod though? Or would it? Yes it would, I'm now confident; it would say something like this:
25m Normal Pulling pod/dask-gateway-hfjr8 Pulling image "eu.gcr.io/ds-platform/dsp-user-environment:d44defb"
21m Normal Pulled pod/dask-gateway-hfjr8 Successfully pulled image "eu.gcr.io/ds-platform/dsp-user-environment:d44defb"
21m Normal Created pod/dask-gateway-hfjr8 Created container dask-worker
21m Normal Started pod/dask-gateway-hfjr8 Started container dask-worker
I only saw that once, and that was when the 3 pods became 1 ~43 min in, and the one pod still running experienced a restart.
Or hmm... I only checked the events of one worker pod.
I'll try adding the restartPolicy to all pods no matter what and repeat the experiment.
Ah, just noticed you'd included the pod events for one of the workers. Yeah, you should see events for a container restart.
> I only saw that once, and that was when the 3 pods became 1 ~43 min in, and the one pod still running experienced a restart.
To clarify, did you see the two pods die and the remaining pod restart all at the same time (43 min in)?
A confusion for me: the dask-gateway-hfjr8 worker pod is as old as if it was never removed, yet it has clearly restarted its container, since the container was started later than the pod was created and the logs from the earlier run disappeared. But... still it says restartCount: 0 on it. Perhaps the pod was modified with a kubectl apply or similar?
# kubectl get pod -o yaml dask-gateway-hfjr8
apiVersion: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/podIP: 10.72.7.2/32
  creationTimestamp: "2020-05-04T19:42:45Z"
  generateName: dask-gateway-
  labels:
    app.kubernetes.io/component: dask-worker
    app.kubernetes.io/name: dask-gateway
    app.kubernetes.io/version: 0.7.1
    gateway.dask.org/cluster: a18bf505dd3d40f0a3178526278c86bb
    gateway.dask.org/instance: dask-gateway
  name: dask-gateway-hfjr8
  namespace: dask-gateway
  ownerReferences:
  - apiVersion: v1
    kind: Pod
    name: dask-gateway-a18bf505dd3d40f0a3178526278c86bb
    uid: 2887fc05-14ce-4bd5-b559-475fc903aaa1
  resourceVersion: "196048650"
  selfLink: /api/v1/namespaces/dask-gateway/pods/dask-gateway-hfjr8
  uid: 71bdc5c4-5fbd-4745-8026-bbbedd90f1fc
spec:
  automountServiceAccountToken: false
  containers:
  - args:
    - dask-gateway-worker
    - --nthreads
    - "1"
    - --memory-limit
    - "3435973836"
    - --scheduler-address
    - tls://dask-gateway-a18bf505dd3d40f0a3178526278c86bb.dask-gateway:8786
    env:
    - name: DASK_GATEWAY_API_URL
      value: http://api-dask-gateway.dask-gateway:8000/api
    - name: DASK_GATEWAY_CLUSTER_NAME
      value: dask-gateway.a18bf505dd3d40f0a3178526278c86bb
    - name: DASK_GATEWAY_API_TOKEN
      value: /etc/dask-credentials/api-token
    - name: DASK_GATEWAY_TLS_CERT
      value: /etc/dask-credentials/dask.crt
    - name: DASK_GATEWAY_TLS_KEY
      value: /etc/dask-credentials/dask.pem
    - name: DASK_GATEWAY_WORKER_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    image: eu.gcr.io/ds-platform/dsp-user-environment:d44defb
    imagePullPolicy: IfNotPresent
    name: dask-worker
    ports:
    - containerPort: 8786
      name: scheduler
      protocol: TCP
    - containerPort: 8787
      name: dashboard
      protocol: TCP
    - containerPort: 8788
      name: api
      protocol: TCP
    resources:
      limits:
        cpu: "1"
        memory: "3435973836"
      requests:
        cpu: 100m
        memory: "3"
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /etc/dask-credentials/
      name: dask-credentials
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: gke-ds-platform-dask-workers-501ae6d4-nf6w
  nodeSelector:
    node-purpose: dask-workers
  priority: 0
  restartPolicy: OnFailure
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: dedicated
    operator: Equal
    value: dask-workers
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: dask-credentials
    secret:
      defaultMode: 420
      secretName: dask-gateway-a18bf505dd3d40f0a3178526278c86bb
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-05-04T19:42:45Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2020-05-04T20:30:52Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2020-05-04T20:30:52Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2020-05-04T19:42:45Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://88992fa88a7114216f623df73d0dd1a89e94769bd2399beba2ab923a2ff42728
    image: eu.gcr.io/ds-platform/dsp-user-environment:d44defb
    imageID: docker-pullable://eu.gcr.io/ds-platform/dsp-user-environment@sha256:39f98ee0a35551e0e7d4ad9bf26ac4edf00bcbd3a9efa4d74e764f051b32e924
    lastState: {}
    name: dask-worker
    ready: true
    restartCount: 0
    state:
      running:
        startedAt: "2020-05-04T20:30:51Z"
  hostIP: 10.68.0.30
  phase: Running
  podIP: 10.72.7.2
  qosClass: Burstable
  startTime: "2020-05-04T19:42:45Z"
> But... still it says restartCount: 0 on it. Perhaps the pod was modified with a kubectl apply or similar?
Not sure? Dask-gateway doesn't handle any of that, so if the pod controller didn't reset it to 0 then something else might have. We only observe pods, not modify them.
@jcrist, ah, so the scheduler creates the pods, and it is dask.distributed that owns the scheduler? Ah.
> To clarify, did you see the two pods die and the remaining pod restart all at the same time (43 min in)?
Yepp! Or at least, the three worker pods were alive and well 38 minutes in, and then 43 minutes in I saw a single worker pod in a ContainerCreating status.
According to https://stackoverflow.com/a/59658670, the restartCount is computed from the number of dead containers that have yet to be cleaned up. If the kubelet is cleaning up containers, it will cause that number to be reset.
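For completeness, the same fields can be read programmatically (a small sketch, assuming the kubernetes Python client is installed and configured for this cluster):

from kubernetes import client as k8s_client, config as k8s_config

k8s_config.load_kube_config()
v1 = k8s_client.CoreV1Api()
pod = v1.read_namespaced_pod("dask-gateway-hfjr8", "dask-gateway")
for cs in pod.status.container_statuses:
    # restartCount only reflects dead containers the kubelet hasn't
    # garbage-collected yet, per the Stack Overflow answer above
    print(cs.name, cs.restart_count, cs.last_state)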
> @jcrist, ah, so the scheduler creates the pods, and it is dask.distributed that owns the scheduler? Ah.
No, sorry. dask-gateway creates and manages the scheduler and worker pods, but once a pod is created we only observe it, we don't ever update a created pod (e.g. for a restart). The k8s pod controller is responsible for handling that. dask.distributed doesn't know or do anything with k8s.
> According to https://stackoverflow.com/a/59658670, the restartCount is computed from the number of dead containers that have yet to be cleaned up. If the kubelet is cleaning up containers, it will cause that number to be reset.
Nice catch! Ah, well, that is not very intuitive considering the name of the field.
> No, sorry. dask-gateway creates and manages the scheduler and worker pods, but once a pod is created we only observe it, we don't ever update a created pod (e.g. for a restart). The k8s pod controller is responsible for handling that. dask.distributed doesn't know or do anything with k8s.
Ah!
My ideas of the situation now (a sketch follows the list):
- dask-gateway's api-server creates a DaskCluster resource, the controller adds worker pods and a scheduler pod for the DaskCluster resource, and the api-server reads information from the DaskCluster resource and reports back to a dask-gateway client speaking with it?
- If worker pods die, they have died from a shutdown, not because something issued a kubectl delete-equivalent k8s API request.
- The controller acting on the DaskCluster resource will attempt to add worker pods if needed. If pods linger after a scale-down, the next scale-up will probably not lead to new worker pods, because the controller will see the lingering pods, even though the worker pods' connections to the scheduler pod may have been severed.
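To make the last point concrete, here is a rough pseudocode sketch of my mental model of the reconcile step (the helpers list_worker_pods, create_worker_pods, and delete_worker_pods are hypothetical names, not the actual controller code):

def reconcile(cluster):
    # Desired worker count comes from the DaskCluster resource
    desired = cluster.spec.replicas
    # All pods labeled for this cluster, including lingering ones whose
    # connections to the scheduler may already have been severed
    pods = list_worker_pods(cluster)
    if len(pods) < desired:
        create_worker_pods(cluster, desired - len(pods))
    elif len(pods) > desired:
        # If this branch never fires (or fails), surplus pods linger and
        # block future scale-ups from creating fresh workers
        delete_worker_pods(cluster, len(pods) - desired)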
> A PodSpec has a restartPolicy field with possible values Always, OnFailure, and Never. The default value is Always. restartPolicy applies to all Containers in the Pod.

I observe that the worker pods currently have a policy of restartPolicy: OnFailure though.
I repeated the experiment with a restartPolicy of Never, and there was no difference so far, but I expect a difference at the event occurring ~43 minutes in, if the timing is similar. Because at that point in time we had a container restart due to a failure, I think, while two pods had container completions without failure.
It seems like this is the trigger that is meant to make the workers die, but they don't. Are they meant to die from this? Oh.... wait...
distributed.scheduler - INFO - Retire workers {<Worker 'tls://10.72.7.5:46847', name: dask-gateway-fc4ds, memory: 0, processing: 0>, <Worker 'tls://10.72.7.4:45735', name: dask-gateway-jjdpm, memory: 0, processing: 0>}
distributed.scheduler - INFO - Closing worker tls://10.72.7.5:46847
distributed.scheduler - INFO - Remove worker <Worker 'tls://10.72.7.5:46847', name: dask-gateway-fc4ds, memory: 0, processing: 0>
distributed.core - INFO - Removing comms to tls://10.72.7.5:46847
distributed.scheduler - INFO - Closing worker tls://10.72.7.4:45735
distributed.scheduler - INFO - Remove worker <Worker 'tls://10.72.7.4:45735', name: dask-gateway-jjdpm, memory: 0, processing: 0>
distributed.core - INFO - Removing comms to tls://10.72.7.4:45735
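As a side note, the same retirement path can be triggered manually from the client, which could help test whether workers actually exit (a sketch; Client.retire_workers is part of dask.distributed, and with no worker list the scheduler picks workers it considers safe to retire):

# Ask the scheduler to retire and close idle workers, then watch
# whether the worker pods actually terminate in k8s
client.retire_workers(close_workers=True)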
Perhaps it matters that I'm using this entrypoint in my image?
ENTRYPOINT ["tini", "-g", "--"]
UPDATE: No, because the dask-worker image also does that, as can be seen in https://github.com/dask/dask-gateway/blob/master/resources/helm/images/dask-gateway-base/Dockerfile.
Only the args of the container are overridden, which in k8s lingo maps to CMD in Dockerfile lingo (command in k8s maps to ENTRYPOINT in Dockerfiles).
Args:
dask-gateway-worker
--nthreads
1
--memory-limit
3435973836
--scheduler-address
tls://dask-gateway-0c66d56247e044259ab211cfdb603d3c.dask-gateway:8786
Hmmm, there goes that theory. It seems like there might be a bug in distributed. I'll take a look later this week. What version of distributed do you have in your image?
> It seems like this is the trigger that is meant to make the workers die, but they don't. Are they meant to die from this? Oh.... wait...
Yeah, that's the scheduler saying that it has told the worker to close and disconnected it, but the worker is still remaining up (a bug).
Version
Dask Helm chart v0.7.1
My JupyterHub user environment uses this image, which is also used by the worker pods.
$ pip list | grep dask
dask 2.15.0
dask-gateway 0.7.1
dask-labextension 2.0.2
In the Dockerfile, I install this with pip:
dask[complete] \
dask-gateway==0.7.1 \
dask_labextension \
I also install the dask-labextension jupyterlab extension, and do a jupyter serverextension enable --sys-prefix --py dask_labextension.
> My JupyterHub user environment uses this image, which is also used by the worker pods.
Sorry, the important library is distributed (not dask) - what version of that do you have installed?
$ pip list | grep distributed
distributed 2.15.2
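For what it's worth, distributed's built-in version check can confirm whether the client, scheduler, and workers all agree (a minimal sketch, assuming an active client from cluster.get_client() as above):

# Raises a version-mismatch error if client, scheduler, and
# workers report different package versions
versions = client.get_versions(check=True)
print(versions["scheduler"])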