
k8s controller: DaskCluster's replicas lowered, worker pods not deleted

Open consideRatio opened this issue 4 years ago • 18 comments

Issue summary

I ended up with active but unused worker pods that the scheduler cannot use: they block the creation of new pods when a DaskCluster needs to scale up, yet are not deleted when a DaskCluster lowers its replica count.

My understanding

  • my_cluster.get_client() returns a dask.distributed.Client instance.
  • The client instance is connected to the associated cluster's scheduler and gets its information about workers from the scheduler.
  • The scheduler is quick to adapt and scale down, and the DaskCluster resource often quickly follows.
  • The DaskCluster resource can adapt upwards again, but if a worker was previously scaled away, its pod may end up unused and unregistered by the scheduler.
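The bullets above can be condensed into a toy model of the suspected mismatch. This is not dask-gateway code; the class and attribute names are hypothetical, and it only illustrates the hypothesis that the scheduler retires workers immediately while the controller never deletes their pods because the replicas field is not lowered.

```python
# Toy model (hypothetical names, NOT dask-gateway source) of the
# mismatch: the scheduler drops workers from its registry at once,
# but the controller only acts on the DaskCluster replicas field.

class ToyScheduler:
    def __init__(self):
        self.workers = set()   # workers currently registered
        self.desired = 0       # target worker count

    def adapt(self, target):
        """Retire surplus workers immediately, like adaptive scaling."""
        self.desired = target
        while len(self.workers) > self.desired:
            self.workers.pop()

class ToyController:
    """Reconciles pods against the DaskCluster replicas field only."""
    def __init__(self):
        self.pods = set()

    def reconcile(self, replicas):
        while len(self.pods) < replicas:        # scale up: create pods
            self.pods.add(f"pod-{len(self.pods)}")
        # No scale-down branch: the controller waits for pods to stop.

sched = ToyScheduler()
ctrl = ToyController()

ctrl.reconcile(replicas=3)          # DaskCluster says 3 replicas
sched.workers = set(ctrl.pods)      # all 3 workers register
sched.adapt(1)                      # scheduler scales down to 1
ctrl.reconcile(replicas=3)          # replicas was never lowered...

assert len(sched.workers) == 1      # scheduler knows only 1 worker
assert len(ctrl.pods) == 3          # ...yet 3 pods keep running
```

If the replicas field lags behind (or is never lowered), this is exactly the "active but unused pods" state reported above.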

Original report extracted from #246

GatewayCluster widget inconsistent with actual pods

After my images had been pulled and pods could start quickly, I reset my state by deleting clusters etc. Then:

  1. I created a new cluster and chose to adapt 0-5 workers in it.
  2. I ran a job with a client to the cluster that finished in 20 seconds; according to the client I had 5 workers for a while, and then 0 again.
  3. I observed that I still had 5 pods, and my DaskCluster had 5 replicas.
    • The controller is, in my mind, thereby doing its job: it ensures 5 replicas of the workers.
    • Something is wrong though, because the scheduler knew to delete workers, but that didn't lead to a change in the DaskCluster resource, so the controller didn't remove any pods.
  4. I decided to try running my job again: would it add five new pods? It turns out no; it would instead error with a Timeout, as if my workers failed to get started fast enough.
  5. I tried adapting to 6/6; that added one worker, and the client could observe 1 worker while the DaskCluster resource reported 6 replicas and I saw 6 pods.
  6. I ran my workload, and ended up doing work using a single worker.
  7. I tried adapting to 0-3 workers, but kept seeing 6 pods; these were the logs of the controller, while the DaskCluster resource was updated to 1 replica:
    [D 2020-04-14 17:33:11.857 KubeController] Event - MODIFIED cluster dask-gateway.c047a173d45247dd81f232ab60d692ca
    [I 2020-04-14 17:33:11.857 KubeController] Reconciling cluster dask-gateway.c047a173d45247dd81f232ab60d692ca
    [I 2020-04-14 17:33:11.862 KubeController] Finished reconciling cluster dask-gateway.c047a173d45247dd81f232ab60d692ca
    
  8. I tried running my workload, and it seems that my client never registered more than one replica at best.

This may be two separate issues. Hmm... or not?

  1. The controller fails to align correctly with the DaskCluster resource and never successfully deletes the surplus pods that the DaskCluster resource indicates it doesn't need.
  2. Workers that the scheduler has used once aren't reused later if adapted away, except perhaps one single worker.

Cluster adapt 0-5 can trigger as if it were 1-5

If I have a fresh cluster and press adapt 0-5, it doesn't create a worker for me, but I have ended up in a state where just going from 0-0 to 0-5 would add back a replica in the DaskCluster resource. I think the scheduler ended up thinking it kept needing one. This is a state I failed to reproduce quickly with a new cluster.

consideRatio avatar Apr 14 '20 19:04 consideRatio

I'm having a bit of a hard time parsing the issue as described above. I'm going to describe how things are supposed to work below and hope that provides some understanding. In general I haven't seen any scaling issues when running dask-gateway on k8s, so I'm curious to know what's causing issues on your end (latency, different configuration, something else?).

When a user calls e.g. cluster.scale(5) the following happens:

  • The GatewayCluster sends an api request to the gateway api server indicating the cluster should have 5 workers
  • The api server validates the request, and forwards it to the corresponding scheduler for that cluster
  • The scheduler stores the new desired number of workers (call this worker_count)
  • The scheduler computes delta = worker_count - n_active_workers. This happens periodically, as well as whenever a worker joins or leaves the cluster.
    • If delta > 0, it sends a heartbeat api request to the gateway api server indicating it would like to have worker_count total workers.
    • If delta < 0, it sends a heartbeat api request to the gateway api server indicating it would like to have worker_count total workers. It then actively shuts down |delta| workers. When these workers exit, the controller should notice and delete them. This should work regardless of whether the dask-worker is the only container in the pod (multi-container pods won't move to Succeeded/Failed if only one container stops, but the controller should still figure this out).
  • When the api server gets a heartbeat api request, it updates the corresponding DaskCluster replicas field to the new desired worker_count.
  • The controller sees the updated replicas value and makes a decision based on the known set of pending & running pods for the corresponding cluster:
    • If replicas > pending + running, then new worker pods are submitted to scale up the cluster
    • If replicas < pending + running, then nothing happens. The controller relies on worker pods shutting themselves down when the scheduler tells them to.
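The scaling decisions described above can be sketched as two small pure functions. These are illustrative only (the real logic lives in the scheduler plugin and the k8s controller, and these helper names are made up), but they capture the decision table:

```python
# Sketch of the scaling decisions described above. Hypothetical
# helper names; not the actual dask-gateway implementation.

def scheduler_delta(worker_count, n_active_workers):
    """delta > 0: request more workers; delta < 0: retire -delta workers."""
    return worker_count - n_active_workers

def controller_action(replicas, pending, running):
    """What the controller does on one reconcile pass."""
    current = pending + running
    if replicas > current:
        # Scale up: submit new worker pods.
        return ("create", replicas - current)
    # replicas <= current: do nothing; rely on workers shutting
    # themselves down when the scheduler tells them to.
    return ("wait", 0)

assert scheduler_delta(5, 3) == 2                   # needs 2 more workers
assert scheduler_delta(1, 3) == -2                  # retire 2 workers
assert controller_action(5, 1, 2) == ("create", 2)  # submit 2 pods
assert controller_action(1, 0, 3) == ("wait", 0)    # never deletes running pods
```

Note how the "wait" branch is the one that leaves pods alive if a worker never actually exits.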

The above process is the same as when running with adaptive scaling, except that in that case the initial scale request comes from the scheduler alone.

The biggest potential failure mode I can think of is if the scheduler asks a worker to shut itself down, and it fails to do so. In this case the pod won't be noticed by the controller (since it's still running). I see this as a bug in distributed; dask workers should shut themselves down cleanly.


Does the above make sense? If you're able to reproduce, the following would be helpful:

  • Logs of the controller in DEBUG mode
  • Logs from the api server
  • Logs from the corresponding scheduler with lingering workers
  • Logs from the lingering workers
  • The client-side code that led to this behavior (e.g. cluster.scale(5)).

jcrist avatar Apr 24 '20 21:04 jcrist

Sorry for the messy report; there was a lot going on and I hadn't isolated an issue yet. Hopefully this can make up for it.

Reproduction

Summary

In the gif below, you will see me start from a blank slate with no DaskCluster resource, no Dask scheduler, and no Dask workers for any scheduler. Then I create a DaskCluster with new_cluster, set adaptive scaling to min 1 / max 3, and run some work, which makes two more workers get added up to the maximum of three.

Soon after the job finishes, everything I see reports that there is only 1 worker, except that I still have three pods. They are still alive now, 24 minutes later, without me doing anything else.

GIF recording

pods-surviving

Versions

Dask Gateway Helm chart v0.7.1

$ pip list | grep dask
dask                          2.15.0
dask-gateway                  0.7.1
dask-labextension             2.0.2

My JupyterLab environment uses the same image as the worker pods, so all share the same versions of dask etc.

Logs

Controller, api server, scheduler, and three worker pods, all from the event in the GIF animation. These logs extend to 30 minutes after the job delegated to dask finished, except for the controller's, where I trimmed away a repeating pattern after some iterations.

controller

[D 2020-05-04 19:42:45.874 KubeController] Event - MODIFIED cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 19:42:45.875 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 19:42:45.876 KubeController] Cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb scaled to 1 - creating 1 workers, deleting 0 stopped workers
[D 2020-05-04 19:42:45.903 KubeController] Created pod dask-gateway/dask-gateway-hfjr8
[D 2020-05-04 19:42:45.904 KubeController] Event - ADDED pod dask-gateway.dask-gateway-hfjr8
[I 2020-05-04 19:42:45.904 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 19:42:45.909 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-hfjr8
[D 2020-05-04 19:42:45.928 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-hfjr8
[I 2020-05-04 19:42:45.929 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 19:42:45.929 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 19:42:46.609 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-hfjr8
[I 2020-05-04 19:42:46.609 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 19:42:46.609 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 19:42:47.191 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-hfjr8
[I 2020-05-04 19:42:47.192 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 19:42:47.192 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 19:42:54.873 KubeController] Event - MODIFIED cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 19:42:54.874 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 19:42:54.874 KubeController] Cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb scaled to 3 - creating 2 workers, deleting 0 stopped workers
[D 2020-05-04 19:42:54.893 KubeController] Created pod dask-gateway/dask-gateway-qv5rv
[D 2020-05-04 19:42:54.894 KubeController] Event - ADDED pod dask-gateway.dask-gateway-qv5rv
[D 2020-05-04 19:42:54.897 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-qv5rv
[D 2020-05-04 19:42:54.910 KubeController] Event - ADDED pod dask-gateway.dask-gateway-g5447
[D 2020-05-04 19:42:54.912 KubeController] Created pod dask-gateway/dask-gateway-g5447
[I 2020-05-04 19:42:54.912 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 19:42:54.914 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-qv5rv
[D 2020-05-04 19:42:54.916 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-g5447
[D 2020-05-04 19:42:54.930 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-g5447
[I 2020-05-04 19:42:54.931 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 19:42:54.931 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 19:42:55.616 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-qv5rv
[I 2020-05-04 19:42:55.616 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 19:42:55.617 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 19:42:55.628 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-g5447
[I 2020-05-04 19:42:55.629 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 19:42:55.630 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 19:42:56.272 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-g5447
[I 2020-05-04 19:42:56.272 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 19:42:56.273 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 19:42:56.273 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-qv5rv
[I 2020-05-04 19:42:56.273 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 19:42:56.273 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 19:43:21.886 KubeController] Event - MODIFIED cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 19:43:21.887 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 19:43:21.888 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 19:45:20.958 KubeController] Too old resourceVersion in endpoints informer, relisting
[D 2020-05-04 19:45:20.971 KubeController] Relisted endpoints informer - 0 deletes, 1 updates
[I 2020-05-04 19:45:20.972 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 19:45:20.973 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 19:50:31.071 KubeController] Too old resourceVersion in endpoints informer, relisting
[D 2020-05-04 19:50:31.084 KubeController] Relisted endpoints informer - 0 deletes, 1 updates
[I 2020-05-04 19:50:31.085 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 19:50:31.086 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 19:55:41.187 KubeController] Too old resourceVersion in endpoints informer, relisting
[D 2020-05-04 19:55:41.204 KubeController] Relisted endpoints informer - 0 deletes, 1 updates
[I 2020-05-04 19:55:41.205 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 19:55:41.206 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 20:00:51.329 KubeController] Too old resourceVersion in endpoints informer, relisting
[D 2020-05-04 20:00:51.341 KubeController] Relisted endpoints informer - 0 deletes, 1 updates
[I 2020-05-04 20:00:51.343 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 20:00:51.343 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 20:06:01.435 KubeController] Too old resourceVersion in endpoints informer, relisting
[D 2020-05-04 20:06:01.446 KubeController] Relisted endpoints informer - 0 deletes, 1 updates
[I 2020-05-04 20:06:01.447 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 20:06:01.447 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 20:11:11.565 KubeController] Too old resourceVersion in endpoints informer, relisting
[D 2020-05-04 20:11:11.576 KubeController] Relisted endpoints informer - 0 deletes, 1 updates
[I 2020-05-04 20:11:11.577 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 20:11:11.577 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 20:16:21.678 KubeController] Too old resourceVersion in endpoints informer, relisting
[D 2020-05-04 20:16:21.692 KubeController] Relisted endpoints informer - 0 deletes, 1 updates
[I 2020-05-04 20:16:21.693 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 20:16:21.693 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 20:21:31.812 KubeController] Too old resourceVersion in endpoints informer, relisting
[D 2020-05-04 20:21:31.825 KubeController] Relisted endpoints informer - 0 deletes, 1 updates
[I 2020-05-04 20:21:31.826 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 20:21:31.827 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 20:25:29.199 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-g5447
[I 2020-05-04 20:25:29.199 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 20:25:29.199 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 20:25:29.205 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-hfjr8
[I 2020-05-04 20:25:29.205 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 20:25:29.205 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 20:25:29.209 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-qv5rv
[I 2020-05-04 20:25:29.209 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 20:25:29.210 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 20:26:20.917 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-hfjr8
[I 2020-05-04 20:26:20.918 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 20:26:20.918 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 20:26:21.316 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-g5447
[I 2020-05-04 20:26:21.317 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 20:26:21.317 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 20:26:21.715 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-qv5rv
[I 2020-05-04 20:26:21.716 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 20:26:21.716 KubeController] Cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb scaled to 1 - deleting 1 pending and 0 stopped workers
[D 2020-05-04 20:26:21.733 KubeController] Deleted pod dask-gateway/dask-gateway-qv5rv
[I 2020-05-04 20:26:21.734 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 20:26:21.734 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-qv5rv
[D 2020-05-04 20:26:22.916 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-qv5rv
[D 2020-05-04 20:26:23.117 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-qv5rv
[D 2020-05-04 20:26:23.121 KubeController] Event - DELETED pod dask-gateway.dask-gateway-qv5rv
[I 2020-05-04 20:26:23.121 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 20:26:23.122 KubeController] Cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb scaled to 1 - deleting 1 pending and 0 stopped workers
[D 2020-05-04 20:26:23.139 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-g5447
[D 2020-05-04 20:26:23.140 KubeController] Deleted pod dask-gateway/dask-gateway-g5447
[I 2020-05-04 20:26:23.141 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 20:26:23.918 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-g5447
[D 2020-05-04 20:26:24.522 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-g5447
[D 2020-05-04 20:26:24.527 KubeController] Event - DELETED pod dask-gateway.dask-gateway-g5447
[I 2020-05-04 20:26:24.529 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 20:26:24.529 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 20:26:30.777 KubeController] Event - MODIFIED pod dask-gateway.dask-gateway-hfjr8
[I 2020-05-04 20:26:30.777 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 20:26:30.778 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 20:26:41.928 KubeController] Too old resourceVersion in endpoints informer, relisting
[D 2020-05-04 20:26:41.938 KubeController] Relisted endpoints informer - 0 deletes, 1 updates
[I 2020-05-04 20:26:41.939 KubeController] Reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[I 2020-05-04 20:26:41.939 KubeController] Finished reconciling cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
api

[I 2020-05-04 19:42:45.854 DaskGateway] 200 POST /api/v1/clusters/dask-gateway.a18bf505dd3d40f0a3178526278c86bb/adapt 9.012ms
[I 2020-05-04 19:42:45.859 DaskGateway] Cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb heartbeat [count: 1, n_active: 0, n_closing: 0, n_closed: 0]
[D 2020-05-04 19:42:45.874 DaskGateway] Event - MODIFIED cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 19:42:45.875 DaskGateway] Updating dask-gateway.a18bf505dd3d40f0a3178526278c86bb in cache
[I 2020-05-04 19:42:45.876 DaskGateway] 200 POST /api/v1/clusters/dask-gateway.a18bf505dd3d40f0a3178526278c86bb/heartbeat 17.362ms
[I 2020-05-04 19:42:54.863 DaskGateway] Cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb heartbeat [count: 3, n_active: 1, n_closing: 0, n_closed: 0]
[I 2020-05-04 19:42:54.873 DaskGateway] 200 POST /api/v1/clusters/dask-gateway.a18bf505dd3d40f0a3178526278c86bb/heartbeat 10.893ms
[D 2020-05-04 19:42:54.874 DaskGateway] Event - MODIFIED cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 19:42:54.875 DaskGateway] Updating dask-gateway.a18bf505dd3d40f0a3178526278c86bb in cache
[I 2020-05-04 19:43:21.873 DaskGateway] Cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb heartbeat [count: 1, n_active: 1, n_closing: 2, n_closed: 0]
[D 2020-05-04 19:43:21.887 DaskGateway] Event - MODIFIED cluster dask-gateway.a18bf505dd3d40f0a3178526278c86bb
[D 2020-05-04 19:43:21.888 DaskGateway] Updating dask-gateway.a18bf505dd3d40f0a3178526278c86bb in cache
[I 2020-05-04 19:43:21.888 DaskGateway] 200 POST /api/v1/clusters/dask-gateway.a18bf505dd3d40f0a3178526278c86bb/heartbeat 15.711ms
scheduler

Events

Events:
  Type    Reason     Age   From                                                Message
  ----    ------     ----  ----                                                -------
  Normal  Scheduled  59m   jupyterhub-user-scheduler                           Successfully assigned dask-gateway/dask-gateway-a18bf505dd3d40f0a3178526278c86bb to gke-ds-platform-users-small-32f6a547-5t79
  Normal  Pulled     59m   kubelet, gke-ds-platform-users-small-32f6a547-5t79  Container image "eu.gcr.io/ds-platform/dsp-user-environment:d44defb" already present on machine
  Normal  Created    59m   kubelet, gke-ds-platform-users-small-32f6a547-5t79  Created container dask-scheduler
  Normal  Started    59m   kubelet, gke-ds-platform-users-small-32f6a547-5t79  Started container dask-scheduler

Initial logs

kubectl logs dask-gateway-a18bf505dd3d40f0a3178526278c86bb
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at:    tls://10.72.5.174:8786
distributed.scheduler - INFO -   dashboard at:                     :8787
distributed.scheduler - INFO -     gateway at:                     :8788
dask_gateway.dask_cli - INFO - Enabling adaptive scaling, minimum=1, maximum=3
dask_gateway.dask_cli - INFO - Requesting scale to 1 workers from 0
distributed.scheduler - INFO - Register worker <Worker 'tls://10.72.7.5:37253', name: dask-gateway-hfjr8, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tls://10.72.7.5:37253
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Receive client connection: Client-716b6394-8e3f-11ea-82c6-ce62591aacb0
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Receive client connection: Client-worker-726e185a-8e3f-11ea-800a-224f203b151c
distributed.core - INFO - Starting established connection
dask_gateway.dask_cli - INFO - Requesting scale to 3 workers from 1
distributed.scheduler - INFO - Register worker <Worker 'tls://10.72.7.7:37055', name: dask-gateway-g5447, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tls://10.72.7.7:37055
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Register worker <Worker 'tls://10.72.7.6:35359', name: dask-gateway-qv5rv, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tls://10.72.7.6:35359
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Receive client connection: Client-worker-74608546-8e3f-11ea-800a-fe33354d3c80
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Receive client connection: Client-worker-7461187a-8e3f-11ea-800b-2e1b9958b987
distributed.core - INFO - Starting established connection
dask_gateway.dask_cli - INFO - Requesting scale to 1 workers from 3
distributed.scheduler - INFO - Retire worker names ['dask-gateway-g5447', 'dask-gateway-qv5rv']
distributed.scheduler - INFO - Retire workers {<Worker 'tls://10.72.7.7:37055', name: dask-gateway-g5447, memory: 0, processing: 0>, <Worker 'tls://10.72.7.6:35359', name: dask-gateway-qv5rv, memory: 0, processing: 0>}
distributed.scheduler - INFO - Closing worker tls://10.72.7.7:37055
distributed.scheduler - INFO - Remove worker <Worker 'tls://10.72.7.7:37055', name: dask-gateway-g5447, memory: 0, processing: 0>
distributed.core - INFO - Removing comms to tls://10.72.7.7:37055
distributed.scheduler - INFO - Closing worker tls://10.72.7.6:35359
distributed.scheduler - INFO - Remove worker <Worker 'tls://10.72.7.6:35359', name: dask-gateway-qv5rv, memory: 0, processing: 0>
distributed.core - INFO - Removing comms to tls://10.72.7.6:35359

Appendix: log lines that arrived on scaledown

distributed.scheduler - INFO - Remove client Client-worker-726e185a-8e3f-11ea-800a-224f203b151c
distributed.scheduler - INFO - Remove client Client-worker-74608546-8e3f-11ea-800a-fe33354d3c80
distributed.scheduler - INFO - Remove worker <Worker 'tls://10.72.7.5:37253', name: dask-gateway-hfjr8, memory: 0, processing: 0>
distributed.core - INFO - Removing comms to tls://10.72.7.5:37253
distributed.scheduler - INFO - Lost all workers
distributed.scheduler - INFO - Remove client Client-worker-7461187a-8e3f-11ea-800b-2e1b9958b987
distributed.scheduler - INFO - Close client connection: Client-worker-726e185a-8e3f-11ea-800a-224f203b151c
distributed.scheduler - INFO - Close client connection: Client-worker-74608546-8e3f-11ea-800a-fe33354d3c80
distributed.scheduler - INFO - Close client connection: Client-worker-7461187a-8e3f-11ea-800b-2e1b9958b987
distributed.scheduler - INFO - Register worker <Worker 'tls://10.72.7.2:46851', name: dask-gateway-hfjr8, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tls://10.72.7.2:46851
distributed.core - INFO - Starting established connection
traefik

These are not all the logs; this pattern repeats over and over.

time="2020-05-04T20:32:09Z" level=debug msg="Skipping Kubernetes event kind *v1.Endpoints" providerName=kubernetescrd
time="2020-05-04T20:32:10Z" level=debug msg="Dropping event kind *v1.Endpoints due to throttling" providerName=kubernetescrd
time="2020-05-04T20:32:10Z" level=debug msg="Dropping event kind *v1.Endpoints due to throttling" providerName=kubernetescrd
time="2020-05-04T20:32:10Z" level=debug msg="Dropping event kind *v1.Endpoints due to throttling" providerName=kubernetescrd
time="2020-05-04T20:32:10Z" level=debug msg="Dropping event kind *v1.Endpoints due to throttling" providerName=kubernetescrd
time="2020-05-04T20:32:11Z" level=debug msg="Dropping event kind *v1.Endpoints due to throttling" providerName=kubernetescrd
time="2020-05-04T20:32:11Z" level=debug msg="Skipping Kubernetes event kind *v1.Endpoints" providerName=kubernetescrd
time="2020-05-04T20:32:11Z" level=debug msg="Dropping event kind *v1.Endpoints due to throttling" providerName=kubernetescrd
time="2020-05-04T20:32:12Z" level=debug msg="Dropping event kind *v1.Endpoints due to throttling" providerName=kubernetescrd
time="2020-05-04T20:32:12Z" level=debug msg="Dropping event kind *v1.Endpoints due to throttling" providerName=kubernetescrd
time="2020-05-04T20:32:12Z" level=debug msg="Dropping event kind *v1.Endpoints due to throttling" providerName=kubernetescrd
time="2020-05-04T20:32:12Z" level=debug msg="Dropping event kind *v1.Endpoints due to throttling" providerName=kubernetescrd
time="2020-05-04T20:32:13Z" level=debug msg="Dropping event kind *v1.Endpoints due to throttling" providerName=kubernetescrd
time="2020-05-04T20:32:13Z" level=debug msg="Skipping Kubernetes event kind *v1.Endpoints" providerName=kubernetescrd

Worker logs

dask-gateway-g5447

distributed.nanny - INFO -         Start Nanny at: 'tls://10.72.7.7:35855'
distributed.worker - INFO -       Start worker at:      tls://10.72.7.7:37055
distributed.worker - INFO -          Listening to:      tls://10.72.7.7:37055
distributed.worker - INFO -          dashboard at:            10.72.7.7:46369
distributed.worker - INFO - Waiting to connect to: tls://dask-gateway-a18bf505dd3d40f0a3178526278c86bb.dask-gateway:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                    3.44 GB
distributed.worker - INFO -       Local Directory: /home/jovyan/worker-oak09i88
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tls://dask-gateway-a18bf505dd3d40f0a3178526278c86bb.dask-gateway:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO - Stopping worker at tls://10.72.7.7:37055
distributed.nanny - INFO - Worker closed
[CV] tol=0.1, gamma=10000000.0, class_weight=balanced, C=1000.0 ......
[CV]  tol=0.1, gamma=10000000.0, class_weight=balanced, C=1000.0, score=0.102, total=   0.4s
[CV] tol=0.1, gamma=0.001, class_weight=balanced, C=1.0 ..............
[CV]  tol=0.1, gamma=0.001, class_weight=balanced, C=1.0, score=0.975, total=   0.1s
[CV] tol=0.0001, gamma=0.1, class_weight=balanced, C=1000.0 ..........
[CV]  tol=0.0001, gamma=0.1, class_weight=balanced, C=1000.0, score=0.102, total=   0.5s
[CV] tol=0.001, gamma=1e-08, class_weight=balanced, C=1000000.0 ......
[CV]  tol=0.001, gamma=1e-08, class_weight=balanced, C=1000000.0, score=0.937, total=   0.1s
[CV] tol=0.1, gamma=10000000.0, class_weight=balanced, C=0.01 ........
[CV]  tol=0.1, gamma=10000000.0, class_weight=balanced, C=0.01, score=0.098, total=   0.4s
[CV] tol=0.01, gamma=0.001, class_weight=balanced, C=1e-06 ...........
[CV]  tol=0.01, gamma=0.001, class_weight=balanced, C=1e-06, score=0.098, total=   0.5s
[CV] tol=0.01, gamma=0.1, class_weight=balanced, C=10000.0 ...........
[CV]  tol=0.01, gamma=0.1, class_weight=balanced, C=10000.0, score=0.102, total=   0.5s
[CV] tol=0.1, gamma=0.001, class_weight=balanced, C=100000.0 .........
[CV]  tol=0.1, gamma=0.001, class_weight=balanced, C=100000.0, score=0.977, total=   0.1s
[CV] tol=0.001, gamma=10000.0, class_weight=None, C=100.0 ............
[CV]  tol=0.001, gamma=10000.0, class_weight=None, C=100.0, score=0.102, total=   0.4s
[CV] tol=0.0001, gamma=0.001, class_weight=balanced, C=10.0 ..........
[CV]  tol=0.0001, gamma=0.001, class_weight=balanced, C=10.0, score=0.982, total=   0.2s
[CV] tol=0.01, gamma=0.001, class_weight=balanced, C=1e-06 ...........
[CV]  tol=0.01, gamma=0.001, class_weight=balanced, C=1e-06, score=0.199, total=   0.5s
[CV] tol=0.001, gamma=10000.0, class_weight=None, C=0.01 .............
[CV]  tol=0.001, gamma=10000.0, class_weight=None, C=0.01, score=0.102, total=   0.4s
[CV] tol=0.1, gamma=0.001, class_weight=balanced, C=100000.0 .........
[CV]  tol=0.1, gamma=0.001, class_weight=balanced, C=100000.0, score=0.970, total=   0.1s
[CV] tol=0.1, gamma=10.0, class_weight=None, C=0.001 .................
[CV]  tol=0.1, gamma=10.0, class_weight=None, C=0.001, score=0.100, total=   0.4s
[CV] tol=0.001, gamma=1e-06, class_weight=balanced, C=100.0 ..........
[CV]  tol=0.001, gamma=1e-06, class_weight=balanced, C=100.0, score=0.927, total=   0.2s
[CV] tol=0.001, gamma=1e-06, class_weight=balanced, C=10.0 ...........
[CV]  tol=0.001, gamma=1e-06, class_weight=balanced, C=10.0, score=0.885, total=   0.4s
[CV] tol=0.0001, gamma=0.0001, class_weight=balanced, C=1000.0 .......
[CV]  tol=0.0001, gamma=0.0001, class_weight=balanced, C=1000.0, score=0.952, total=   0.1s
[CV] tol=0.1, gamma=10000000.0, class_weight=None, C=10000.0 .........
[CV]  tol=0.1, gamma=10000000.0, class_weight=None, C=10000.0, score=0.102, total=   0.4s
[CV] tol=0.001, gamma=100000.0, class_weight=None, C=10.0 ............
[CV]  tol=0.001, gamma=100000.0, class_weight=None, C=10.0, score=0.102, total=   0.4s
[CV] tol=0.1, gamma=1e-05, class_weight=None, C=0.0001 ...............
[CV]  tol=0.1, gamma=1e-05, class_weight=None, C=0.0001, score=0.102, total=   0.5s
[CV] tol=0.0001, gamma=0.0001, class_weight=balanced, C=1000.0 .......
[CV]  tol=0.0001, gamma=0.0001, class_weight=balanced, C=1000.0, score=0.952, total=   0.1s
[CV] tol=0.0001, gamma=10.0, class_weight=balanced, C=100.0 ..........
[CV]  tol=0.0001, gamma=10.0, class_weight=balanced, C=100.0, score=0.102, total=   0.4s
[CV] tol=0.01, gamma=10000000.0, class_weight=None, C=100000.0 .......
[CV]  tol=0.01, gamma=10000000.0, class_weight=None, C=100000.0, score=0.102, total=   0.4s
[CV] tol=0.1, gamma=1e-05, class_weight=None, C=0.0001 ...............
[CV]  tol=0.1, gamma=1e-05, class_weight=None, C=0.0001, score=0.292, total=   0.5s
[CV] tol=0.0001, gamma=0.01, class_weight=balanced, C=1e-05 ..........
[CV]  tol=0.0001, gamma=0.01, class_weight=balanced, C=1e-05, score=0.127, total=   0.5s
[CV] tol=0.01, gamma=1.0, class_weight=balanced, C=1e-06 .............
[CV]  tol=0.01, gamma=1.0, class_weight=balanced, C=1e-06, score=0.098, total=   0.4s
[CV] tol=0.0001, gamma=0.01, class_weight=balanced, C=1e-05 ..........
[CV]  tol=0.0001, gamma=0.01, class_weight=balanced, C=1e-05, score=0.145, total=   0.5s
[CV] tol=0.01, gamma=1.0, class_weight=balanced, C=1e-06 .............
[CV]  tol=0.01, gamma=1.0, class_weight=balanced, C=1e-06, score=0.098, total=   0.4s
[CV] tol=0.01, gamma=1e-06, class_weight=balanced, C=0.0001 ..........
[CV]  tol=0.01, gamma=1e-06, class_weight=balanced, C=0.0001, score=0.199, total=   0.5s
[CV] tol=0.0001, gamma=1e-05, class_weight=None, C=0.001 .............
[CV]  tol=0.0001, gamma=1e-05, class_weight=None, C=0.001, score=0.102, total=   0.5s
[CV] tol=0.0001, gamma=1000000.0, class_weight=None, C=100000.0 ......
[CV]  tol=0.0001, gamma=1000000.0, class_weight=None, C=100000.0, score=0.102, total=   0.4s
[CV] tol=0.0001, gamma=1e-05, class_weight=None, C=0.001 .............
[CV]  tol=0.0001, gamma=1e-05, class_weight=None, C=0.001, score=0.292, total=   0.5s
[CV] tol=0.0001, gamma=1000000.0, class_weight=None, C=100000.0 ......
[CV]  tol=0.0001, gamma=1000000.0, class_weight=None, C=100000.0, score=0.100, total=   0.4s
[CV] tol=0.01, gamma=10.0, class_weight=balanced, C=10.0 .............
[CV]  tol=0.01, gamma=10.0, class_weight=balanced, C=10.0, score=0.102, total=   0.4s
[CV] tol=0.001, gamma=0.1, class_weight=balanced, C=1000.0 ...........
[CV]  tol=0.001, gamma=0.1, class_weight=balanced, C=1000.0, score=0.102, total=   0.5s
[CV] tol=0.1, gamma=1e-06, class_weight=balanced, C=1000.0 ...........
[CV]  tol=0.1, gamma=1e-06, class_weight=balanced, C=1000.0, score=0.945, total=   0.1s
[CV] tol=0.001, gamma=100000.0, class_weight=None, C=1e-06 ...........
[CV]  tol=0.001, gamma=100000.0, class_weight=None, C=1e-06, score=0.100, total=   0.4s
[CV] tol=0.0001, gamma=1e-07, class_weight=None, C=1e-06 .............
[CV]  tol=0.0001, gamma=1e-07, class_weight=None, C=1e-06, score=0.102, total=   0.5s
[CV] tol=0.01, gamma=10.0, class_weight=balanced, C=10.0 .............
[CV]  tol=0.01, gamma=10.0, class_weight=balanced, C=10.0, score=0.100, total=   0.4s
[CV] tol=0.001, gamma=0.1, class_weight=balanced, C=1000.0 ...........
[CV]  tol=0.001, gamma=0.1, class_weight=balanced, C=1000.0, score=0.102, total=   0.5s
[CV] tol=0.1, gamma=1e-06, class_weight=balanced, C=1000.0 ...........
[CV]  tol=0.1, gamma=1e-06, class_weight=balanced, C=1000.0, score=0.935, total=   0.1s
[CV] tol=0.1, gamma=10.0, class_weight=None, C=0.001 .................
[CV]  tol=0.1, gamma=10.0, class_weight=None, C=0.001, score=0.102, total=   0.4s
[CV] tol=0.0001, gamma=0.1, class_weight=balanced, C=0.01 ............
[CV]  tol=0.0001, gamma=0.1, class_weight=balanced, C=0.01, score=0.098, total=   0.5s
[CV] tol=0.1, gamma=10000000.0, class_weight=balanced, C=0.01 ........
[CV]  tol=0.1, gamma=10000000.0, class_weight=balanced, C=0.01, score=0.098, total=   0.4s
[CV] tol=0.001, gamma=100.0, class_weight=None, C=100.0 ..............
[CV]  tol=0.001, gamma=100.0, class_weight=None, C=100.0, score=0.102, total=   0.4s
dask-gateway-hfjr8

This was the surviving pod after ~43 minutes, when the others were killed. It had some events associated with it.

Events

Events:
  Type     Reason           Age                   From                                                 Message
  ----     ------           ----                  ----                                                 -------
  Normal   Scheduled        52m                   default-scheduler                                    Successfully assigned dask-gateway/dask-gateway-hfjr8 to gke-ds-platform-dask-workers-501ae6d4-nf6w
  Normal   Pulled           52m                   kubelet, gke-ds-platform-dask-workers-501ae6d4-nf6w  Container image "eu.gcr.io/ds-platform/dsp-user-environment:d44defb" already present on machine
  Normal   Created          52m                   kubelet, gke-ds-platform-dask-workers-501ae6d4-nf6w  Created container dask-worker
  Normal   Started          52m                   kubelet, gke-ds-platform-dask-workers-501ae6d4-nf6w  Started container dask-worker
  Warning  NetworkNotReady  8m55s (x6 over 9m5s)  kubelet, gke-ds-platform-dask-workers-501ae6d4-nf6w  network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
  Normal   Pulling          8m53s                 kubelet, gke-ds-platform-dask-workers-501ae6d4-nf6w  Pulling image "eu.gcr.io/ds-platform/dsp-user-environment:d44defb"
  Normal   Pulled           4m54s                 kubelet, gke-ds-platform-dask-workers-501ae6d4-nf6w  Successfully pulled image "eu.gcr.io/ds-platform/dsp-user-environment:d44defb"
  Normal   Created          4m32s                 kubelet, gke-ds-platform-dask-workers-501ae6d4-nf6w  Created container dask-worker
  Normal   Started          4m32s                 kubelet, gke-ds-platform-dask-workers-501ae6d4-nf6w  Started container dask-worker

New logs

distributed.nanny - INFO -         Start Nanny at: 'tls://10.72.7.2:41213'
distributed.worker - INFO -       Start worker at:      tls://10.72.7.2:46851
distributed.worker - INFO -          Listening to:      tls://10.72.7.2:46851
distributed.worker - INFO -          dashboard at:            10.72.7.2:40351
distributed.worker - INFO - Waiting to connect to: tls://dask-gateway-a18bf505dd3d40f0a3178526278c86bb.dask-gateway:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                    3.44 GB
distributed.worker - INFO -       Local Directory: /home/jovyan/worker-mtc6ezkk
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tls://dask-gateway-a18bf505dd3d40f0a3178526278c86bb.dask-gateway:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection

Initial logs

distributed.nanny - INFO -         Start Nanny at: 'tls://10.72.7.5:45977'
distributed.worker - INFO -       Start worker at:      tls://10.72.7.5:37253
distributed.worker - INFO -          Listening to:      tls://10.72.7.5:37253
distributed.worker - INFO -          dashboard at:            10.72.7.5:34891
distributed.worker - INFO - Waiting to connect to: tls://dask-gateway-a18bf505dd3d40f0a3178526278c86bb.dask-gateway:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                    3.44 GB
distributed.worker - INFO -       Local Directory: /home/jovyan/worker-3dgxrsmt
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tls://dask-gateway-a18bf505dd3d40f0a3178526278c86bb.dask-gateway:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
[CV] tol=0.1, gamma=0.001, class_weight=balanced, C=1.0 ..............
[CV]  tol=0.1, gamma=0.001, class_weight=balanced, C=1.0, score=0.968, total=   0.1s
[CV] tol=0.001, gamma=1e-06, class_weight=balanced, C=100.0 ..........
[CV]  tol=0.001, gamma=1e-06, class_weight=balanced, C=100.0, score=0.952, total=   0.2s
[CV] tol=0.001, gamma=1e-06, class_weight=balanced, C=100.0 ..........
[CV]  tol=0.001, gamma=1e-06, class_weight=balanced, C=100.0, score=0.943, total=   0.2s
[CV] tol=0.01, gamma=10000000.0, class_weight=None, C=1000000.0 ......
[CV]  tol=0.01, gamma=10000000.0, class_weight=None, C=1000000.0, score=0.102, total=   0.4s
[CV] tol=0.01, gamma=10000000.0, class_weight=None, C=1000000.0 ......
[CV]  tol=0.01, gamma=10000000.0, class_weight=None, C=1000000.0, score=0.102, total=   0.5s
[CV] tol=0.01, gamma=10000000.0, class_weight=None, C=1000000.0 ......
[CV]  tol=0.01, gamma=10000000.0, class_weight=None, C=1000000.0, score=0.100, total=   0.4s
[CV] tol=0.001, gamma=0.0001, class_weight=balanced, C=1000.0 ........
[CV]  tol=0.001, gamma=0.0001, class_weight=balanced, C=1000.0, score=0.952, total=   0.1s
[CV] tol=0.001, gamma=0.0001, class_weight=balanced, C=1000.0 ........
[CV]  tol=0.001, gamma=0.0001, class_weight=balanced, C=1000.0, score=0.963, total=   0.1s
[CV] tol=0.001, gamma=0.0001, class_weight=balanced, C=1000.0 ........
[CV]  tol=0.001, gamma=0.0001, class_weight=balanced, C=1000.0, score=0.952, total=   0.1s
[CV] tol=0.1, gamma=1e-08, class_weight=None, C=1000000.0 ............
[CV]  tol=0.1, gamma=1e-08, class_weight=None, C=1000000.0, score=0.937, total=   0.1s
[CV] tol=0.1, gamma=1e-08, class_weight=None, C=1000000.0 ............
[CV]  tol=0.1, gamma=1e-08, class_weight=None, C=1000000.0, score=0.957, total=   0.1s
[CV] tol=0.1, gamma=1e-08, class_weight=None, C=1000000.0 ............
[CV]  tol=0.1, gamma=1e-08, class_weight=None, C=1000000.0, score=0.938, total=   0.1s
[CV] tol=0.01, gamma=0.0001, class_weight=None, C=1000000.0 ..........
[CV]  tol=0.01, gamma=0.0001, class_weight=None, C=1000000.0, score=0.952, total=   0.1s
[CV] tol=0.01, gamma=0.0001, class_weight=None, C=1000000.0 ..........
[CV]  tol=0.01, gamma=0.0001, class_weight=None, C=1000000.0, score=0.963, total=   0.1s
[CV] tol=0.01, gamma=0.0001, class_weight=None, C=1000000.0 ..........
[CV]  tol=0.01, gamma=0.0001, class_weight=None, C=1000000.0, score=0.952, total=   0.1s
[CV] tol=0.1, gamma=10000000.0, class_weight=balanced, C=1000.0 ......
[CV]  tol=0.1, gamma=10000000.0, class_weight=balanced, C=1000.0, score=0.102, total=   0.4s
[CV] tol=0.01, gamma=1e-06, class_weight=None, C=100000.0 ............
[CV]  tol=0.01, gamma=1e-06, class_weight=None, C=100000.0, score=0.938, total=   0.1s
[CV] tol=0.01, gamma=1e-06, class_weight=None, C=100000.0 ............
[CV]  tol=0.01, gamma=1e-06, class_weight=None, C=100000.0, score=0.957, total=   0.1s
[CV] tol=0.01, gamma=1e-06, class_weight=None, C=100000.0 ............
[CV]  tol=0.01, gamma=1e-06, class_weight=None, C=100000.0, score=0.937, total=   0.1s
[CV] tol=0.0001, gamma=0.01, class_weight=None, C=0.001 ..............
[CV]  tol=0.0001, gamma=0.01, class_weight=None, C=0.001, score=0.102, total=   0.5s
[CV] tol=0.0001, gamma=0.01, class_weight=None, C=0.001 ..............
[CV]  tol=0.0001, gamma=0.01, class_weight=None, C=0.001, score=0.102, total=   0.5s
[CV] tol=0.0001, gamma=0.01, class_weight=None, C=0.001 ..............
[CV]  tol=0.0001, gamma=0.01, class_weight=None, C=0.001, score=0.164, total=   0.5s
[CV] tol=0.0001, gamma=1e-06, class_weight=None, C=0.1 ...............
[CV]  tol=0.0001, gamma=1e-06, class_weight=None, C=0.1, score=0.102, total=   0.5s
[CV] tol=0.0001, gamma=1e-06, class_weight=None, C=0.1 ...............
[CV]  tol=0.0001, gamma=1e-06, class_weight=None, C=0.1, score=0.102, total=   0.5s
[CV] tol=0.0001, gamma=1e-06, class_weight=None, C=0.1 ...............
[CV]  tol=0.0001, gamma=1e-06, class_weight=None, C=0.1, score=0.292, total=   0.5s
[CV] tol=0.0001, gamma=0.1, class_weight=balanced, C=1000.0 ..........
[CV]  tol=0.0001, gamma=0.1, class_weight=balanced, C=1000.0, score=0.102, total=   0.5s
[CV] tol=0.01, gamma=1e-08, class_weight=balanced, C=0.1 .............
[CV]  tol=0.01, gamma=1e-08, class_weight=balanced, C=0.1, score=0.098, total=   0.5s
[CV] tol=0.01, gamma=1e-08, class_weight=balanced, C=0.1 .............
[CV]  tol=0.01, gamma=1e-08, class_weight=balanced, C=0.1, score=0.197, total=   0.5s
[CV] tol=0.01, gamma=10000.0, class_weight=None, C=0.1 ...............
[CV]  tol=0.01, gamma=10000.0, class_weight=None, C=0.1, score=0.102, total=   0.4s
[CV] tol=0.01, gamma=10000.0, class_weight=None, C=0.1 ...............
[CV]  tol=0.01, gamma=10000.0, class_weight=None, C=0.1, score=0.102, total=   0.4s
[CV] tol=0.01, gamma=10000.0, class_weight=None, C=0.1 ...............
[CV]  tol=0.01, gamma=10000.0, class_weight=None, C=0.1, score=0.100, total=   0.4s
[CV] tol=0.0001, gamma=10.0, class_weight=balanced, C=1000.0 .........
[CV]  tol=0.0001, gamma=10.0, class_weight=balanced, C=1000.0, score=0.102, total=   0.4s
[CV] tol=0.0001, gamma=10.0, class_weight=balanced, C=1000.0 .........
[CV]  tol=0.0001, gamma=10.0, class_weight=balanced, C=1000.0, score=0.102, total=   0.4s
[CV] tol=0.0001, gamma=10.0, class_weight=balanced, C=1000.0 .........
[CV]  tol=0.0001, gamma=10.0, class_weight=balanced, C=1000.0, score=0.100, total=   0.4s
[CV] tol=0.01, gamma=1000000.0, class_weight=balanced, C=0.0001 ......
[CV]  tol=0.01, gamma=1000000.0, class_weight=balanced, C=0.0001, score=0.100, total=   0.4s
[CV] tol=0.01, gamma=1000000.0, class_weight=balanced, C=0.0001 ......
[CV]  tol=0.01, gamma=1000000.0, class_weight=balanced, C=0.0001, score=0.100, total=   0.4s
[CV] tol=0.01, gamma=100000.0, class_weight=None, C=100.0 ............
[CV]  tol=0.01, gamma=100000.0, class_weight=None, C=100.0, score=0.102, total=   0.4s
[CV] tol=0.01, gamma=100000.0, class_weight=None, C=100.0 ............
[CV]  tol=0.01, gamma=100000.0, class_weight=None, C=100.0, score=0.100, total=   0.4s
[CV] tol=0.001, gamma=1000000.0, class_weight=balanced, C=0.01 .......
[CV]  tol=0.001, gamma=1000000.0, class_weight=balanced, C=0.01, score=0.098, total=   0.4s
[CV] tol=0.001, gamma=1000000.0, class_weight=balanced, C=0.01 .......
[CV]  tol=0.001, gamma=1000000.0, class_weight=balanced, C=0.01, score=0.098, total=   0.4s
[CV] tol=0.001, gamma=1000000.0, class_weight=balanced, C=0.01 .......
[CV]  tol=0.001, gamma=1000000.0, class_weight=balanced, C=0.01, score=0.098, total=   0.4s
[CV] tol=0.001, gamma=1e-06, class_weight=balanced, C=1.0 ............
[CV]  tol=0.001, gamma=1e-06, class_weight=balanced, C=1.0, score=0.098, total=   0.5s
[CV] tol=0.001, gamma=1e-06, class_weight=balanced, C=1.0 ............
[CV]  tol=0.001, gamma=1e-06, class_weight=balanced, C=1.0, score=0.098, total=   0.5s
[CV] tol=0.001, gamma=1e-06, class_weight=balanced, C=1.0 ............
[CV]  tol=0.001, gamma=1e-06, class_weight=balanced, C=1.0, score=0.197, total=   0.5s
[CV] tol=0.001, gamma=1e-08, class_weight=balanced, C=1000000.0 ......
[CV]  tol=0.001, gamma=1e-08, class_weight=balanced, C=1000000.0, score=0.937, total=   0.1s
[CV] tol=0.001, gamma=1e-08, class_weight=balanced, C=1000000.0 ......
[CV]  tol=0.001, gamma=1e-08, class_weight=balanced, C=1000000.0, score=0.960, total=   0.1s
[CV] tol=0.001, gamma=100000.0, class_weight=None, C=0.001 ...........
[CV]  tol=0.001, gamma=100000.0, class_weight=None, C=0.001, score=0.102, total=   0.4s
[CV] tol=0.001, gamma=100000.0, class_weight=None, C=0.001 ...........
[CV]  tol=0.001, gamma=100000.0, class_weight=None, C=0.001, score=0.102, total=   0.4s
[CV] tol=0.001, gamma=100000.0, class_weight=None, C=0.001 ...........
[CV]  tol=0.001, gamma=100000.0, class_weight=None, C=0.001, score=0.100, total=   0.4s
[CV] tol=0.01, gamma=1e-07, class_weight=balanced, C=10000.0 .........
[CV]  tol=0.01, gamma=1e-07, class_weight=balanced, C=10000.0, score=0.943, total=   0.1s
[CV] tol=0.01, gamma=1e-07, class_weight=balanced, C=10000.0 .........
[CV]  tol=0.01, gamma=1e-07, class_weight=balanced, C=10000.0, score=0.967, total=   0.1s
dask-gateway-qv5rv

distributed.nanny - INFO -         Start Nanny at: 'tls://10.72.7.6:41247'
distributed.worker - INFO -       Start worker at:      tls://10.72.7.6:35359
distributed.worker - INFO -          Listening to:      tls://10.72.7.6:35359
distributed.worker - INFO -          dashboard at:            10.72.7.6:37121
distributed.worker - INFO - Waiting to connect to: tls://dask-gateway-a18bf505dd3d40f0a3178526278c86bb.dask-gateway:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                    3.44 GB
distributed.worker - INFO -       Local Directory: /home/jovyan/worker-fgp5axq_
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tls://dask-gateway-a18bf505dd3d40f0a3178526278c86bb.dask-gateway:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO - Stopping worker at tls://10.72.7.6:35359
distributed.nanny - INFO - Worker closed
[CV] tol=0.1, gamma=10000000.0, class_weight=balanced, C=1000.0 ......
[CV]  tol=0.1, gamma=10000000.0, class_weight=balanced, C=1000.0, score=0.100, total=   0.4s
[CV] tol=0.1, gamma=0.001, class_weight=balanced, C=1.0 ..............
[CV]  tol=0.1, gamma=0.001, class_weight=balanced, C=1.0, score=0.982, total=   0.1s
[CV] tol=0.1, gamma=10000000.0, class_weight=balanced, C=0.01 ........
[CV]  tol=0.1, gamma=10000000.0, class_weight=balanced, C=0.01, score=0.098, total=   0.4s
[CV] tol=0.01, gamma=1000000.0, class_weight=balanced, C=0.0001 ......
[CV]  tol=0.01, gamma=1000000.0, class_weight=balanced, C=0.0001, score=0.098, total=   0.4s
[CV] tol=0.01, gamma=0.1, class_weight=balanced, C=10000.0 ...........
[CV]  tol=0.01, gamma=0.1, class_weight=balanced, C=10000.0, score=0.102, total=   0.5s
[CV] tol=0.0001, gamma=0.1, class_weight=balanced, C=1000.0 ..........
[CV]  tol=0.0001, gamma=0.1, class_weight=balanced, C=1000.0, score=0.102, total=   0.5s
[CV] tol=0.001, gamma=10000.0, class_weight=None, C=100.0 ............
[CV]  tol=0.001, gamma=10000.0, class_weight=None, C=100.0, score=0.102, total=   0.4s
[CV] tol=0.0001, gamma=0.001, class_weight=balanced, C=10.0 ..........
[CV]  tol=0.0001, gamma=0.001, class_weight=balanced, C=10.0, score=0.975, total=   0.2s
[CV] tol=0.01, gamma=0.001, class_weight=balanced, C=1e-06 ...........
[CV]  tol=0.01, gamma=0.001, class_weight=balanced, C=1e-06, score=0.098, total=   0.5s
[CV] tol=0.001, gamma=10000.0, class_weight=None, C=0.01 .............
[CV]  tol=0.001, gamma=10000.0, class_weight=None, C=0.01, score=0.102, total=   0.4s
[CV] tol=0.1, gamma=0.001, class_weight=balanced, C=100000.0 .........
[CV]  tol=0.1, gamma=0.001, class_weight=balanced, C=100000.0, score=0.982, total=   0.1s
[CV] tol=0.001, gamma=10000.0, class_weight=None, C=100.0 ............
[CV]  tol=0.001, gamma=10000.0, class_weight=None, C=100.0, score=0.100, total=   0.4s
[CV] tol=0.0001, gamma=0.001, class_weight=balanced, C=10.0 ..........
[CV]  tol=0.0001, gamma=0.001, class_weight=balanced, C=10.0, score=0.972, total=   0.2s
[CV] tol=0.001, gamma=1e-06, class_weight=balanced, C=10.0 ...........
[CV]  tol=0.001, gamma=1e-06, class_weight=balanced, C=10.0, score=0.873, total=   0.4s
[CV] tol=0.001, gamma=10000.0, class_weight=None, C=0.01 .............
[CV]  tol=0.001, gamma=10000.0, class_weight=None, C=0.01, score=0.100, total=   0.4s
[CV] tol=0.1, gamma=10000000.0, class_weight=None, C=10000.0 .........
[CV]  tol=0.1, gamma=10000000.0, class_weight=None, C=10000.0, score=0.102, total=   0.4s
[CV] tol=0.001, gamma=100000.0, class_weight=None, C=10.0 ............
[CV]  tol=0.001, gamma=100000.0, class_weight=None, C=10.0, score=0.102, total=   0.4s
[CV] tol=0.001, gamma=1e-06, class_weight=balanced, C=10.0 ...........
[CV]  tol=0.001, gamma=1e-06, class_weight=balanced, C=10.0, score=0.898, total=   0.4s
[CV] tol=0.0001, gamma=0.0001, class_weight=balanced, C=1000.0 .......
[CV]  tol=0.0001, gamma=0.0001, class_weight=balanced, C=1000.0, score=0.963, total=   0.1s
[CV] tol=0.1, gamma=10000000.0, class_weight=None, C=10000.0 .........
[CV]  tol=0.1, gamma=10000000.0, class_weight=None, C=10000.0, score=0.100, total=   0.4s
[CV] tol=0.001, gamma=100000.0, class_weight=None, C=10.0 ............
[CV]  tol=0.001, gamma=100000.0, class_weight=None, C=10.0, score=0.100, total=   0.4s
[CV] tol=0.1, gamma=1e-05, class_weight=None, C=0.0001 ...............
[CV]  tol=0.1, gamma=1e-05, class_weight=None, C=0.0001, score=0.102, total=   0.5s
[CV] tol=0.0001, gamma=0.01, class_weight=balanced, C=1e-05 ..........
[CV]  tol=0.0001, gamma=0.01, class_weight=balanced, C=1e-05, score=0.132, total=   0.5s
[CV] tol=0.0001, gamma=10.0, class_weight=balanced, C=100.0 ..........
[CV]  tol=0.0001, gamma=10.0, class_weight=balanced, C=100.0, score=0.102, total=   0.4s
[CV] tol=0.01, gamma=10000000.0, class_weight=None, C=100000.0 .......
[CV]  tol=0.01, gamma=10000000.0, class_weight=None, C=100000.0, score=0.102, total=   0.4s
[CV] tol=0.0001, gamma=10.0, class_weight=balanced, C=100.0 ..........
[CV]  tol=0.0001, gamma=10.0, class_weight=balanced, C=100.0, score=0.100, total=   0.4s
[CV] tol=0.01, gamma=10000000.0, class_weight=None, C=100000.0 .......
[CV]  tol=0.01, gamma=10000000.0, class_weight=None, C=100000.0, score=0.100, total=   0.4s
[CV] tol=0.0001, gamma=1e-05, class_weight=None, C=0.001 .............
[CV]  tol=0.0001, gamma=1e-05, class_weight=None, C=0.001, score=0.102, total=   0.5s
[CV] tol=0.0001, gamma=1000000.0, class_weight=None, C=100000.0 ......
[CV]  tol=0.0001, gamma=1000000.0, class_weight=None, C=100000.0, score=0.102, total=   0.5s
[CV] tol=0.01, gamma=1.0, class_weight=balanced, C=1e-06 .............
[CV]  tol=0.01, gamma=1.0, class_weight=balanced, C=1e-06, score=0.199, total=   0.4s
[CV] tol=0.01, gamma=1e-06, class_weight=balanced, C=0.0001 ..........
[CV]  tol=0.01, gamma=1e-06, class_weight=balanced, C=0.0001, score=0.199, total=   0.5s
[CV] tol=0.001, gamma=100000.0, class_weight=None, C=1e-06 ...........
[CV]  tol=0.001, gamma=100000.0, class_weight=None, C=1e-06, score=0.102, total=   0.4s
[CV] tol=0.01, gamma=1e-06, class_weight=balanced, C=0.0001 ..........
[CV]  tol=0.01, gamma=1e-06, class_weight=balanced, C=0.0001, score=0.197, total=   0.5s
[CV] tol=0.001, gamma=100000.0, class_weight=None, C=1e-06 ...........
[CV]  tol=0.001, gamma=100000.0, class_weight=None, C=1e-06, score=0.102, total=   0.4s
[CV] tol=0.0001, gamma=1e-07, class_weight=None, C=1e-06 .............
[CV]  tol=0.0001, gamma=1e-07, class_weight=None, C=1e-06, score=0.102, total=   0.5s
[CV] tol=0.01, gamma=10.0, class_weight=balanced, C=10.0 .............
[CV]  tol=0.01, gamma=10.0, class_weight=balanced, C=10.0, score=0.102, total=   0.4s
[CV] tol=0.001, gamma=0.1, class_weight=balanced, C=1000.0 ...........
[CV]  tol=0.001, gamma=0.1, class_weight=balanced, C=1000.0, score=0.102, total=   0.5s
[CV] tol=0.1, gamma=1e-06, class_weight=balanced, C=1000.0 ...........
[CV]  tol=0.1, gamma=1e-06, class_weight=balanced, C=1000.0, score=0.965, total=   0.1s
[CV] tol=0.1, gamma=10.0, class_weight=None, C=0.001 .................
[CV]  tol=0.1, gamma=10.0, class_weight=None, C=0.001, score=0.102, total=   0.4s
[CV] tol=0.0001, gamma=1e-07, class_weight=None, C=1e-06 .............
[CV]  tol=0.0001, gamma=1e-07, class_weight=None, C=1e-06, score=0.292, total=   0.5s
[CV] tol=0.01, gamma=1e-08, class_weight=balanced, C=0.1 .............
[CV]  tol=0.01, gamma=1e-08, class_weight=balanced, C=0.1, score=0.098, total=   0.5s
[CV] tol=0.01, gamma=100000.0, class_weight=None, C=100.0 ............
[CV]  tol=0.01, gamma=100000.0, class_weight=None, C=100.0, score=0.102, total=   0.4s

Client side code

from dask_gateway import Gateway
gateway = Gateway()

# optional
options = gateway.cluster_options()
options
cluster = gateway.new_cluster(options)
cluster

# WIDGET:
# I set adaptive scaling from min 1 to max 3
client = cluster.get_client()
import numpy as np
import joblib
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

digits = load_digits()

param_space = {
    'C': np.logspace(-6, 6, 13),
    'gamma': np.logspace(-8, 8, 17),
    'tol': np.logspace(-4, -1, 4),
    'class_weight': [None, 'balanced'],
}

model = SVC(kernel='rbf')
search = RandomizedSearchCV(model, param_space, cv=3, n_iter=50, verbose=10)
%%time
# with Dask workers
with joblib.parallel_backend('dask'):
    search.fit(digits.data, digits.target)
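
The symptom described above boils down to a consistency check: pods that exist in Kubernetes but are no longer registered with the scheduler. A minimal, hypothetical helper to spot such orphans (not part of dask-gateway; the inputs are illustrative, e.g. worker names from `client.scheduler_info()["workers"]` versus pod names from `kubectl`):

```python
def find_orphaned_workers(registered_workers, pod_names):
    """Return pod names present in Kubernetes but unknown to the scheduler.

    registered_workers: worker names the scheduler reports;
    pod_names: worker pods listed by Kubernetes.
    Both inputs are illustrative, not a dask-gateway API.
    """
    return sorted(set(pod_names) - set(registered_workers))

# Example mirroring step 5 of the report: 6 pods, scheduler sees only 1 worker.
# (all but the first two pod names below are made up)
pods = ["dask-gateway-hfjr8", "dask-gateway-qv5rv", "dask-gateway-aaaaa",
        "dask-gateway-bbbbb", "dask-gateway-ccccc", "dask-gateway-ddddd"]
registered = ["dask-gateway-qv5rv"]
orphans = find_orphaned_workers(registered, pods)  # the 5 unusable pods
```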

consideRatio avatar May 04 '20 20:05 consideRatio

Hmmm, well that's not right. We have tests for this that pass locally, but perhaps something odd is going on. Thanks for the logs and gif - it looks like the scheduler thinks the workers have left, but the workers didn't actually shut down like they should have (a bug in distributed). Did you add any sidecar containers to the worker pods? Or are they running with only the standard dask-worker container?

jcrist avatar May 04 '20 20:05 jcrist

@jcrist aha okay, they run with a large custom image with the versions reported above. It is the same image as my JupyterHub user environments.

I'm going to update the logs to contain the latest events: after ~43 minutes of inactivity, three pods went away and one new pod came up.

consideRatio avatar May 04 '20 20:05 consideRatio

aha okay, they run with a large custom image with the versions reported above. It is the same image as my JupyterHub user environments.

Sorry, by "sidecar container" I meant: did you add any extra container definitions to the pod? Pods running multiple containers are managed a bit differently by the controller than pods running only the standard single container. Both should work, but how I'd debug would be determined by that.

I'm going to update the logs to contain the latest events: after ~43 minutes of inactivity, three pods went away and one new pod came up.

Interesting. Looks like the worker pods restarted (but maybe not right away? Your logs aren't 100% clear to me).

If the workers are shutting down but not exiting cleanly, I wonder whether changing the restartPolicy for worker pods to Never would fix your issue. The following config should do this (untested):

gateway:
  backend:
    worker:
      extraPodConfig:
        restartPolicy: Never

jcrist avatar May 04 '20 20:05 jcrist

@jcrist no sidecar containers for anything!

While I can add a restartPolicy: Never, wouldn't an event about a container restarting have shown up in the events of the worker pod, though? Or would it? Yes, it would, I'm now confident; it would say something like this:

25m         Normal    Pulling           pod/dask-gateway-hfjr8   Pulling image "eu.gcr.io/ds-platform/dsp-user-environment:d44defb"
21m         Normal    Pulled            pod/dask-gateway-hfjr8   Successfully pulled image "eu.gcr.io/ds-platform/dsp-user-environment:d44defb"
21m         Normal    Created           pod/dask-gateway-hfjr8   Created container dask-worker
21m         Normal    Started           pod/dask-gateway-hfjr8   Started container dask-worker

I only saw that once, and that was when the 3 pods became 1 ~43 min in, and the one pod still running experienced a restart.

Or hmm... I only checked the events of one worker pod.

I'll try adding the restartPolicy to all pods no matter what and repeat the experiment.

consideRatio avatar May 04 '20 20:05 consideRatio

Ah, just noticed you'd included the pod events for one of the workers. Yeah, you should see events for a container restart.

I only saw that once, and that was when the 3 pods became 1 ~43 min in, and the one pod still running experienced a restart.

To clarify, did you see the two pods die and the remaining pod restart all at the same time (43 min in)?

jcrist avatar May 04 '20 20:05 jcrist

A confusing point for me: the dask-gateway-hfjr8 worker pod is old, as if it was never removed, yet it has obviously restarted its container, since the container was started later than the pod was created and the logs from the old run disappeared. But it still says restartCount: 0. Perhaps the pod was modified with a kubectl apply or similar?

# kubectl get pod -o yaml dask-gateway-hfjr8
apiVersion: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/podIP: 10.72.7.2/32
  creationTimestamp: "2020-05-04T19:42:45Z"
  generateName: dask-gateway-
  labels:
    app.kubernetes.io/component: dask-worker
    app.kubernetes.io/name: dask-gateway
    app.kubernetes.io/version: 0.7.1
    gateway.dask.org/cluster: a18bf505dd3d40f0a3178526278c86bb
    gateway.dask.org/instance: dask-gateway
  name: dask-gateway-hfjr8
  namespace: dask-gateway
  ownerReferences:
  - apiVersion: v1
    kind: Pod
    name: dask-gateway-a18bf505dd3d40f0a3178526278c86bb
    uid: 2887fc05-14ce-4bd5-b559-475fc903aaa1
  resourceVersion: "196048650"
  selfLink: /api/v1/namespaces/dask-gateway/pods/dask-gateway-hfjr8
  uid: 71bdc5c4-5fbd-4745-8026-bbbedd90f1fc
spec:
  automountServiceAccountToken: false
  containers:
  - args:
    - dask-gateway-worker
    - --nthreads
    - "1"
    - --memory-limit
    - "3435973836"
    - --scheduler-address
    - tls://dask-gateway-a18bf505dd3d40f0a3178526278c86bb.dask-gateway:8786
    env:
    - name: DASK_GATEWAY_API_URL
      value: http://api-dask-gateway.dask-gateway:8000/api
    - name: DASK_GATEWAY_CLUSTER_NAME
      value: dask-gateway.a18bf505dd3d40f0a3178526278c86bb
    - name: DASK_GATEWAY_API_TOKEN
      value: /etc/dask-credentials/api-token
    - name: DASK_GATEWAY_TLS_CERT
      value: /etc/dask-credentials/dask.crt
    - name: DASK_GATEWAY_TLS_KEY
      value: /etc/dask-credentials/dask.pem
    - name: DASK_GATEWAY_WORKER_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    image: eu.gcr.io/ds-platform/dsp-user-environment:d44defb
    imagePullPolicy: IfNotPresent
    name: dask-worker
    ports:
    - containerPort: 8786
      name: scheduler
      protocol: TCP
    - containerPort: 8787
      name: dashboard
      protocol: TCP
    - containerPort: 8788
      name: api
      protocol: TCP
    resources:
      limits:
        cpu: "1"
        memory: "3435973836"
      requests:
        cpu: 100m
        memory: "3"
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /etc/dask-credentials/
      name: dask-credentials
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: gke-ds-platform-dask-workers-501ae6d4-nf6w
  nodeSelector:
    node-purpose: dask-workers
  priority: 0
  restartPolicy: OnFailure
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: dedicated
    operator: Equal
    value: dask-workers
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: dask-credentials
    secret:
      defaultMode: 420
      secretName: dask-gateway-a18bf505dd3d40f0a3178526278c86bb
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-05-04T19:42:45Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2020-05-04T20:30:52Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2020-05-04T20:30:52Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2020-05-04T19:42:45Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://88992fa88a7114216f623df73d0dd1a89e94769bd2399beba2ab923a2ff42728
    image: eu.gcr.io/ds-platform/dsp-user-environment:d44defb
    imageID: docker-pullable://eu.gcr.io/ds-platform/dsp-user-environment@sha256:39f98ee0a35551e0e7d4ad9bf26ac4edf00bcbd3a9efa4d74e764f051b32e924
    lastState: {}
    name: dask-worker
    ready: true
    restartCount: 0
    state:
      running:
        startedAt: "2020-05-04T20:30:51Z"
  hostIP: 10.68.0.30
  phase: Running
  podIP: 10.72.7.2
  qosClass: Burstable
  startTime: "2020-05-04T19:42:45Z"

consideRatio avatar May 04 '20 21:05 consideRatio

But... still it says restartCount: 0 on it. Perhaps the pod was modified with a kubectl apply or similar?

Not sure? Dask-gateway doesn't handle any of that, so if the pod controller didn't reset it to 0 then something else might have. We only observe pods, not modify them.

jcrist avatar May 04 '20 21:05 jcrist

@jcrist, ah, so the scheduler creates the pods, and it is dask.distributed that owns the scheduler? Ah.

consideRatio avatar May 04 '20 21:05 consideRatio

To clarify, did you see the two pods die and the remaining pod restart all at the same time (43 min in)?

Yepp! Or at least: the three worker pods were alive and well 38 minutes in, and then 43 minutes in I saw a single worker pod in a ContainerCreating status.

consideRatio avatar May 04 '20 21:05 consideRatio

According to https://stackoverflow.com/a/59658670, the restartCount is computed according to the number of dead containers that have yet to be cleaned up. If the kubelet is cleaning up containers, it will cause that number to be reset.

@jcrist, ah, so the scheduler creates the pods, and it is dask.distributed that owns the scheduler? Ah.

No, sorry. dask-gateway creates and manages the scheduler and worker pods, but once a pod is created we only observe it; we don't ever update a created pod (e.g. for a restart). The k8s pod controller is responsible for handling that. dask.distributed doesn't know or do anything with k8s.

jcrist avatar May 04 '20 21:05 jcrist

According to https://stackoverflow.com/a/59658670, the restartCount is computed according to the number of dead containers that have yet to be cleaned up. If the kubelet is cleaning up containers, it will cause that number to be reset.

Nice catch! Ah, well that was not very intuitive given the name they chose for the field.
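That computed-from-dead-containers behaviour can be sketched like this (my reading of the linked Stack Overflow answer, illustrative Python only, not kubelet source):

```python
# Sketch: restartCount derived from the dead containers still on disk for a
# pod. Once the kubelet garbage-collects them, the count appears to "reset".

def restart_count(containers):
    """Count dead containers for the pod that have not been cleaned up yet."""
    return sum(1 for c in containers if c["dead"] and not c["cleaned_up"])

history = [
    {"dead": True, "cleaned_up": False},   # the restarted container's old instance
    {"dead": False, "cleaned_up": False},  # the currently running container
]
print(restart_count(history))        # -> 1
history[0]["cleaned_up"] = True      # kubelet garbage-collects the old container
print(restart_count(history))        # -> 0, i.e. the count looks "reset"
```

This would explain seeing restartCount: 0 on a pod that did restart, without anything having modified the pod.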

No, sorry. dask-gateway creates and manages the scheduler and worker pods, but once a pod is created we only observe it; we don't ever update a created pod (e.g. for a restart). The k8s pod controller is responsible for handling that. dask.distributed doesn't know or do anything with k8s.

Ah!

My current understanding of the situation:

  1. dask-gateway's api-server creates a DaskCluster resource, and the controller adds worker pods and a scheduler pod for that DaskCluster resource; the api-server reads information from the DaskCluster resource and reports back to the dask-gateway client speaking with it?
  2. If a worker pod dies, it died from a shutdown, not because something issued a kubectl delete-equivalent k8s API request.
  3. The controller acting on the DaskCluster resource will attempt to add worker pods if needed. But if pods linger after a scale-down, the next scale-up will probably not create new worker pods, because the controller still sees the lingering pods even though their connections to the scheduler pod may have been severed.
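The third point can be sketched as a naive count-based reconciliation loop (illustrative Python only, not dask-gateway's actual controller code; pod names are hypothetical): a controller that only counts Running pods cannot tell a live worker from one the scheduler has already retired.

```python
# Sketch of a naive replica-count reconciler: it creates pods to reach the
# desired count and never deletes, mirroring the behaviour described above.

def reconcile(desired_replicas, pods):
    """Return names of new pods to create to reach desired_replicas."""
    active = [p for p in pods if p["phase"] == "Running"]
    missing = desired_replicas - len(active)
    return [f"dask-gateway-worker-{i}" for i in range(max(missing, 0))]

# Two workers linger after the scheduler retired them: they still count as
# Running, so scaling back up to 2 replicas creates nothing, even though the
# scheduler believes it has 0 workers.
lingering = [
    {"name": "dask-gateway-fc4ds", "phase": "Running"},  # retired by scheduler
    {"name": "dask-gateway-jjdpm", "phase": "Running"},  # retired by scheduler
]
print(reconcile(2, lingering))  # -> [] : no new pods are created
```

Under this model, the fix has to involve either the lingering worker process actually exiting (so the pod leaves Running) or the controller deleting pods the scheduler no longer wants.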

consideRatio avatar May 04 '20 21:05 consideRatio

A PodSpec has a restartPolicy field with possible values Always, OnFailure, and Never. The default value is Always. restartPolicy applies to all Containers in the Pod.

I observe that the worker pod currently has a policy of restartPolicy: OnFailure, though.


I repeated the experiment with a restartPolicy of Never, and so far there is no difference, but I expect one at the event occurring ~43 minutes in (if the timing is similar), because at that point we had a container restart due to a failure, I think, while two pods had container completions without failure.


It seems like this is the trigger that is meant to make the workers die, but they don't. Are they meant to die from this? Oh.... wait...

```
distributed.scheduler - INFO - Retire workers {<Worker 'tls://10.72.7.5:46847', name: dask-gateway-fc4ds, memory: 0, processing: 0>, <Worker 'tls://10.72.7.4:45735', name: dask-gateway-jjdpm, memory: 0, processing: 0>}
distributed.scheduler - INFO - Closing worker tls://10.72.7.5:46847
distributed.scheduler - INFO - Remove worker <Worker 'tls://10.72.7.5:46847', name: dask-gateway-fc4ds, memory: 0, processing: 0>
distributed.core - INFO - Removing comms to tls://10.72.7.5:46847
distributed.scheduler - INFO - Closing worker tls://10.72.7.4:45735
distributed.scheduler - INFO - Remove worker <Worker 'tls://10.72.7.4:45735', name: dask-gateway-jjdpm, memory: 0, processing: 0>
distributed.core - INFO - Removing comms to tls://10.72.7.4:45735
```

Perhaps it matters that I'm using this entrypoint in my image?

```dockerfile
ENTRYPOINT ["tini", "-g", "--"]
```

UPDATE: No, because the dask-worker image also does that, as can be seen in https://github.com/dask/dask-gateway/blob/master/resources/helm/images/dask-gateway-base/Dockerfile.


Only the container's args is overridden; args in k8s lingo maps to CMD in Dockerfile lingo (command in k8s maps to ENTRYPOINT in Dockerfiles).

```
    Args:
      dask-gateway-worker
      --nthreads
      1
      --memory-limit
      3435973836
      --scheduler-address
      tls://dask-gateway-0c66d56247e044259ab211cfdb603d3c.dask-gateway:8786
```
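So the effective worker command line is the image's ENTRYPOINT (kept, since k8s command is unset) followed by the container args, which can be sketched as:

```python
# Sketch of how k8s composes the worker's command line: the Dockerfile
# ENTRYPOINT is prepended to the container `args` (values taken from the
# pod spec above; the scheduler address is cluster-specific).
entrypoint = ["tini", "-g", "--"]  # from the image's ENTRYPOINT
args = [
    "dask-gateway-worker",
    "--nthreads", "1",
    "--memory-limit", "3435973836",
    "--scheduler-address",
    "tls://dask-gateway-0c66d56247e044259ab211cfdb603d3c.dask-gateway:8786",
]
print(" ".join(entrypoint + args))
```

With tini as PID 1 and -g, termination signals should be forwarded to the whole dask-gateway-worker process group, so the entrypoint itself seems unlikely to be what keeps workers alive.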

consideRatio avatar May 04 '20 21:05 consideRatio

Hmmm, there goes that theory. It seems like there might be a bug in distributed. I'll take a look later this week. What version of distributed do you have in your image?

It seems like this is the trigger that is meant to make the workers die, but they don't. Are they meant to die from this? Oh.... wait...

Yeah, that's the scheduler saying that it has told the worker to close and disconnected it, but the worker is still remaining up (a bug).
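That retire path can be exercised from the client side with plain distributed, no k8s involved (a minimal sketch; LocalCluster stands in for a dask-gateway cluster here):

```python
# Sketch: retire a worker and check the scheduler's view of it afterwards.
# In a healthy setup the retired worker disappears from the scheduler AND
# its process exits; the bug discussed here is the worker staying up anyway.
from distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=2, processes=False, dashboard_address=None)
client = Client(cluster)

victim = sorted(client.scheduler_info()["workers"])[0]
client.retire_workers(workers=[victim])  # same path as the "Retire workers" log lines

print(len(client.scheduler_info()["workers"]))  # expect 1 remaining worker

client.close()
cluster.close()
```

On k8s the equivalent symptom is the scheduler reporting fewer workers than kubectl get pods shows for the cluster.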

jcrist avatar May 04 '20 21:05 jcrist

Version

Dask-Gateway Helm chart v0.7.1

My JupyterHub user environment uses this image, which is also used by the worker pods.

```
$ pip list | grep dask
dask                          2.15.0
dask-gateway                  0.7.1
dask-labextension             2.0.2
```
dask-labextension             2.0.2

In the Dockerfile I install these with pip:

```
        dask[complete] \
        dask-gateway==0.7.1 \
        dask_labextension \
```

I also install the dask-labextension JupyterLab extension, and run jupyter serverextension enable --sys-prefix --py dask_labextension.

consideRatio avatar May 04 '20 21:05 consideRatio

My JupyterHub user environment uses this image, which is also used by the worker pods.

Sorry, the important library is distributed (not dask) - what version of that do you have installed?

jcrist avatar May 04 '20 21:05 jcrist

```
$ pip list | grep distributed
distributed                   2.15.2
```

consideRatio avatar May 04 '20 21:05 consideRatio