
dask workers can be scheduled on hub nodes with default config

scottyhq opened this issue 5 years ago · 9 comments

Our current setup allows for dask pods on hub nodes: https://github.com/pangeo-data/pangeo-stacks/blob/master/base-notebook/binder/dask_config.yaml

This seems to be due to the node-purpose affinity using 'prefer' rather than 'require' when scheduling: https://github.com/dask/dask-kubernetes/blob/ec4666a4af5acad03c24b84aca4fcf8ccd791b4f/dask_kubernetes/objects.py#L177

which results in the following affinity on worker pods:

spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - preference:
          matchExpressions:
          - key: k8s.dask.org/node-purpose
            operator: In
            values:
            - worker
        weight: 100

Not sure how we'd modify the config file to get the stricter 'require' condition, like we have for notebook pods:

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: k8s.dask.org/node-purpose
            operator: In
            values:
            - worker

@jhamman, @TomAugspurger

scottyhq · Jul 15 '19 20:07

If you want to keep non-core pods off your core (hub) pool, you need to add a taint that only core pods can tolerate. I tend to just size the core pool to the smallest size that fits the hub pods; if you don't leave spare space, things won't try to schedule there. You can also tighten the node-purpose scheduling requirement for dask pods, but in my experience this is unnecessary.
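
For concreteness, here is a minimal sketch of that taint/toleration pattern in plain Kubernetes terms. The taint key and value (dedicated=core) and the node name are illustrative, not something pangeo-stacks sets up for you:

# Taint the core pool so only pods with a matching toleration can land there:
#   kubectl taint nodes <core-node-name> dedicated=core:NoSchedule
#
# Matching toleration to add to the hub (and any other core) pod spec:
spec:
  tolerations:
  - key: dedicated
    operator: Equal
    value: core
    effect: NoSchedule

Dask worker pods, which carry no such toleration, then can't be scheduled onto the core pool regardless of their affinity settings.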

For posterity, I should also link to this blog post that describes all of this in more detail: https://medium.com/pangeo/pangeo-cloud-cluster-design-9d58a1bf1ad3

jhamman · Jul 16 '19 03:07

@jhamman - I'm thinking we might want the core pool to autoscale eventually if we try to consolidate multiple hubs on a single EKS cluster. If we add a taint to the core pool, it seems like pods in the kube-system namespace might have trouble scheduling there (for example aws-node, tiller-deploy, cluster-autoscaler).
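
To make that concern concrete: a custom taint on the core pool is not tolerated automatically by kube-system workloads, so each of those components would need a toleration along these lines added to its pod spec (a sketch, reusing the illustrative dedicated key from above):

# Toleration a kube-system DaemonSet/Deployment would need in order to keep
# scheduling onto a tainted core pool:
spec:
  tolerations:
  - key: dedicated
    operator: Exists
    effect: NoSchedule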

Another approach is to expose match_node_purpose="require" in https://github.com/dask/dask-kubernetes/blob/ec4666a4af5acad03c24b84aca4fcf8ccd791b4f/dask_kubernetes/objects.py#L177

scottyhq · Jul 16 '19 04:07

@jhamman is there a downside to the hard affinity (at least optionally)? It couldn't be the default, but it seems useful as an option.

TomAugspurger · Jul 16 '19 13:07

FYI, rather than exposing it as a config / parameter in KubeCluster, we could document how to achieve it.

kind: Pod
metadata:
  labels:
    foo: bar
spec:
  restartPolicy: Never
  containers:
  - image: daskdev/dask:latest
    imagePullPolicy: IfNotPresent
    args: [dask-worker, --nthreads, '2', --no-bokeh, --memory-limit, 6GB, --death-timeout, '60']
    name: dask
    resources:
      limits:
        cpu: "2"
        memory: 6G
      requests:
        cpu: "2"
        memory: 6G
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: k8s.dask.org/node-purpose
            operator: In
            values:
            - worker

On master, that'll result in both the preferred and required affinity types being applied.

>>> a.pod_template.spec.affinity.node_affinity
{'preferred_during_scheduling_ignored_during_execution': [{'preference': {'match_expressions': [{'key': 'k8s.dask.org/node-purpose',
                                                                                                 'operator': 'In',
                                                                                                 'values': ['worker']}],
                                                                          'match_fields': None},
                                                           'weight': 100}],
 'required_during_scheduling_ignored_during_execution': {'node_selector_terms': [{'match_expressions': None,
                                                                                  'match_fields': None}]}}

I'm not sure how Kubernetes will handle that (presumably it's fine, just not the cleanest). Right now my preference would be to add a config option / argument to KubeCluster that's passed through to clean_pod_template, but I may be missing some context.

TomAugspurger · Jul 16 '19 13:07

> @jhamman is there a downside to the hard affinity (at least optionally)?

Not really. I think this is a fine approach. Of course, there is no way to enforce that users follow this pattern, so dask workers may still end up in your core pool with this approach.

jhamman · Jul 16 '19 21:07

In thinking about this a little more, it may be easier for some to simply add a taint to the core pool that the hub and ingress pods can tolerate.

jhamman · Aug 02 '19 18:08

> In thinking about this a little more, it may be easier for some to simply add a taint to the core pool that the hub and ingress pods can tolerate.

@jhamman are you doing this now on the google clusters?

scottyhq · Sep 17 '19 18:09

No. Not yet, but we could.

jhamman · Sep 17 '19 18:09

If you don't feel like modifying all of the JupyterHub services' configurations to include the toleration, this can also be accomplished by 1) adding a taint to the worker pools to keep core services from scheduling there, with corresponding tolerations added to the worker pods, and 2) adding a node selector to the worker pods with corresponding labels on the worker nodes. This pretty much guarantees that everything ends up on the right nodes without having to taint/tolerate the core services.
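
A rough sketch of that inverse pattern, reusing the k8s.dask.org/node-purpose label already used in this thread (the taint key is illustrative):

# 1) Worker nodes get a taint plus a label, e.g.:
#   kubectl taint nodes <worker-node> k8s.dask.org/dedicated=worker:NoSchedule
#   kubectl label nodes <worker-node> k8s.dask.org/node-purpose=worker
#
# 2) The dask worker pod template then tolerates the taint and selects the label:
spec:
  tolerations:
  - key: k8s.dask.org/dedicated
    operator: Equal
    value: worker
    effect: NoSchedule
  nodeSelector:
    k8s.dask.org/node-purpose: worker

The taint keeps core services off the worker nodes, and the nodeSelector pins the worker pods to them, so neither side requires changes to the JupyterHub core services.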

bgroenks96 · Dec 08 '19 15:12