
[BUG] - Unable to force Dask workers to run on AWS EKS specific nodegroups

Open limacarvalho opened this issue 2 years ago • 5 comments

OS system and architecture in which you are running QHub

Amazon Linux 2 on AWS

Expected behavior

:information_source: Be able to place Dask worker and/or scheduler pods on specific AWS EKS node groups, based on the information provided in the qhub-config.yaml file (dask_worker and node_groups profiles)

Actual behavior

Whenever a QHub deployment is done on AWS and a custom dask_worker entry/profile in the qhub-config.yaml file refers to one of the node_groups entries also defined in the config file, the Kubernetes cluster fails to place the Dask worker and scheduler pods in the appropriate EKS node group(s).

  • The qhub init command is executed with the appropriate arguments and generates the qhub-config.yaml file;
  • Some modifications are made to the config file; in this case, EKS node_groups profiles are added;
  • Additionally, dask_worker profiles are added to the config file, each with a nodeSelector key/value pair referencing the desired Kubernetes node group where the Dask scheduler and/or worker pods should run, as described in the documentation -> Setting specific dask workers to run on a nodegroup;
  • The qhub deploy -c qhub-config.yaml --disable-render command (I normally render the files beforehand because some VPC settings need changing in my setup) is executed successfully and the deployment goes through as expected;
  • When shared/examples/dask-gateway.ipynb is used to test Dask, with the options set to Environment = filesystem/dask and Cluster Profile = GPU Worker / 4xCPU Cores / 30GB MEM / 1x GPU (the gpu custom profile created in the qhub config), Dask Gateway tries to place the Dask worker pod in the worker node group instead of gpu. I've tried different configurations, including specifying the same node group in both scheduler_extra_pod_config and worker_extra_pod_config; the notebook test boils down to roughly the sketch below.
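
A minimal sketch of that notebook test, assuming only the dask_gateway client API and the option names (conda_environment, profile) that the gateway config shown further down reads; the gateway address and auth are picked up from the QHub JupyterLab environment:

from dask_gateway import Gateway

# Connect to the deployment's Dask Gateway (address/auth come from the
# JupyterLab environment inside QHub).
gateway = Gateway()

options = gateway.cluster_options()
options.conda_environment = "filesystem/dask"  # "Environment" dropdown in the notebook
options.profile = "GPU Worker / 4xCPU Cores / 30GB MEM / 1x GPU"  # "Cluster Profile" dropdown

cluster = gateway.new_cluster(options)
cluster.scale(1)  # the resulting worker pod ends up on the "worker" node group, not "gpu"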

The qhub-config.yaml file looks something like this:

project_name: my-qhub-deploy
provider: aws
...
terraform_state:
  type: remote
namespace: dev
qhub_version: 0.4.3
amazon_web_services:
  region: us-east-1
  kubernetes_version: '1.22'
  node_groups:
    general:
      instance: m6i.xlarge
      min_nodes: 1
      max_nodes: 1
    user:
      instance: m6i.xlarge
      min_nodes: 1
      max_nodes: 5
    worker:
      instance: m6i.2xlarge
      min_nodes: 2
      max_nodes: 10
    gpu:
      instance: g4dn.2xlarge
      min_nodes: 1
      max_nodes: 3
      gpu: true
...
  dask_worker:
    "GPU Worker / 4xCPU Cores / 30GB MEM / 1x GPU":
      worker_cores_limit: 4
      worker_cores: 4
      worker_memory_limit: 30G
      worker_memory: 30G
      worker_threads: 6
      scheduler_extra_pod_config:
        nodeSelector:
          "eks.amazonaws.com/nodegroup": worker
      worker_extra_pod_config:
        nodeSelector:
          "eks.amazonaws.com/nodegroup": gpu
    "Large Worker / 4xCPU Cores / 30GB MEM / no GPU":
      worker_cores_limit: 4
      worker_cores: 4
      worker_memory_limit: 30G
      worker_memory: 30G
      worker_threads: 8
      scheduler_extra_pod_config:
        nodeSelector:
          "eks.amazonaws.com/nodegroup": worker
      worker_extra_pod_config:
        nodeSelector:
          "eks.amazonaws.com/nodegroup": worker
...

While troubleshooting, I had a look at the K8s qhub-daskgateway-gateway secret, which holds a config.json key (a Base64-encoded JSON payload). It seems that the worker-node-group key is always the same, containing only the entry for "worker":

"worker-node-group": {
    "key": "eks.amazonaws.com/nodegroup",
    "value": "worker"
}
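
For anyone who wants to reproduce that check, this is roughly how the payload can be pulled out of the secret (a minimal sketch that shells out to kubectl; the dev namespace matches the config above):

import base64
import json
import subprocess

# Fetch the Base64-encoded config.json from the qhub-daskgateway-gateway secret.
encoded = subprocess.run(
    ["kubectl", "get", "secret", "qhub-daskgateway-gateway", "-n", "dev",
     "-o", "jsonpath={.data.config\\.json}"],
    capture_output=True, text=True, check=True,
).stdout

config = json.loads(base64.b64decode(encoded))
print(json.dumps(config["worker-node-group"], indent=4))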

It seems a bit strange, because the additional node group was specified under node_groups in the qhub-config.yaml and the appropriate nodeSelector was added to the dask_worker entries.

While troubleshooting, I decided to put together a flowchart to help me go through the process. Here it goes: troubleshooting_issue

I'm not sure if something went wrong on my end, but I ran a couple of clean deployments, just generating a qhub-config.yaml file using qhub init and adding the node_groups entries as well as the dask_worker profiles.

My apologies for such a long description. Kudos to everyone here for such a great piece of software that QHub is :rocket:

How to Reproduce the problem?

  • Created a new Python 3.9 env and installed qhub
  • Did a setup initialization for AWS, e.g. qhub init aws --project my-project --domain qhub.mydomain.com --ssl-cert-email [email protected]
  • Added a new entry under node_groups and a new entry under dask_worker to the auto-generated qhub-config.yaml file:
amazon_web_services:
  region: us-east-1
  kubernetes_version: '1.22'
  node_groups:
    general:
      instance: m6i.xlarge
      min_nodes: 1
      max_nodes: 1
    user:
      instance: m6i.xlarge
      min_nodes: 1
      max_nodes: 5
    worker:
      instance: m6i.2xlarge
      min_nodes: 1
      max_nodes: 10
    gpu:
      instance: g4dn.2xlarge
      min_nodes: 1
      max_nodes: 3
      gpu: true
...
  dask_worker:
    "GPU Worker":
      worker_cores_limit: 4
      worker_cores: 4
      worker_memory_limit: 30G
      worker_memory: 30G
      worker_threads: 6
      scheduler_extra_pod_config:
        nodeSelector:
          "eks.amazonaws.com/nodegroup": gpu
      worker_extra_pod_config:
        nodeSelector:
          "eks.amazonaws.com/nodegroup": gpu
...
  • Executed qhub deploy -c qhub-config.yaml
  • The system deployed successfully, but Dask failed to place the worker pod on the appropriate EKS node group, e.g. gpu; the check I used to confirm the placement is sketched below.
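
A sketch of that check (the dask- name filter is an assumption about how the gateway names its pods; the NODE column maps back to the EKS node group via the node's eks.amazonaws.com/nodegroup label):

import subprocess

# List all pods in the "dev" namespace together with the node each was scheduled on.
out = subprocess.run(
    ["kubectl", "get", "pods", "-n", "dev", "-o", "wide"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.splitlines():
    if "dask-" in line:  # scheduler/worker pods launched by Dask Gateway
        print(line)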

Command output

-

Versions and dependencies used.

  • qhub_version: 0.4.3
  • amazon_web_services (region): us-east-1
  • kubernetes_version: '1.22'

Compute environment

AWS

Integrations

Dask

Anything else?

:white_check_mark: I was able to work around this issue by adjusting the dask_gateway_config.py file that is part of the K8s configmap/qhub-daskgateway-gateway.

Before:

def base_node_group():
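    # Always reads the single, global "worker-node-group" entry from config.json,
    # so every profile ends up with nodeSelector eks.amazonaws.com/nodegroup=worker.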
    worker_node_group = {
        config["worker-node-group"]["key"]: config["worker-node-group"]["value"]
    }

    return {
        "scheduler_extra_pod_config": {"nodeSelector": worker_node_group},
        "worker_extra_pod_config": {"nodeSelector": worker_node_group},
    }

After:

def base_node_group(options):
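    # Look up the nodeSelector defined per profile in qhub-config.yaml instead of
    # the single global "worker-node-group" entry.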
    worker_node_group = config["profiles"][options.profile]["worker_extra_pod_config"]["nodeSelector"]
    scheduler_node_group = config["profiles"][options.profile]["scheduler_extra_pod_config"]["nodeSelector"]

    return {
        "scheduler_extra_pod_config": {"nodeSelector": scheduler_node_group},
        "worker_extra_pod_config": {"nodeSelector": worker_node_group},
    }

#...
# Adding the "options" object as argument to the base_node_group() function call
def worker_profile(options, user):
    namespace, name = options.conda_environment.split("/")
    return functools.reduce(
        deep_merge,
        [
            base_node_group(options),
            base_conda_store_mounts(namespace, name),
            base_username_mount(user.name),
            config["profiles"][options.profile],
            {"environment": {**options.environment_vars}},
        ],
        {},
    )
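
As a quick sanity check of the patched function, the following standalone sketch mimics the relevant slice of the decoded config.json (the config dict and the SimpleNamespace stand-in for options are mine, not the real gateway objects):

from types import SimpleNamespace

# Stand-in for the decoded config.json: only the pieces base_node_group() reads.
config = {
    "profiles": {
        "GPU Worker / 4xCPU Cores / 30GB MEM / 1x GPU": {
            "scheduler_extra_pod_config": {
                "nodeSelector": {"eks.amazonaws.com/nodegroup": "worker"}
            },
            "worker_extra_pod_config": {
                "nodeSelector": {"eks.amazonaws.com/nodegroup": "gpu"}
            },
        }
    }
}

def base_node_group(options):
    worker_node_group = config["profiles"][options.profile]["worker_extra_pod_config"]["nodeSelector"]
    scheduler_node_group = config["profiles"][options.profile]["scheduler_extra_pod_config"]["nodeSelector"]
    return {
        "scheduler_extra_pod_config": {"nodeSelector": scheduler_node_group},
        "worker_extra_pod_config": {"nodeSelector": worker_node_group},
    }

options = SimpleNamespace(profile="GPU Worker / 4xCPU Cores / 30GB MEM / 1x GPU")
print(base_node_group(options))
# scheduler pods get pinned to the "worker" node group, worker pods to "gpu"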

With this change, the worker pod is placed on the node group specified in the nodeSelector key in qhub-config.yaml. Additionally, I can also specify which node group the scheduler pod runs on, which increases the flexibility of the setup (was this the idea when this code was initially pushed?)

Any info or help needed, always feel free to reach out! Many thanks!

limacarvalho avatar Aug 24 '22 21:08 limacarvalho

Hey @limacarvalho, thank you for such a detailed bug report, we appreciate it very much :)

The evidence you've presented is compelling! To double-check, were you able to confirm that the dask worker running in the gpu node group successfully completed whatever job was submitted to it? (If you want to test this quickly, you can use this notebook.)

We'd happily accept your fix if you're interested in opening a PR.

iameskild avatar Aug 25 '22 04:08 iameskild

Hi @limacarvalho, are you still interested in contributing your fix as a PR? 😊

iameskild avatar Oct 04 '22 16:10 iameskild

Hi @iameskild omg so sorry for the very late feedback on this one, and thanks for your input on this 🙏 I've tested recently and it seems the fix will be as easy as I mentioned before. I'm going to fork, do a final clean deployment and open up a PR. Cheers 🍻

limacarvalho avatar Oct 06 '22 19:10 limacarvalho

That's wonderful! Thank you @limacarvalho :)

iameskild avatar Oct 07 '22 00:10 iameskild

Hi @iameskild,

Apologies for the delay in pushing the code. I was trying to deploy a clean v0.4.4 in order to run an end-to-end test, but somehow I got stuck and was unable to deploy due to some errors (Attempt 4 failed connecting to keycloak master realm...InsecureRequestWarning: Unverified HTTPS request is being made to host 'dev.limacarvalho.com'. Adding certificate verification is strongly advised). I still need to figure out whether I'm doing something wrong, whether my env is messed up, or whether something related to SSL certs has changed for 0.4.4.

I will keep you updated here about the test result 🙏

limacarvalho avatar Oct 11 '22 13:10 limacarvalho