
[BUG] - extra_resource_limits in the JupyterHub profile in config prevents node autoscaling

Open pt247 opened this issue 1 year ago · 5 comments

Describe the bug

extra_resource_limits in the JupyterHub profile in config prevents node autoscaling.

The following config fails to launch the GPU instance.

Say you have a node group like so:

  node_groups:
    heavy-weight:
      instance: g4dn.xlarge
      min_nodes: 0
      max_nodes: 50
      single_subnet: false

And a profile like so:

profiles:
  - display_name: Heavy Instance
    description: 4 cpu / 16GB RAM / 1 Nvidia T4 GPU (16 GB GPU RAM)
    kubespawner_override:
      cpu_limit: 4
      cpu_guarantee: 3
      mem_limit: 16G
      mem_guarantee: 10G
      node_selector:
        dedicated: heavy-weight
      image: quay.io/nebari/nebari-jupyterlab-gpu:2024.3.2
      extra_pod_config:
        volumes:
        - name: "dshm"
          emptyDir:
            medium: "Memory"
            sizeLimit: "2Gi"
      extra_container_config:
        volumeMounts:
        - name: "dshm"
          mountPath: "/dev/shm"
      extra_resource_limits:
        nvidia.com/gpu: 1

The cluster autoscaler will time out and not scale up the node for the JupyterLab pod.
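
For reference, a quick way to watch whether the node group actually scales is to watch for nodes carrying the profile's node_selector label (a sketch; the label comes from the config above):

    kubectl get nodes -w -l dedicated=heavy-weight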

Expected behavior

The GPU node group should scale up to one node, and the GPU pod should be scheduled on it.

OS and architecture in which you are running Nebari

AWS; latest develop (0.1.dev1380+g6566e4c) or 2024.3.2

How to Reproduce the problem?

Nebari config:

provider: aws
namespace: dev
nebari_version: 2024.3.2
project_name: nebari-dev
domain: at.quansight.dev
ci_cd:
  type: none
terraform_state:
  type: remote
security:
  keycloak:
    initial_root_password: xxxx
  authentication:
    type: password
theme:
  jupyterhub:
    hub_title: Nebari - nebari-dev
    welcome: Welcome! Learn about Nebari's features and configurations in <a href="https://www.nebari.dev/docs/welcome">the
      documentation</a>. If you have any questions or feedback, reach the team on
      <a href="https://www.nebari.dev/docs/community#getting-support">Nebari's support
      forums</a>.
    hub_subtitle: Your open source data science platform, hosted on Amazon Web Services
default_images:
  jupyterhub: quay.io/nebari/nebari-jupyterhub:2024.3.2
  jupyterlab: quay.io/nebari/nebari-jupyterlab:2024.3.2
  dask_worker: quay.io/nebari/nebari-dask-worker:2024.3.2
amazon_web_services:
  region: eu-west-1
  kubernetes_version: '1.29'
  node_groups:
    general:
      instance: m5.2xlarge
      min_nodes: 1
      max_nodes: 5
    user:
      instance: m5.xlarge
      min_nodes: 0
      max_nodes: 50
      single_subnet: false
    worker:
      instance: m5.xlarge
      min_nodes: 0
      max_nodes: 50
      single_subnet: false
    fly-weight:
      instance: m5.xlarge
      min_nodes: 0
      max_nodes: 50
      single_subnet: false
    middle-weight:
      instance: m5.2xlarge
      min_nodes: 0
      max_nodes: 50
      single_subnet: false
    heavy-weight:
      instance: g4dn.xlarge
      min_nodes: 0
      max_nodes: 50
      single_subnet: false
profiles:
  jupyterlab:
  - display_name: Small Instance
    description: Stable environment with 1.5-2 cpu / 6-8 GB ram
    default: true
    kubespawner_override:
      cpu_limit: 2
      cpu_guarantee: 1.5
      mem_limit: 8G
      mem_guarantee: 6G
      node_selector:
        dedicated: fly-weight
  - display_name: Medium Instance
    description: Stable environment with 1.5-2 cpu / 6-8 GB ram
    kubespawner_override:
      cpu_limit: 4
      cpu_guarantee: 2
      mem_limit: 12G
      mem_guarantee: 8G
      node_selector:
        dedicated: middle-weight

  - display_name: Heavy Instance
    description: 4 cpu / 16GB RAM / 1 Nvidia T4 GPU (16 GB GPU RAM)
    kubespawner_override:
      cpu_limit: 4
      cpu_guarantee: 3
      mem_limit: 16G
      mem_guarantee: 10G
      node_selector:
        dedicated: heavy-weight
      image: quay.io/nebari/nebari-jupyterlab-gpu:2024.3.2
      extra_pod_config:
        volumes:
        - name: "dshm"
          emptyDir:
            medium: "Memory"
            sizeLimit: "2Gi"
      extra_container_config:
        volumeMounts:
        - name: "dshm"
          mountPath: "/dev/shm"
      extra_resource_limits:
        nvidia.com/gpu: 1

certificate:
  type: lets-encrypt
  acme_email: [email protected]
  acme_server: https://acme-v02.api.letsencrypt.org/directory
dns:
  provider: cloudflare
  auto_provision: true
jhub_apps:
  enabled: true

Command output

In the UI we can see the following messages.

> Your server is starting up.
> 
> You will be redirected automatically when it's ready for you.
> 
> 72% Complete
> 2024-03-20T14:35:14Z [Normal] pod didn't trigger scale-up: 5 node(s) didn't match Pod's node affinity/selector, 1 Insufficient nvidia.com/gpu
> 
> Event log
> Server requested
> 2024-03-20T14:29:37.480922Z [Warning] 0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
> 2024-03-20T14:30:12Z [Normal] pod didn't trigger scale-up: 1 Insufficient nvidia.com/gpu, 5 node(s) didn't match Pod's node affinity/selector
> 2024-03-20T14:35:06.738860Z [Warning] 0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
> 2024-03-20T14:35:14Z [Normal] pod didn't trigger scale-up: 5 node(s) didn't match Pod's node affinity/selector, 1 Insufficient nvidia.com/gpu
> Spawn failed: Timeout
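
To see why the autoscaler rejected the scale-up, its own logs are more detailed than the spawner events. A sketch of how to tail them; the namespace and deployment name are placeholders that depend on how the cluster-autoscaler Helm release was installed:

    kubectl -n <autoscaler-namespace> logs deploy/<cluster-autoscaler-deployment> --tail=200 | grep -i "scale-up"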

Logs from k9s (describe output for the dev/jupyter-at pod):

Name:                 jupyter-at
Namespace:            dev
Priority:             0
Priority Class Name:  jupyterhub-dev-default-priority
Service Account:      default
Node:                 <none>
Labels:               app=jupyterhub
                      chart=jupyterhub-3.2.1
                      component=singleuser-server
                      heritage=jupyterhub
                      hub.jupyter.org/network-access-hub=true
                      hub.jupyter.org/servername=
                      hub.jupyter.org/username=at
                      release=jupyterhub-dev
Annotations:          hub.jupyter.org/username: at
Status:               Pending
IP:                   
IPs:                  <none>
Init Containers:
  initialize-home-mount:
    Image:      busybox:1.31
    Port:       <none>
    Host Port:  <none>
    Command:
      sh
      -c
      mkdir -p /mnt/home/at && chmod 777 /mnt/home/at && cp -r /etc/skel/. /mnt/home/at
    Environment:  <none>
    Mounts:
      /etc/skel from skel (rw)
      /mnt/home/at from home (rw,path="home/at")
  initialize-shared-mounts:
    Image:      busybox:1.31
    Port:       <none>
    Host Port:  <none>
    Command:
      sh
      -c
      mkdir -p /mnt/shared/admin && chmod 777 /mnt/shared/admin && mkdir -p /mnt/shared/analyst && chmod 777 /mnt/shared/analyst && mkdir -p /mnt/shared/developer && chmod 777 /mnt/shared/developer && mkdir -p /mnt/shared/superadmin && chmod 777 /mnt/shared/superadmin && mkdir -p /mnt/shared/users && chmod 777 /mnt/shared/users
    Environment:  <none>
    Mounts:
      /mnt/shared/admin from home (rw,path="shared/admin")
      /mnt/shared/analyst from home (rw,path="shared/analyst")
      /mnt/shared/developer from home (rw,path="shared/developer")
      /mnt/shared/superadmin from home (rw,path="shared/superadmin")
      /mnt/shared/users from home (rw,path="shared/users")
  initialize-conda-store-mounts:
    Image:      busybox:1.31
    Port:       <none>
    Host Port:  <none>
    Command:
      sh
      -c
      mkdir -p /mnt/at && chmod 755 /mnt/at && mkdir -p /mnt/nebari-git && chmod 755 /mnt/nebari-git && mkdir -p /mnt/global && chmod 755 /mnt/global && mkdir -p /mnt/admin && chmod 755 /mnt/admin && mkdir -p /mnt/analyst && chmod 755 /mnt/analyst && mkdir -p /mnt/developer && chmod 755 /mnt/developer && mkdir -p /mnt/superadmin && chmod 755 /mnt/superadmin && mkdir -p /mnt/users && chmod 755 /mnt/users
    Environment:  <none>
    Mounts:
      /mnt/admin from conda-store (rw,path="admin")
      /mnt/analyst from conda-store (rw,path="analyst")
      /mnt/at from conda-store (rw,path="at")
      /mnt/developer from conda-store (rw,path="developer")
      /mnt/global from conda-store (rw,path="global")
      /mnt/nebari-git from conda-store (rw,path="nebari-git")
      /mnt/superadmin from conda-store (rw,path="superadmin")
      /mnt/users from conda-store (rw,path="users")
Containers:
  notebook:
    Image:      quay.io/nebari/nebari-jupyterlab-gpu:2024.3.2
    Port:       8888/TCP
    Host Port:  0/TCP
    Args:
      jupyterhub-singleuser
      --debug
    Limits:
      cpu:             4
      memory:          17179869184
      nvidia.com/gpu:  1
    Requests:
      cpu:             3
      memory:          10737418240
      nvidia.com/gpu:  1
    Environment:
      ARGO_BASE_HREF:                          /argo
      ARGO_NAMESPACE:                          dev
      ARGO_SERVER:                             at.quansight.dev:443
      ARGO_TOKEN:                              <set to the key 'token' in secret 'argo-admin.service-account-token'>                     Optional: false
      CONDA_STORE_SERVICE:                     <set to the key 'conda-store-service-name' in secret 'argo-workflows-conda-store-token'>  Optional: false
      CONDA_STORE_TOKEN:                       <set to the key 'conda-store-api-token' in secret 'argo-workflows-conda-store-token'>     Optional: false
      CPU_GUARANTEE:                           3.0
      CPU_LIMIT:                               4.0
      HOME:                                    /home/at
      JPY_API_TOKEN:                           15c1f15f3585423ab90271afbe7ae583
      JUPYTERHUB_ACTIVITY_URL:                 http://hub:8081/hub/api/users/at/activity
      JUPYTERHUB_ADMIN_ACCESS:                 1
      JUPYTERHUB_API_TOKEN:                    15c1f15f3585423ab90271afbe7ae583
      JUPYTERHUB_API_URL:                      http://hub:8081/hub/api
      JUPYTERHUB_BASE_URL:                     /
      JUPYTERHUB_CLIENT_ID:                    jupyterhub-user-at
      JUPYTERHUB_DEBUG:                        1
      JUPYTERHUB_DEFAULT_URL:                  /lab
      JUPYTERHUB_HOST:                         
      JUPYTERHUB_OAUTH_ACCESS_SCOPES:          ["access:servers!server=at/", "access:servers!user=at"]
      JUPYTERHUB_OAUTH_CALLBACK_URL:           /user/at/oauth_callback
      JUPYTERHUB_OAUTH_CLIENT_ALLOWED_SCOPES:  ]
      JUPYTERHUB_OAUTH_SCOPES:                 ["access:servers!server=at/", "access:servers!user=at"]
      JUPYTERHUB_ROOT_DIR:                     /home/at
      JUPYTERHUB_SERVER_NAME:                  
      JUPYTERHUB_SERVICE_PREFIX:               /user/at/
      JUPYTERHUB_SERVICE_URL:                  http://0.0.0.0:8888/user/at/
      JUPYTERHUB_SINGLEUSER_APP:               jupyter_server.serverapp.ServerApp
      JUPYTERHUB_USER:                         at
      JUPYTER_IMAGE:                           quay.io/nebari/nebari-jupyterlab-gpu:2024.3.2
      JUPYTER_IMAGE_SPEC:                      quay.io/nebari/nebari-jupyterlab-gpu:2024.3.2
      LD_PRELOAD:                              libnss_wrapper.so
      MEM_GUARANTEE:                           10737418240
      MEM_LIMIT:                               17179869184
      NB_UMASK:                                0002
      NSS_WRAPPER_GROUP:                       /tmp/group
      NSS_WRAPPER_PASSWD:                      /tmp/passwd
      PIP_REQUIRE_VIRTUALENV:                  true
      PREFERRED_USERNAME:                      at
      SHELL:                                   /bin/bash
    Mounts:
      /dev/shm from dshm (rw)
      /etc/dask from dask-etc (rw)
      /etc/ipython from etc-ipython (rw)
      /etc/jupyter from etc-jupyter (rw)
      /home/at from home (rw,path="home/at")
      /home/conda/admin from conda-store (rw,path="admin")
      /home/conda/analyst from conda-store (rw,path="analyst")
      /home/conda/at from conda-store (rw,path="at")
      /home/conda/developer from conda-store (rw,path="developer")
      /home/conda/global from conda-store (rw,path="global")
      /home/conda/nebari-git from conda-store (rw,path="nebari-git")
      /home/conda/superadmin from conda-store (rw,path="superadmin")
      /home/conda/users from conda-store (rw,path="users")
      /opt/conda/envs/default/share/jupyter/lab/settings from jupyterlab-settings (rw)
      /shared/admin from home (rw,path="shared/admin")
      /shared/analyst from home (rw,path="shared/analyst")
      /shared/developer from home (rw,path="shared/developer")
      /shared/superadmin from home (rw,path="shared/superadmin")
      /shared/users from home (rw,path="shared/users")
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  home:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  jupyterhub-dev-share
    ReadOnly:   false
  skel:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      etc-skel
    Optional:  false
  conda-store:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  conda-store-dev-share
    ReadOnly:   false
  dask-etc:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      dask-etc
    Optional:  false
  etc-ipython:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      etc-ipython
    Optional:  false
  etc-jupyter:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      etc-jupyter
    Optional:  false
  jupyterlab-settings:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      jupyterlab-settings
    Optional:  false
  dshm:
    Type:        EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:      Memory
    SizeLimit:   2Gi
QoS Class:       Burstable
Node-Selectors:  dedicated=heavy-weight
Tolerations:     hub.jupyter.org/dedicated=user:NoSchedule
                 hub.jupyter.org_dedicated=user:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                 nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason             Age   From                           Message
  ----     ------             ----  ----                           -------
  Warning  FailedScheduling   41s   jupyterhub-dev-user-scheduler  0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
  Normal   NotTriggerScaleUp  6s    cluster-autoscaler             pod didn't trigger scale-up: 1 Insufficient nvidia.com/gpu, 5 node(s) didn't match Pod's node affinity/selector

Versions and dependencies used.

$  kubectl version
Client Version: v1.29.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.1-eks-b9c9ed7

$ nebari -V
2024.3.2

Compute environment

None

Integrations

No response

Anything else?

Relevant links:

  1. https://aws.amazon.com/blogs/compute/running-gpu-accelerated-kubernetes-workloads-on-p3-and-p2-ec2-instances-with-amazon-eks/
  2. https://github.com/NVIDIA/k8s-device-plugin?tab=readme-ov-file#enabling-gpu-support-in-kubernetes
  3. https://github.com/kubernetes/autoscaler/issues/3869#issuecomment-825512767
  4. https://varlogdiego.com/kubernetes-and-gpu-nodes-on-aws

pt247 commented Mar 20 '24 14:03

I can see in the describe output for the dev/jupyter-at pod:

QoS Class:       Burstable
Node-Selectors:  dedicated=heavy-weight
Tolerations:     hub.jupyter.org/dedicated=user:NoSchedule
                 hub.jupyter.org_dedicated=user:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                 nvidia.com/gpu:NoSchedule op=Exists

The pods that do scale up successfully don't have this last toleration (nvidia.com/gpu:NoSchedule op=Exists) set.

Upon removing the following configuration, everything worked fine.

      extra_resource_limits:
        nvidia.com/gpu: 1

and nvidia.com/gpu:NoSchedule op=Exists is then also no longer present in the pod's tolerations.
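
To compare both sides of the scheduling decision, one can dump the pod's tolerations next to the node's taints (a sketch; <gpu-node-name> is a placeholder for whichever heavy-weight node comes up):

    kubectl -n dev describe pod jupyter-at | grep -A 6 Tolerations
    kubectl describe node <gpu-node-name> | grep -A 3 Taints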

pt247 commented Mar 20 '24 14:03

Okay, I found something interesting!

diff --git a/src/_nebari/stages/kubernetes_initialize/template/modules/cluster-autoscaler/main.tf b/src/_nebari/stages/kubernetes_initialize/template/modules/cluster-autoscaler/main.tf
index 29f982c..39d13c8 100644
--- a/src/_nebari/stages/kubernetes_initialize/template/modules/cluster-autoscaler/main.tf
+++ b/src/_nebari/stages/kubernetes_initialize/template/modules/cluster-autoscaler/main.tf
@@ -4,7 +4,7 @@ resource "helm_release" "autoscaler" {
 
   repository = "https://kubernetes.github.io/autoscaler"
   chart      = "cluster-autoscaler"
-  version    = "9.19.0"
+  version    = "9.36.0"

Upgrading the cluster-autoscaler version allows the node to come up, but the node's taints and the pod's tolerations still don't agree, so no pod placement takes place.
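
The autoscaler issue linked above (#3869) points at the scale-from-zero node template: with min_nodes: 0 there is no live node to inspect, so the autoscaler only learns a group's GPU capacity and taints from tags on the ASG. A sketch of the tags involved, using the node-template tag keys from the cluster-autoscaler AWS documentation (the ASG name is a placeholder, and whether Nebari's Terraform already sets these tags is not verified here):

    aws autoscaling create-or-update-tags --tags \
      "ResourceId=<heavy-weight-asg>,ResourceType=auto-scaling-group,PropagateAtLaunch=true,Key=k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/gpu,Value=1" \
      "ResourceId=<heavy-weight-asg>,ResourceType=auto-scaling-group,PropagateAtLaunch=true,Key=k8s.io/cluster-autoscaler/node-template/taint/nvidia.com/gpu,Value=present:NoSchedule"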

pt247 commented Mar 20 '24 18:03

Check if nvidia.com/gpu: 2 works.

pt247 commented Mar 26 '24 15:03

@viniciusdc this is the issue that we discussed today

marcelovilla commented Mar 26 '24 15:03

I was not able to reproduce this; the following configuration worked for me:

amazon_web_services:
  kubernetes_version: "1.29"
  region: us-east-1
  node_groups:
    ...
    gpu-tesla-g4:
      instance: g4dn.xlarge
      min_nodes: 0
      max_nodes: 5
      single_subnet: false
      gpu: true
profiles:
  jupyterlab:
    ...
    - display_name: G4 GPU Instance 1x
      description: 4 cpu / 16GB RAM / 1 Nvidia T4 GPU (16 GB GPU RAM)
      kubespawner_override:
        image: quay.io/nebari/nebari-jupyterlab-gpu:2024.3.3
        cpu_limit: 4
        cpu_guarantee: 3
        mem_limit: 16G
        mem_guarantee: 10G
        extra_pod_config:
          volumes:
            - name: "dshm"
              emptyDir:
                medium: "Memory"
                sizeLimit: "2Gi"
        extra_container_config:
          volumeMounts:
            - name: "dshm"
              mountPath: "/dev/shm"
        extra_resource_limits:
          nvidia.com/gpu: 1
        node_selector:
          "dedicated": "gpu-tesla-g4"

It's worth mentioning that JupyterHub will not scale up the profile and will time out if the GPU instance families are not enabled as part of the account's service quotas.
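
To rule the quota out, the relevant EC2 service quota can be checked from the CLI (a sketch; the JMESPath filter just matches quota names containing "G and VT" and may need adjusting):

    aws service-quotas list-service-quotas --service-code ec2 --region us-east-1 \
      --query "Quotas[?contains(QuotaName, 'G and VT')].[QuotaName,Value]"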

viniciusdc commented Apr 18 '24 13:04