[BUG] - extra_resource_limits in a JupyterHub profile in the config prevents node autoscaling
Describe the bug
Setting extra_resource_limits in a JupyterHub profile in the config prevents node autoscaling: the following configuration fails to launch the GPU instance.
Say you have a node group like so:
node_groups:
  heavy-weight:
    instance: g4dn.xlarge
    min_nodes: 0
    max_nodes: 50
    single_subnet: false
And a profile like so:
profiles:
  - display_name: Heavy Instance
    description: 4 cpu / 16GB RAM / 1 Nvidia T4 GPU (16 GB GPU RAM)
    kubespawner_override:
      cpu_limit: 4
      cpu_guarantee: 3
      mem_limit: 16G
      mem_guarantee: 10G
      node_selector:
        dedicated: heavy-weight
      image: quay.io/nebari/nebari-jupyterlab-gpu:2024.3.2
      extra_pod_config:
        volumes:
          - name: "dshm"
            emptyDir:
              medium: "Memory"
              sizeLimit: "2Gi"
      extra_container_config:
        volumeMounts:
          - name: "dshm"
            mountPath: "/dev/shm"
      extra_resource_limits:
        nvidia.com/gpu: 1
The cluster autoscaler does not scale up the node, and the Jupyter server spawn times out.
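For context, extra_resource_limits is passed through to the spawned container's resource limits, so the pod ends up requesting the GPU as an extended resource (the pod describe further down shows this). A rough sketch of the relevant part of the resulting container spec, not the exact rendered manifest:

resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    nvidia.com/gpu: 1  # Kubernetes defaults the request to the limit when only a limit is set

This is presumably why the autoscaler reports "1 Insufficient nvidia.com/gpu": the pod can only be scheduled on a node that advertises nvidia.com/gpu, and if the autoscaler cannot tell that a heavy-weight node would provide it when scaling the group up from zero, it declines to scale.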
Expected behavior
The heavy-weight GPU node group should scale up to one node, and the GPU pod should be placed on it.
OS and architecture in which you are running Nebari
AWS; latest develop (0.1.dev1380+g6566e4c) or 2024.3.2
How to Reproduce the problem?
Nebari config:
provider: aws
namespace: dev
nebari_version: 2024.3.2
project_name: nebari-dev
domain: at.quansight.dev
ci_cd:
  type: none
terraform_state:
  type: remote
security:
  keycloak:
    initial_root_password: xxxx
  authentication:
    type: password
theme:
  jupyterhub:
    hub_title: Nebari - nebari-dev
    welcome: Welcome! Learn about Nebari's features and configurations in <a href="https://www.nebari.dev/docs/welcome">the
      documentation</a>. If you have any questions or feedback, reach the team on
      <a href="https://www.nebari.dev/docs/community#getting-support">Nebari's support
      forums</a>.
    hub_subtitle: Your open source data science platform, hosted on Amazon Web Services
default_images:
  jupyterhub: quay.io/nebari/nebari-jupyterhub:2024.3.2
  jupyterlab: quay.io/nebari/nebari-jupyterlab:2024.3.2
  dask_worker: quay.io/nebari/nebari-dask-worker:2024.3.2
amazon_web_services:
  region: eu-west-1
  kubernetes_version: '1.29'
  node_groups:
    general:
      instance: m5.2xlarge
      min_nodes: 1
      max_nodes: 5
    user:
      instance: m5.xlarge
      min_nodes: 0
      max_nodes: 50
      single_subnet: false
    worker:
      instance: m5.xlarge
      min_nodes: 0
      max_nodes: 50
      single_subnet: false
    fly-weight:
      instance: m5.xlarge
      min_nodes: 0
      max_nodes: 50
      single_subnet: false
    middle-weight:
      instance: m5.2xlarge
      min_nodes: 0
      max_nodes: 50
      single_subnet: false
    heavy-weight:
      instance: g4dn.xlarge
      min_nodes: 0
      max_nodes: 50
      single_subnet: false
profiles:
  jupyterlab:
    - display_name: Small Instance
      description: Stable environment with 1.5-2 cpu / 6-8 GB ram
      default: true
      kubespawner_override:
        cpu_limit: 2
        cpu_guarantee: 1.5
        mem_limit: 8G
        mem_guarantee: 6G
        node_selector:
          dedicated: fly-weight
    - display_name: Medium Instance
      description: Stable environment with 1.5-2 cpu / 6-8 GB ram
      kubespawner_override:
        cpu_limit: 4
        cpu_guarantee: 2
        mem_limit: 12G
        mem_guarantee: 8G
        node_selector:
          dedicated: middle-weight
    - display_name: Heavy Instance
      description: 4 cpu / 16GB RAM / 1 Nvidia T4 GPU (16 GB GPU RAM)
      kubespawner_override:
        cpu_limit: 4
        cpu_guarantee: 3
        mem_limit: 16G
        mem_guarantee: 10G
        node_selector:
          dedicated: heavy-weight
        image: quay.io/nebari/nebari-jupyterlab-gpu:2024.3.2
        extra_pod_config:
          volumes:
            - name: "dshm"
              emptyDir:
                medium: "Memory"
                sizeLimit: "2Gi"
        extra_container_config:
          volumeMounts:
            - name: "dshm"
              mountPath: "/dev/shm"
        extra_resource_limits:
          nvidia.com/gpu: 1
certificate:
  type: lets-encrypt
  acme_email: [email protected]
  acme_server: https://acme-v02.api.letsencrypt.org/directory
dns:
  provider: cloudflare
  auto_provision: true
jhub_apps:
  enabled: true
Command output
In the UI we can see the following messages:
> Your server is starting up.
>
> You will be redirected automatically when it's ready for you.
>
> 72% Complete
> 2024-03-20T14:35:14Z [Normal] pod didn't trigger scale-up: 5 node(s) didn't match Pod's node affinity/selector, 1 Insufficient nvidia.com/gpu
>
> Event log
> Server requested
> 2024-03-20T14:29:37.480922Z [Warning] 0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
> 2024-03-20T14:30:12Z [Normal] pod didn't trigger scale-up: 1 Insufficient nvidia.com/gpu, 5 node(s) didn't match Pod's node affinity/selector
> 2024-03-20T14:35:06.738860Z [Warning] 0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
> 2024-03-20T14:35:14Z [Normal] pod didn't trigger scale-up: 5 node(s) didn't match Pod's node affinity/selector, 1 Insufficient nvidia.com/gpu
> Spawn failed: Timeout
Logs from k9s:
Name: jupyter-at
Namespace: dev
Priority: 0
Priority Class Name: jupyterhub-dev-default-priority
Service Account: default
Node: <none>
Labels: app=jupyterhub
chart=jupyterhub-3.2.1
component=singleuser-server
heritage=jupyterhub
hub.jupyter.org/network-access-hub=true
hub.jupyter.org/servername=
hub.jupyter.org/username=at
release=jupyterhub-dev
Annotations: hub.jupyter.org/username: at
Status: Pending
IP:
IPs: <none>
Init Containers:
initialize-home-mount:
Image: busybox:1.31
Port: <none>
Host Port: <none>
Command:
sh
-c
mkdir -p /mnt/home/at && chmod 777 /mnt/home/at && cp -r /etc/skel/. /mnt/home/at
Environment: <none>
Mounts:
/etc/skel from skel (rw)
/mnt/home/at from home (rw,path="home/at")
initialize-shared-mounts:
Image: busybox:1.31
Port: <none>
Host Port: <none>
Command:
sh
-c
mkdir -p /mnt/shared/admin && chmod 777 /mnt/shared/admin && mkdir -p /mnt/shared/analyst && chmod 777 /mnt/shared/analyst && mkdir -p /mnt/shared/developer && chmod 777 /mnt/shared/developer && mkdir -p /mnt/shared/superadmin && chmod 777 /mnt/shared/superadmin && mkdir -p /mnt/shared/users && chmod 777 /mnt/shared/users
Environment: <none>
Mounts:
/mnt/shared/admin from home (rw,path="shared/admin")
/mnt/shared/analyst from home (rw,path="shared/analyst")
/mnt/shared/developer from home (rw,path="shared/developer")
/mnt/shared/superadmin from home (rw,path="shared/superadmin")
/mnt/shared/users from home (rw,path="shared/users")
initialize-conda-store-mounts:
Image: busybox:1.31
Port: <none>
Host Port: <none>
Command:
sh
-c
mkdir -p /mnt/at && chmod 755 /mnt/at && mkdir -p /mnt/nebari-git && chmod 755 /mnt/nebari-git && mkdir -p /mnt/global && chmod 755 /mnt/global && mkdir -p /mnt/admin && chmod 755 /mnt/admin && mkdir -p /mnt/analyst && chmod 755 /mnt/analyst && mkdir -p /mnt/developer && chmod 755 /mnt/developer && mkdir -p /mnt/superadmin && chmod 755 /mnt/superadmin && mkdir -p /mnt/users && chmod 755 /mnt/users
Environment: <none>
Mounts:
/mnt/admin from conda-store (rw,path="admin")
/mnt/analyst from conda-store (rw,path="analyst")
/mnt/at from conda-store (rw,path="at")
/mnt/developer from conda-store (rw,path="developer")
/mnt/global from conda-store (rw,path="global")
/mnt/nebari-git from conda-store (rw,path="nebari-git")
/mnt/superadmin from conda-store (rw,path="superadmin")
/mnt/users from conda-store (rw,path="users")
Containers:
notebook:
Image: quay.io/nebari/nebari-jupyterlab-gpu:2024.3.2
Port: 8888/TCP
Host Port: 0/TCP
Args:
jupyterhub-singleuser
--debug
Limits:
cpu: 4
memory: 17179869184
nvidia.com/gpu: 1
Requests:
cpu: 3
memory: 10737418240
nvidia.com/gpu: 1
Environment:
ARGO_BASE_HREF: /argo
ARGO_NAMESPACE: dev
ARGO_SERVER: at.quansight.dev:443
ARGO_TOKEN: <set to the key 'token' in secret 'argo-admin.service-account-token'> Optional: false
CONDA_STORE_SERVICE: <set to the key 'conda-store-service-name' in secret 'argo-workflows-conda-store-token'> Optional: false
CONDA_STORE_TOKEN: <set to the key 'conda-store-api-token' in secret 'argo-workflows-conda-store-token'> Optional: false
CPU_GUARANTEE: 3.0
CPU_LIMIT: 4.0
HOME: /home/at
JPY_API_TOKEN: 15c1f15f3585423ab90271afbe7ae583
JUPYTERHUB_ACTIVITY_URL: http://hub:8081/hub/api/users/at/activity
JUPYTERHUB_ADMIN_ACCESS: 1
JUPYTERHUB_API_TOKEN: 15c1f15f3585423ab90271afbe7ae583
JUPYTERHUB_API_URL: http://hub:8081/hub/api
JUPYTERHUB_BASE_URL: /
JUPYTERHUB_CLIENT_ID: jupyterhub-user-at
JUPYTERHUB_DEBUG: 1
JUPYTERHUB_DEFAULT_URL: /lab
JUPYTERHUB_HOST:
JUPYTERHUB_OAUTH_ACCESS_SCOPES: ["access:servers!server=at/", "access:servers!user=at"]
JUPYTERHUB_OAUTH_CALLBACK_URL: /user/at/oauth_callback
JUPYTERHUB_OAUTH_CLIENT_ALLOWED_SCOPES: ]
JUPYTERHUB_OAUTH_SCOPES: ["access:servers!server=at/", "access:servers!user=at"]
JUPYTERHUB_ROOT_DIR: /home/at
JUPYTERHUB_SERVER_NAME:
JUPYTERHUB_SERVICE_PREFIX: /user/at/
JUPYTERHUB_SERVICE_URL: http://0.0.0.0:8888/user/at/
JUPYTERHUB_SINGLEUSER_APP: jupyter_server.serverapp.ServerApp
JUPYTERHUB_USER: at
JUPYTER_IMAGE: quay.io/nebari/nebari-jupyterlab-gpu:2024.3.2
JUPYTER_IMAGE_SPEC: quay.io/nebari/nebari-jupyterlab-gpu:2024.3.2
LD_PRELOAD: libnss_wrapper.so
MEM_GUARANTEE: 10737418240
MEM_LIMIT: 17179869184
NB_UMASK: 0002
NSS_WRAPPER_GROUP: /tmp/group
NSS_WRAPPER_PASSWD: /tmp/passwd
PIP_REQUIRE_VIRTUALENV: true
PREFERRED_USERNAME: at
SHELL: /bin/bash
Mounts:
/dev/shm from dshm (rw)
/etc/dask from dask-etc (rw)
/etc/ipython from etc-ipython (rw)
/etc/jupyter from etc-jupyter (rw)
/home/at from home (rw,path="home/at")
/home/conda/admin from conda-store (rw,path="admin")
/home/conda/analyst from conda-store (rw,path="analyst")
/home/conda/at from conda-store (rw,path="at")
/home/conda/developer from conda-store (rw,path="developer")
/home/conda/global from conda-store (rw,path="global")
/home/conda/nebari-git from conda-store (rw,path="nebari-git")
/home/conda/superadmin from conda-store (rw,path="superadmin")
/home/conda/users from conda-store (rw,path="users")
/opt/conda/envs/default/share/jupyter/lab/settings from jupyterlab-settings (rw)
/shared/admin from home (rw,path="shared/admin")
/shared/analyst from home (rw,path="shared/analyst")
/shared/developer from home (rw,path="shared/developer")
/shared/superadmin from home (rw,path="shared/superadmin")
/shared/users from home (rw,path="shared/users")
Conditions:
Type Status
PodScheduled False
Volumes:
home:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: jupyterhub-dev-share
ReadOnly: false
skel:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: etc-skel
Optional: false
conda-store:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: conda-store-dev-share
ReadOnly: false
dask-etc:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: dask-etc
Optional: false
etc-ipython:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: etc-ipython
Optional: false
etc-jupyter:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: etc-jupyter
Optional: false
jupyterlab-settings:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: jupyterlab-settings
Optional: false
dshm:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium: Memory
SizeLimit: 2Gi
QoS Class: Burstable
Node-Selectors: dedicated=heavy-weight
Tolerations: hub.jupyter.org/dedicated=user:NoSchedule
hub.jupyter.org_dedicated=user:NoSchedule
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 41s jupyterhub-dev-user-scheduler 0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
Normal NotTriggerScaleUp 6s cluster-autoscaler pod didn't trigger scale-up: 1 Insufficient nvidia.com/gpu, 5 node(s) didn't match Pod's node affinity/selector
Versions and dependencies used.
$ kubectl version
Client Version: v1.29.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.1-eks-b9c9ed7
$ nebari -V
2024.3.2
Compute environment
None
Integrations
No response
Anything else?
Relevant links:
- https://aws.amazon.com/blogs/compute/running-gpu-accelerated-kubernetes-workloads-on-p3-and-p2-ec2-instances-with-amazon-eks/
- https://github.com/NVIDIA/k8s-device-plugin?tab=readme-ov-file#enabling-gpu-support-in-kubernetes
- https://github.com/kubernetes/autoscaler/issues/3869#issuecomment-825512767
- https://varlogdiego.com/kubernetes-and-gpu-nodes-on-aws
In the describe output for the dev/jupyter-at pod I can see:
QoS Class: Burstable
Node-Selectors: dedicated=heavy-weight
Tolerations: hub.jupyter.org/dedicated=user:NoSchedule
             hub.jupyter.org_dedicated=user:NoSchedule
             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
             nvidia.com/gpu:NoSchedule op=Exists
The pods that do successfully trigger a scale-up don't have this last toleration (nvidia.com/gpu:NoSchedule op=Exists) set.
Upon removing the following configuration, everything worked fine:
extra_resource_limits:
  nvidia.com/gpu: 1
and nvidia.com/gpu:NoSchedule op=Exists is then also absent from the pod's tolerations.
Okay, I found something interesting!
diff --git a/src/_nebari/stages/kubernetes_initialize/template/modules/cluster-autoscaler/main.tf b/src/_nebari/stages/kubernetes_initialize/template/modules/cluster-autoscaler/main.tf
index 29f982c..39d13c8 100644
--- a/src/_nebari/stages/kubernetes_initialize/template/modules/cluster-autoscaler/main.tf
+++ b/src/_nebari/stages/kubernetes_initialize/template/modules/cluster-autoscaler/main.tf
@@ -4,7 +4,7 @@ resource "helm_release" "autoscaler" {
   repository = "https://kubernetes.github.io/autoscaler"
   chart = "cluster-autoscaler"
-  version = "9.19.0"
+  version = "9.36.0"
Upgrading the cluster-autoscaler version allows the node to come up, but the taints and tolerations still don't agree, so no pod placement takes place.
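For what it's worth, when a node group is sitting at zero nodes the AWS cluster-autoscaler has to infer what a new node would provide, and its AWS documentation describes node-template tags on the Auto Scaling group for exactly that purpose. A hedged sketch of what such tags could look like for this node group (tag keys taken from the cluster-autoscaler AWS docs; whether nebari sets or exposes these on its node groups is an assumption I have not checked):

# Illustrative Auto Scaling group tags for the heavy-weight group,
# hinting at what a node will offer before one exists:
k8s.io/cluster-autoscaler/node-template/label/dedicated: heavy-weight
k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/gpu: "1"

This would only address the scale-from-zero hint; the taint/toleration mismatch mentioned above is a separate problem.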
Check if nvidia.com/gpu: 2 works.
@viniciusdc this is the issue that we discussed today
I was not able to reproduce this; the following configuration worked for me:
amazon_web_services:
  kubernetes_version: "1.29"
  region: us-east-1
  node_groups:
    ...
    gpu-tesla-g4:
      instance: g4dn.xlarge
      min_nodes: 0
      max_nodes: 5
      single_subnet: false
      gpu: true
profiles:
  jupyterlab:
    ...
    - display_name: G4 GPU Instance 1x
      description: 4 cpu / 16GB RAM / 1 Nvidia T4 GPU (16 GB GPU RAM)
      kubespawner_override:
        image: quay.io/nebari/nebari-jupyterlab-gpu:2024.3.3
        cpu_limit: 4
        cpu_guarantee: 3
        mem_limit: 16G
        mem_guarantee: 10G
        extra_pod_config:
          volumes:
            - name: "dshm"
              emptyDir:
                medium: "Memory"
                sizeLimit: "2Gi"
        extra_container_config:
          volumeMounts:
            - name: "dshm"
              mountPath: "/dev/shm"
        extra_resource_limits:
          nvidia.com/gpu: 1
        node_selector:
          "dedicated": "gpu-tesla-g4"
It's worth mentioning that JupyterHub will not scale up the profile and the spawn will time out if the GPU instance families are not enabled in the account's service quotas.