[BUG] - Why doesn't my code recognize the GPU on Nebari?
Describe the bug
I purposely named this issue the name of this missing page: https://www.nebari.dev/docs/how-tos/faq#why-doesnt-my-code-recognize-the-gpus-on-nebari 😄
We deployed Nebari 2024.03.03 on AWS and we fired up a GPU server successfully (g4dn.xlarge).
We built an environment following these excellent instructions: https://www.nebari.dev/docs/how-tos/pytorch-best-practices/ (although this page contains the broken link above)
When we conda list the environment, it looks good:
08:38 $ conda activate global-pangeo-ml
(global-pangeo-ml) rsignell:~
08:38 $ conda list cuda
# packages in environment at /home/conda/global/envs/global-pangeo-ml:
#
# Name Version Build Channel
cuda-cudart 11.8.89 0 nvidia
cuda-cupti 11.8.87 0 nvidia
cuda-libraries 11.8.0 0 nvidia
cuda-nvrtc 11.8.89 0 nvidia
cuda-nvtx 11.8.86 0 nvidia
cuda-runtime 11.8.0 0 nvidia
nvidia-cuda-cupti-cu12 12.1.105 pypi_0 pypi
nvidia-cuda-nvrtc-cu12 12.1.105 pypi_0 pypi
nvidia-cuda-runtime-cu12 12.1.105 pypi_0 pypi
pytorch-cuda 11.8 h7e8668a_5 pytorch
but when we run:
torch.cuda.is_available()
it returns False.
Is it clear what we did wrong?
Or what we should do to debug?
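For reference, here is the quick triage we can run from a notebook on the GPU server to narrow down whether the problem is the PyTorch build or the driver (a sketch using only standard torch and stdlib calls):
# Sketch: narrow down whether the problem is the PyTorch build or the driver.
import shutil
import torch

print("torch version:     ", torch.__version__)
print("built with CUDA:   ", torch.version.cuda)          # None means a CPU-only build
print("CUDA available:    ", torch.cuda.is_available())
print("visible GPU count: ", torch.cuda.device_count())
print("nvidia-smi on PATH:", shutil.which("nvidia-smi"))   # None means no driver tools in the image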
Expected behavior
See above
OS and architecture in which you are running Nebari
Linux
How to Reproduce the problem?
See above
Command output
No response
Versions and dependencies used.
conda 23.3.1
kubernetes 1.29
nebari 2024.03.03
Compute environment
AWS
Integrations
No response
Anything else?
No response
Not sure if this is the answer, but have you tried setting
variables:
CONDA_OVERRIDE_CUDA: "12.0"
in your environment spec?
> We built an environment following these excellent instructions: nebari.dev/docs/how-tos/pytorch-best-practices (although this page contains the broken link above)
I'm not sure if it'll resolve the issue you're seeing, but the correct link is https://www.nebari.dev/docs/faq/#why-doesnt-my-code-recognize-the-gpus-on-nebari
conda-forge will install the CPU version of PyTorch unless you use that env flag listed above. This happens because conda-store builds the env on a non-gpu worker and conda-forge detects that there is no GPU present.
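A quick way to confirm which variant the solver actually picked, without needing a GPU, is to look at the build strings (a sketch that just shells out to conda list; swap in your own environment name):
# Sketch: CUDA builds of pytorch have "cuda" in the build string, while
# conda-forge CPU builds have "cpu" in it.
import json
import subprocess

out = subprocess.run(
    ["conda", "list", "--json", "-n", "global-pangeo-ml", "pytorch"],
    capture_output=True, text=True, check=True,
)
for pkg in json.loads(out.stdout):
    print(pkg["name"], pkg["version"], pkg.get("build_string", ""), pkg.get("channel", ""))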
@rsignell the new link Adam shared has those instructions for the different possible versions, can you check if that solves the problem you are encountering?
BTW, I opened an issue for the broken link - https://github.com/nebari-dev/nebari-docs/issues/426
@viniciusdc, yes, I used the PyTorch installation matrix recommended in the Nebari docs to get the correct package versions, and then I tried to create the simplest possible conda environment for PyTorch on a Nebari GPU node:
channels:
- pytorch
- nvidia
- conda-forge
dependencies:
- python=3.11
- pytorch::pytorch
- pytorch::pytorch-cuda=11.8
- numpy
- ipykernel
variables:
CONDA_OVERRIDE_CUDA: "12.0"
It builds without errors but, alas, doesn't recognize CUDA.
Also, this page https://www.nebari.dev/docs/faq/#why-doesnt-my-code-recognize-the-gpus-on-nebari seems to provide conflicting information: on the one hand it suggests using pytorch-gpu, but that seems at odds with the suggestion to follow https://www.nebari.dev/docs/how-tos/pytorch-best-practices/, which tells you to use the PyTorch installation matrix, which produces:
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
Those both need updating. I recently spent a while getting working environments on another project; let me do a few tests and then update here.
Troubleshooting.
First, make sure you actually have a GPU instance running with NVIDIA drivers. You can do this by running nvidia-smi from a terminal.
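If it's easier to check from a notebook than from a terminal, something along these lines does the same driver check (a sketch; it only wraps nvidia-smi and has nothing Nebari-specific in it):
# Sketch: confirm the NVIDIA driver is present and GPUs are visible in this pod.
import shutil
import subprocess

smi = shutil.which("nvidia-smi")
if smi is None:
    print("nvidia-smi not found -- the pod is probably not on a GPU node, "
          "or the drivers were never installed on it")
else:
    # -L lists each GPU the driver can see
    print(subprocess.run([smi, "-L"], capture_output=True, text=True).stdout)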
Once you have done this, you can install a PyTorch environment in one of three ways. The key issue here is that the pytorch channel and the conda-forge channel use different naming conventions:
- In the `pytorch` channel, the CPU parts are called `pytorch`, the GPU parts are called `pytorch-cuda`, and both need to be installed.
- In the `conda-forge` channel, the CPU version is called `pytorch-cpu`, the GPU version is called `pytorch-gpu`, and `pytorch` is a metapackage that installs whichever is needed. However, if you build the environment on a machine that doesn't have a GPU (as conda-store does), conda-forge tries to be clever and installs the non-GPU version even if you specify `pytorch-gpu`.

1. Use the `pytorch`, `nvidia`, and `defaults` channels. Do not use `conda-forge`.
channels:
- pytorch
- nvidia
- defaults
dependencies:
- python=3.11
- pytorch
- pytorch-cuda
- ipykernel
variables: {}
2. Use the `pytorch`, `nvidia`, and `conda-forge` channels, and pin both `pytorch` and `pytorch-cuda` to come from the pytorch channel; otherwise the environment accidentally gets the CPU-only `pytorch` from conda-forge.
channels:
- pytorch
- nvidia
- conda-forge
dependencies:
- python=3.11
- pytorch::pytorch
- pytorch::pytorch-cuda
- ipykernel
variables: {}
3. Only use the `conda-forge` channel, but set `CONDA_OVERRIDE_CUDA: "12.0"` to force the GPU version.
channels:
- conda-forge
dependencies:
- python=3.11
- pytorch
- ipykernel
variables:
CONDA_OVERRIDE_CUDA: "12.0"
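For context on why option 3 works: `CONDA_OVERRIDE_CUDA` overrides conda's `__cuda` virtual package, so the solver acts as if a CUDA 12.0 driver were present even on the GPU-less conda-store build worker. You can see the effect locally with a sketch like this (it just shells out to conda info and filters for __cuda):
# Sketch: show the __cuda virtual package that conda reports with the override set.
import os
import subprocess

env = dict(os.environ, CONDA_OVERRIDE_CUDA="12.0")
out = subprocess.run(["conda", "info"], capture_output=True, text=True, env=env)
print("\n".join(line for line in out.stdout.splitlines() if "__cuda" in line))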
I've tested all 3 of these on an AWS deployment with v2024.3.3 and conda-store v2024.3.1.
It is possible @rsignell-usgs's environment failed because it specified both pytorch::pytorch-cuda=11.8 and CONDA_OVERRIDE_CUDA: "12.0".
I'm testing that now.
Actually, @rsignell-usgs's environment worked as well. Rich, can you run nvidia-smi and post the output?
channels:
- pytorch
- nvidia
- conda-forge
dependencies:
- python=3.11
- pytorch::pytorch
- pytorch::pytorch-cuda=11.8
- numpy
- ipykernel
variables:
CONDA_OVERRIDE_CUDA: "12.0"
Here is a redacted AWS config YAML for the deployment I was on.
amazon_web_services:
region: us-gov-west-1
kubernetes_version: '1.26'
node_groups:
general:
instance: m5.2xlarge
min_nodes: 2
max_nodes: 5
user:
instance: m5.xlarge
min_nodes: 0
max_nodes: 50
single_subnet: false
worker:
instance: m5.xlarge
min_nodes: 0
max_nodes: 50
single_subnet: false
gpu-tesla-g4:
instance: g4dn.xlarge
min_nodes: 0
max_nodes: 5
single_subnet: false
gpu: true
gpu-tesla-g4-4x:
instance: g4dn.12xlarge
min_nodes: 0
max_nodes: 5
single_subnet: false
gpu: true
gpu-tesla-g3-2x:
instance: g3.8xlarge
min_nodes: 0
max_nodes: 5
single_subnet: false
gpu: true
profiles:
jupyterlab:
- display_name: Micro Instance
access: yaml
groups:
- developer
- admin
description: Stable environment with 0.5-1 cpu / 0.5-1 GB ram
kubespawner_override:
cpu_limit: 1
cpu_guarantee: 0.5
mem_limit: 1G
mem_guarantee: 0.5G
node_selector:
"dedicated": "user"
- display_name: Small Instance
description: Stable environment with 1.5-2 cpu / 6-8 GB ram
default: true
kubespawner_override:
cpu_limit: 2
cpu_guarantee: 1.5
mem_limit: 8G
mem_guarantee: 6G
node_selector:
"dedicated": "user"
- display_name: Medium Instance
description: Stable environment with 3-4 cpu / 12-16 GB ram
kubespawner_override:
cpu_limit: 4
cpu_guarantee: 3
mem_limit: 16G
mem_guarantee: 12G
node_selector:
"dedicated": "user"
- display_name: G4 GPU Instance 1x
access: yaml
groups:
- gpu-access
description: 4 cpu / 16GB RAM / 1 Nvidia T4 GPU (16 GB GPU RAM)
kubespawner_override:
image: quay.io/nebari/nebari-jupyterlab-gpu:2024.3.3
cpu_limit: 4
cpu_guarantee: 3
mem_limit: 16G
mem_guarantee: 10G
extra_pod_config:
volumes:
- name: "dshm"
emptyDir:
medium: "Memory"
sizeLimit: "2Gi"
extra_container_config:
volumeMounts:
- name: "dshm"
mountPath: "/dev/shm"
extra_resource_limits:
nvidia.com/gpu: 1
node_selector:
beta.kubernetes.io/instance-type: "g4dn.xlarge"
- display_name: G4 GPU Instance 4x
access: yaml
groups:
- gpu-access
description: 48 cpu / 192GB RAM / 4 Nvidia T4 GPU (64 GB GPU RAM)
kubespawner_override:
image: quay.io/nebari/nebari-jupyterlab-gpu:2024.3.3
cpu_limit: 48
cpu_guarantee: 40
mem_limit: 192G
mem_guarantee: 150G
extra_pod_config:
volumes:
- name: "dshm"
emptyDir:
medium: "Memory"
sizeLimit: "2Gi"
extra_container_config:
volumeMounts:
- name: "dshm"
mountPath: "/dev/shm"
extra_resource_limits:
nvidia.com/gpu: 4
node_selector:
beta.kubernetes.io/instance-type: "g4dn.12xlarge"
- display_name: G3 GPU Instance 2x
access: yaml
groups:
- gpu-access
description: 32 cpu / 244GB RAM / 2 Nvidia Tesla M60 GPU (16 GB GPU RAM)
kubespawner_override:
image: quay.io/nebari/nebari-jupyterlab-gpu:2024.3.3
cpu_limit: 32
cpu_guarantee: 30
mem_limit: 244G
mem_guarantee: 200G
extra_pod_config:
volumes:
- name: "dshm"
emptyDir:
medium: "Memory"
sizeLimit: "2Gi"
extra_container_config:
volumeMounts:
- name: "dshm"
mountPath: "/dev/shm"
extra_resource_limits:
nvidia.com/gpu: 2
node_selector:
beta.kubernetes.io/instance-type: "g3.8xlarge"
cc: @pavithraes re: conflicting GPU best practices pages in docs.
@dharhas thanks for this info! Indeed, when I open a terminal, activate our ML environment, and type nvidia-smi, I get "command not found". When I google that, it says that if nvidia-smi is not found I need to install the nebari-utils package on the system. Is that expected to be found in the base GPU container?
My config section for the GPU instance looks like:
- display_name: G4 GPU Instance 1x
description: 4 cpu / 16GB RAM / 1 Nvidia T4 GPU (16 GB GPU RAM)
kubespawner_override:
image: quay.io/nebari/nebari-jupyterlab-gpu:2024.3.3
cpu_limit: 4
cpu_guarantee: 3
mem_limit: 16G
mem_guarantee: 10G
extra_pod_config:
volumes:
- name: "dshm"
emptyDir:
medium: "Memory"
sizeLimit: "2Gi"
extra_container_config:
volumeMounts:
- name: "dshm"
mountPath: "/dev/shm"
node_selector:
"dedicated": "gpu-1x-t4"
I notice we don't have these lines in our config:
extra_resource_limits:
nvidia.com/gpu: 1
because we took those out while trying to get the GPU instance to launch, right @pt247?
Hi @rsignell, I will review this today as part of the above issue and follow up once I test those steps.
Thanks @viniciusdc ! I'm hoping to use this on Thursday for my short course!
@rsignell just as a follow-up, I think I found the issue with the config above. I will test the config that should work now and paste it here for you to test as well.
@rsignell I just tested the following on a fresh install in AWS. Let me know if this fixes your problem:
- your node selectors need to match the node group name, not the instance type:
node_selector:
"dedicated": "gpu-tesla-g4" # based on what your node groups look like from the comments above
(Here's an example config file that worked for me:)
amazon_web_services:
kubernetes_version: "1.29"
region: us-east-1
node_groups:
...
gpu-tesla-g4:
instance: g4dn.xlarge
min_nodes: 0
max_nodes: 5
single_subnet: false
gpu: true
profiles:
jupyterlab:
...
- display_name: G4 GPU Instance 1x
description: 4 cpu / 16GB RAM / 1 Nvidia T4 GPU (16 GB GPU RAM)
kubespawner_override:
image: quay.io/nebari/nebari-jupyterlab-gpu:2024.3.3
cpu_limit: 4
cpu_guarantee: 3
mem_limit: 16G
mem_guarantee: 10G
extra_pod_config:
volumes:
- name: "dshm"
emptyDir:
medium: "Memory"
sizeLimit: "2Gi"
extra_container_config:
volumeMounts:
- name: "dshm"
mountPath: "/dev/shm"
extra_resource_limits:
nvidia.com/gpu: 1
node_selector:
"dedicated": "gpu-tesla-g4"
- As Dharhas mentioned above, this is an environment that builds and works:
channels:
- pytorch
- nvidia
- conda-forge
dependencies:
- python=3.11
- pytorch::pytorch
- pytorch::pytorch-cuda
- ipykernel
variables: {}
I added a recording with a basic PyTorch execution example as well:
https://github.com/nebari-dev/nebari/assets/51954708/babe9200-3a6b-4a4b-ab8e-82e284827457
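For anyone following along without watching the video, the check in the recording is essentially this kind of smoke test (a minimal sketch, not the exact code from the recording):
# Minimal GPU smoke test: put a tensor on the GPU and run a matmul.
import torch

assert torch.cuda.is_available(), "CUDA is not visible to PyTorch"
device = torch.device("cuda:0")
print("Using:", torch.cuda.get_device_name(device))

x = torch.randn(1024, 1024, device=device)
y = x @ x
torch.cuda.synchronize()
print("matmul OK, result lives on", y.device)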
Following up on our previous discussions, I met with Rich yesterday to talk about the GPU spawning issue. We ended up fixing the problem when I shared my config with him. Later, by comparing the diff between the two configs, I identified the main issues as follows:
- One of the profiles attempted to reference a nonexistent node_group due to a typo.
- The GPU node group was missing a gpu: true key, preventing the driver installation daemon from installing necessary dependencies for GPU connectivity, which is why nvidia-smi was not functioning.
- Interestingly, it appears possible to request a GPU even without the gpu: true flag; however, the request fails due to missing drivers. Additionally, specifying `nvidia.com/gpu: <number-of-gpus>` causes the profile to time out.
What can we do moving forward? We must improve our schema to ensure it correctly links and validates such relationships. Once the schema is reworked, the last issue should be resolved, as it likely stems from unsupported or unexpected configuration scenarios.
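To make that concrete, the cross-check could look roughly like this (a hypothetical sketch, not Nebari's actual validation code; it only uses the config keys that appear in this thread):
# Hypothetical sketch: check that each jupyterlab profile points at a real node
# group, and that profiles requesting nvidia.com/gpu target a gpu: true group.
import yaml

with open("nebari-config.yaml") as f:
    config = yaml.safe_load(f)

node_groups = config["amazon_web_services"]["node_groups"]

for profile in config["profiles"]["jupyterlab"]:
    name = profile["display_name"]
    override = profile.get("kubespawner_override", {})
    target = override.get("node_selector", {}).get("dedicated")
    wants_gpu = "nvidia.com/gpu" in override.get("extra_resource_limits", {})

    if target is not None and target not in node_groups:
        print(f"{name}: node_selector points at unknown node group {target!r}")
    elif wants_gpu and not node_groups.get(target, {}).get("gpu", False):
        print(f"{name}: requests nvidia.com/gpu but {target!r} is not a gpu: true node group")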
@rsignell I will keep this issue open until https://github.com/nebari-dev/nebari-docs/issues/427 is addressed
https://github.com/nebari-dev/nebari-docs/pull/471 addresses the issue with the setup of the conda environments and code execution within the JupyterLab profiles. However, there is still insufficient documentation regarding Nebari administration, specifically setting up the AWS node groups correctly.
Since this has been addressed in the discussion above and the remaining problem is on the docs side, https://github.com/nebari-dev/nebari-docs/issues/417 will track the final work required.
Feel free to re-open this if extended discussion is required.