[BUG] - GCP GPUs only enabled when set via `guest_accelerator`
Describe the bug
For GCP deployments that want to use A100 (or similar) GPUs, there is no way to ensure the NVIDIA drivers are installed (via this daemonset).
This is because we currently check whether the profile has `guest_accelerator` and only then deploy the daemonset mentioned above (see here). Most GPUs on GCP are attached to CPU instances and use `guest_accelerator` to specify the desired GPU; for GPUs like the A100, which come bundled with the `a2` machine family, this is not the case.
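The check described above can be illustrated with a minimal sketch (the function and field names here are hypothetical, not Nebari's actual code), showing why an `a2-highgpu-1g` node group never triggers the daemonset:

```python
def needs_nvidia_daemonset(node_groups: dict) -> bool:
    """Mirror the current behavior: deploy the nvidia-driver daemonset
    only if some node group declares a guest accelerator."""
    return any(
        bool(group.get("guest_accelerators"))
        for group in node_groups.values()
    )

# A T4 attached to an N1 instance is declared via guest_accelerators ...
n1_with_t4 = {
    "gpu-t4": {
        "instance": "n1-standard-8",
        "guest_accelerators": [{"type": "nvidia-tesla-t4", "count": 1}],
    }
}
# ... but an a2-highgpu-1g bundles its A100 into the machine type,
# so nothing here satisfies the check and the drivers are never installed.
a2_a100 = {"gpu-ampere-a100-x1": {"instance": "a2-highgpu-1g"}}

print(needs_nvidia_daemonset(n1_with_t4))  # True
print(needs_nvidia_daemonset(a2_a100))     # False
```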
Expected behavior
We need to ensure the NVIDIA drivers are installed even for GPUs like the A100.
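One possible direction (a sketch only, assuming GCP's convention that accelerator-optimized `a2` machine types ship with A100s pre-attached) is to also treat those machine families as GPU node groups when deciding whether to apply the daemonset:

```python
# Machine families whose GPUs are bundled into the instance type rather than
# declared via guest_accelerator (assumption based on GCP's A2 family).
GPU_BUILTIN_FAMILIES = ("a2-",)

def node_group_has_gpu(group: dict) -> bool:
    """Hypothetical replacement check: explicit guest accelerators OR an
    accelerator-optimized machine family both count as 'has a GPU'."""
    if group.get("guest_accelerators"):
        return True
    # a2-highgpu-1g, a2-megagpu-16g, ... carry NVIDIA A100s implicitly.
    return group.get("instance", "").startswith(GPU_BUILTIN_FAMILIES)

print(node_group_has_gpu({"instance": "a2-highgpu-1g"}))  # True
print(node_group_has_gpu({"instance": "n1-standard-8"}))  # False
```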
OS and architecture in which you are running Nebari
Ubuntu
How to Reproduce the problem?
On a Nebari cluster running on GCP, add the following profile and try to launch a server:
```yaml
google_cloud_platform:
  ...
  node_groups:
    gpu-ampere-a100-x1:
      instance: a2-highgpu-1g  # 1x 40 GB HBM2: Nvidia Ampere A100
      min_nodes: 0
      max_nodes: 1
profiles:
  jupyterlab:
    - display_name: A100 GPU Instance 1x
      description: GPU instance with 12cpu 85GB / 1 Nvidia A100 GPU
      kubespawner_override:
        ...
        node_selector:
          "cloud.google.com/gke-nodepool": "gpu-ampere-a100-x1"
```
Command output
No response
Versions and dependencies used.
No response
Compute environment
None
Integrations
No response
Anything else?
No response
@iameskild did we try passing the `guest_accelerator` field together with the `node_selector`?
Another option would be disabling the `guest_accelerator` check when a certain flag is passed (we could use the `node_selector` for this as well). Another direction would be refactoring both the validation logic and the way we pass the GPU config over.
What if we decoupled the gpu's from the profile section, and made it have its own logic?
```yaml
GPU:
  enabled: true
  - profile: ......  # target profile to use GPU
    family_type: a2-highgpu-1g|gpu-ampere-a100-x1  # or we can pass a node_selector value instead and we do the logic before sending to terraform
```
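A decoupled section like the one sketched above could be resolved into per-profile `node_selector` overrides before anything is handed to terraform. A rough illustration (the schema and all names here are hypothetical, just to show the shape of that resolution step):

```python
def resolve_gpu_targets(gpu_config: dict) -> list:
    """Turn a proposed top-level GPU section into node_selector overrides,
    one per targeted profile (hypothetical schema, not Nebari's)."""
    if not gpu_config.get("enabled"):
        return []
    targets = []
    for entry in gpu_config.get("targets", []):
        targets.append({
            "profile": entry["profile"],
            "node_selector": {
                "cloud.google.com/gke-nodepool": entry["node_group"],
            },
        })
    return targets

cfg = {
    "enabled": True,
    "targets": [
        {"profile": "A100 GPU Instance 1x", "node_group": "gpu-ampere-a100-x1"},
    ],
}
print(resolve_gpu_targets(cfg))
```

Doing the mapping on the Nebari side would keep the `guest_accelerator` check out of the picture entirely: the daemonset decision could then be driven by whether any resolved target exists.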
Thanks for the feedback @viniciusdc!
I haven't tried adding `guest_accelerator`, partly because it wasn't necessary and partly because adding it might have undesired effects.
I think the `guest_accelerator` check makes sense, but perhaps having another section, as you mentioned, would be helpful as well. Since we only need the nvidia-driver daemonset applied once, what if we just added:
```yaml
google_cloud_platform:
  gpu:
    enabled: true
  node_groups:
    ...
```
As far as I can tell, AWS needs to know the node-group name but GCP just needs the single daemonset applied.
And as a final thought, why not just apply the daemonset for all deployments?
cc @costrouc
@Adam-D-Lewis Do you know if this has been fixed?
We were able to run an A100 on Nebari by updating the NVIDIA drivers (that change was merged into Nebari).
I don't remember having the `guest_accelerator` issue they describe.