[BUG] - AWS instance type not properly respected when `gpu` is enabled
Describe the bug
Since the latest release, when the #2604 changes were integrated, a bug was introduced due to a mismatch between how we currently load our schema and perform validation versus how the stage files are rendered during deploy. Basically, that PR changed how the AMI types (`AL2_x86_64_GPU`, `AL2_x86_64`, and `CUSTOM`) are forwarded to their respective Terraform variables under `node_groups`.
Right now, when utilizing the following config block for example:
```yaml
amazon_web_services:
  ...
  node_groups:
    ...
    gpu-1x-t4:
      instance: g4dn.xlarge
      min_nodes: 0
      max_nodes: 50
      single_subnet: false
      gpu: true
profiles:
  jupyterlab:
    - display_name: G4 GPU Instance 1x
      description: 4 cpu / 16GB RAM / 1 Nvidia T4 GPU (16 GB GPU RAM)
      kubespawner_override:
        image: quay.io/nebari/nebari-jupyterlab-gpu:2024.9.1
        cpu_limit: 4
        cpu_guarantee: 3
        mem_limit: 16G
        mem_guarantee: 10G
        extra_pod_config:
          volumes:
            - name: "dshm"
              emptyDir:
                medium: "Memory"
                sizeLimit: "2Gi"
        extra_container_config:
          volumeMounts:
            - name: "dshm"
              mountPath: "/dev/shm"
        node_selector:
          "dedicated": "gpu-1x-t4"
```
The expected behavior would be for an instance with a GPU to be spawned and assigned to the user's pod. The instance is correctly scaled up, but its AMI type is wrongly defaulted to `AL2_x86_64`, which results in the incorrect AMI being assigned to the instance, so the NVIDIA drivers the daemonset expects to find are never installed.
The problem arises from this part of our code: https://github.com/nebari-dev/nebari/blob/ccb8b7eff9e77dbc7da4c106ddfb842a331a8a3b/src/_nebari/stages/infrastructure/__init__.py#L142-L172
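To illustrate the failure mode (this is a hypothetical minimal sketch, not Nebari's actual model, assuming Pydantic v2): field validators run in declaration order, so a validator that derives `ami_type` cannot see `gpu` when `ami_type` is declared first.

```python
from typing import Optional

from pydantic import BaseModel, Field, field_validator


class NodeGroup(BaseModel):
    # `validate_default=True` so the validator also runs on the None default.
    ami_type: Optional[str] = Field(default=None, validate_default=True)
    gpu: bool = False

    @field_validator("ami_type")
    @classmethod
    def _default_ami_type(cls, value, info):
        if value is not None:
            return value
        # `gpu` is declared after `ami_type`, so it has not been validated
        # yet and is missing from info.data at this point.
        return "AL2_x86_64_GPU" if info.data.get("gpu") else "AL2_x86_64"


ng = NodeGroup(gpu=True)
print(ng.ami_type)  # prints AL2_x86_64 even though gpu=True
```

This ordering dependence is why deriving the AMI type at validation time is fragile.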
I suggest that we remove the "dynamic" handling of the AMI type from the Pydantic validator and instead use a custom function to handle the logic at run time, for example:

```python
from typing import Dict, Optional


def construct_aws_ami_type(
    gpu_enabled: bool, launch_template: Optional[Dict], ami_type: Optional[str] = None
) -> str:
    """Construct the AWS AMI type based on the provided parameters."""
    if ami_type:
        return ami_type
    if launch_template and launch_template.get("ami_id"):
        return "CUSTOM"
    if gpu_enabled:
        return "AL2_x86_64_GPU"
    return "AL2_x86_64"
```
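To make the intended precedence explicit, a quick self-contained sanity check (restating the helper above, with a placeholder AMI id):

```python
from typing import Dict, Optional


def construct_aws_ami_type(
    gpu_enabled: bool, launch_template: Optional[Dict], ami_type: Optional[str] = None
) -> str:
    """Construct the AWS AMI type based on the provided parameters."""
    if ami_type:
        return ami_type
    if launch_template and launch_template.get("ami_id"):
        return "CUSTOM"
    if gpu_enabled:
        return "AL2_x86_64_GPU"
    return "AL2_x86_64"


# An explicit ami_type always wins.
assert construct_aws_ami_type(False, None, ami_type="AL2_x86_64_GPU") == "AL2_x86_64_GPU"
# A launch template with a custom AMI id takes precedence over the gpu flag.
assert construct_aws_ami_type(True, {"ami_id": "ami-0123456789"}) == "CUSTOM"
# Otherwise the gpu flag decides, which is the case this bug breaks today.
assert construct_aws_ami_type(True, None) == "AL2_x86_64_GPU"
assert construct_aws_ami_type(False, None) == "AL2_x86_64"
```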
There is also a need to change the current Enum object, as it is not properly serializable right now:

```python
class AWSAmiTypes(str, enum.Enum):
    AL2_x86_64 = "AL2_x86_64"
    AL2_x86_64_GPU = "AL2_x86_64_GPU"
    CUSTOM = "CUSTOM"
```
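For illustration, the `str` mixin is what makes the members JSON-serializable and directly comparable to plain strings, which a bare `enum.Enum` is not:

```python
import enum
import json


class AWSAmiTypes(str, enum.Enum):
    AL2_x86_64 = "AL2_x86_64"
    AL2_x86_64_GPU = "AL2_x86_64_GPU"
    CUSTOM = "CUSTOM"


# json encodes the member as its string value instead of raising TypeError.
print(json.dumps({"ami_type": AWSAmiTypes.AL2_x86_64_GPU}))
# {"ami_type": "AL2_x86_64_GPU"}

# Members compare equal to the raw strings used in the rendered stage files.
assert AWSAmiTypes.CUSTOM == "CUSTOM"
```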
Expected behavior
GPU instances should scale up correctly, with their NVIDIA drivers properly installed as well.
OS and architecture in which you are running Nebari
Linux
How to Reproduce the problem?
Run an AWS deployment that requires a GPU profile; the bug was introduced in the latest release version (2024.9.1).
Command output
No response
Versions and dependencies used.
No response
Compute environment
AWS
Integrations
No response
Anything else?
No response