
[BUG] - AWS instance type not properly respected when `gpu` is enabled

Open · viniciusdc opened this issue on Oct 21, 2024 · 0 comments

Describe the bug

Since the latest release, when the #2604 changes were integrated, a bug was introduced because of a mismatch between how we load our schema and perform validation and how the stage files are rendered during deploy. Basically, that PR changed how the AMI types (AL2_x86_64_GPU, AL2_x86_64 and CUSTOM) are forwarded to their respective terraform variables under the node_groups.

Right now, when utilizing the following config block for example:

amazon_web_services:
  ...
  node_groups:
    ...
    gpu-1x-t4:
      instance: g4dn.xlarge
      min_nodes: 0
      max_nodes: 50
      single_subnet: false
      gpu: true
profiles:
  jupyterlab:
    - display_name: G4 GPU Instance 1x
      description: 4 cpu / 16GB RAM / 1 Nvidia T4 GPU (16 GB GPU RAM)
      kubespawner_override:
        image: quay.io/nebari/nebari-jupyterlab-gpu:2024.9.1
        cpu_limit: 4
        cpu_guarantee: 3
        mem_limit: 16G
        mem_guarantee: 10G
        extra_pod_config:
          volumes:
            - name: "dshm"
              emptyDir:
                medium: "Memory"
                sizeLimit: "2Gi"
        extra_container_config:
          volumeMounts:
            - name: "dshm"
              mountPath: "/dev/shm"
        node_selector:
          "dedicated": "gpu-1x-t4"

The expected behavior would be for a GPU-backed instance to be spawned and assigned to the user's pod. Right now, the instance is correctly scaled up, but its AMI type is wrongly defaulted to `AL2_x86_64` instead of `AL2_x86_64_GPU`, which results in the incorrect AMI being assigned to the instance, so the NVIDIA drivers that the daemon is expected to install never get installed.

The problem arises from this part of our code: https://github.com/nebari-dev/nebari/blob/ccb8b7eff9e77dbc7da4c106ddfb842a331a8a3b/src/_nebari/stages/infrastructure/__init__.py#L142-L172
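Concretely, for the gpu-1x-t4 node group above, the rendered terraform input should carry the GPU AMI type; the values below are illustrative of the mismatch, not actual stage output:

# Illustrative only: what the rendered node_group variables should contain
# vs. what they currently get
expected = {"name": "gpu-1x-t4", "gpu": True, "ami_type": "AL2_x86_64_GPU"}
actual = {"name": "gpu-1x-t4", "gpu": True, "ami_type": "AL2_x86_64"}  # wrong default, non-GPU AMI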

I suggest that we remove the "dynamic" handling of the AMI type from the Pydantic validator and instead use a custom function to handle the proper logic at render time, for example:

from typing import Dict, Optional


def construct_aws_ami_type(
    gpu_enabled: bool,
    launch_template: Optional[Dict] = None,
    ami_type: Optional[str] = None,
) -> str:
    """Construct the AWS AMI type based on the provided parameters."""
    # An explicitly configured AMI type always wins
    if ami_type:
        return ami_type

    # A launch template with a custom AMI requires the CUSTOM type
    if launch_template and launch_template.get("ami_id"):
        return "CUSTOM"

    # GPU node groups need the GPU-enabled AMI so the NVIDIA drivers are present
    if gpu_enabled:
        return "AL2_x86_64_GPU"

    return "AL2_x86_64"

There is also a need to change the current Enum object, as it is not properly serializable right now:

import enum


class AWSAmiTypes(str, enum.Enum):
    AL2_x86_64 = "AL2_x86_64"
    AL2_x86_64_GPU = "AL2_x86_64_GPU"
    CUSTOM = "CUSTOM"

Expected behavior

GPU instances should scale up properly, with their NVIDIA drivers installed as well.

OS and architecture in which you are running Nebari

Linux

How to Reproduce the problem?

Run an AWS deployment that requires a GPU profile; the bug was introduced in the latest release (2024.9.1).

Command output

No response

Versions and dependencies used.

No response

Compute environment

AWS

Integrations

No response

Anything else?

No response

viniciusdc · Oct 21 '24 12:10