terraform-aws-eks

eks_managed_group network_interfaces device_index unable to set up multiple NICs correctly with a launch template

Open flowinh2o opened this issue 2 years ago • 0 comments

Description

When setting up a managed node group with an instance type that supports multiple NICs, such as p4d.24xlarge, the launch template is configured incorrectly, and nodes are unable to start.

Versions

  • Module version: 19.15.3

  • Terraform version: 1.5.0

  • Provider version(s): Terraform v1.5.0 on darwin_arm64

  • provider registry.terraform.io/hashicorp/aws v5.3.0
  • provider registry.terraform.io/hashicorp/cloudinit v2.3.2
  • provider registry.terraform.io/hashicorp/kubernetes v2.21.1
  • provider registry.terraform.io/hashicorp/time v0.9.1
  • provider registry.terraform.io/hashicorp/tls v4.0.4

Reproduction Code

I am using the example at https://github.com/terraform-aws-modules/terraform-aws-eks/tree/v19.15.3/examples/eks_managed_node_group and have replaced all node groups with this config:

  eks_managed_node_groups = {
    gpu_a100_80g = {
      ami_type       = "AL2_x86_64_GPU"
      subnet_ids     = [module.vpc.private_subnets[0]]
      desired_size   = 0
      min_size       = 0
      max_size       = 4
      instance_types = ["p4d.24xlarge"]
      tags = {
        "eks.absci-ai.cloud/node-purpose" = "gpu_a100_80g"
      }
      labels = {
        "eks.absci-ai.cloud/node-purpose" = "gpu_a100_80g"
        "k8s.amazonaws.com/accelerator"   = "nvidia-tesla-a100"
      }
      network_interfaces = [
        {
          description                 = "EFA interface 1"
          delete_on_termination       = true
          device_index                = 0
          associate_public_ip_address = false
          interface_type              = "efa"
          efa_enabled                 = true
        },
        {
          description                 = "EFA interface 2"
          delete_on_termination       = true
          device_index                = 1
          associate_public_ip_address = false
          interface_type              = "efa"
          efa_enabled                 = true
        },
        {
          description                 = "EFA interface 3"
          delete_on_termination       = true
          device_index                = 2
          associate_public_ip_address = false
          interface_type              = "efa"
          efa_enabled                 = true
        },
        {
          description                 = "EFA interface 4"
          delete_on_termination       = true
          device_index                = 3
          associate_public_ip_address = false
          interface_type              = "efa"
          efa_enabled                 = true
        }
      ]
      pre_bootstrap_user_data = <<-EOT
        # Install EFA
        curl -O https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz
        tar -xf aws-efa-installer-latest.tar.gz && cd aws-efa-installer
        ./efa_installer.sh -y
        /opt/amazon/efa/bin/fi_info -p efa -t FI_EP_RDM > /tmp/efa_info
        # Disable ptrace
        sysctl -w kernel.yama.ptrace_scope=0
        EOT
    }
  }

Steps to reproduce the behavior:

Run the example above, then try to scale up the node group.

Expected behavior

The instance should start up.

Actual behavior

The instance fails to launch because of an incorrect NIC configuration in the launch template.

Additional context

Here is a screenshot of what the network cards look like with the incorrect index: Screenshot 2023-06-14 at 11 18 18 AM

And for reference here is what a working configuration looks like using eksctl that supports EFA and multiple NICs.

Screenshot 2023-06-14 at 11 17 57 AM
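For comparison, here is a minimal sketch of the interface layout that the working configuration appears to use. It is written against the plain `aws_launch_template` resource rather than the module. It assumes, without confirmation in this report, that the failure comes from all four interfaces being placed on network card 0 because no `network_card_index` is set. The resource name `efa_sketch` and the hard-coded count of 4 (the number of network cards on p4d.24xlarge) are illustrative.

```hcl
# Hypothetical standalone launch template, not the module's own resource.
# Assumption: each EFA interface needs its own network card on p4d.24xlarge.
resource "aws_launch_template" "efa_sketch" {
  name_prefix = "efa-sketch-"

  dynamic "network_interfaces" {
    # One interface per network card: indexes 0, 1, 2, 3.
    for_each = range(4)

    content {
      description           = "EFA interface ${network_interfaces.value + 1}"
      delete_on_termination = true
      device_index          = network_interfaces.value
      # Pin each interface to a distinct network card.
      network_card_index    = network_interfaces.value
      interface_type        = "efa"
    }
  }
}
```

If the module's `network_interfaces` input passes `network_card_index` through to the launch template, adding that key to each of the four entries in the reproduction config may have the same effect. I have not verified this against 19.15.3.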

flowinh2o, Jun 15 '23 18:06