enableCustomMetrics breaks cluster autoscaler on AWS
/kind bug
1. What kops version are you running? The command kops version will display this information.
1.25.3
2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.
v1.25.6
3. What cloud provider are you using?
AWS
4. What commands did you run? What is the simplest way to reproduce this issue?
Ran kops edit cluster, added enableCustomMetrics: true under spec.kubelet, then ran kops update cluster.
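Roughly, the change looked like this (a sketch only; the cluster name variable and the final rolling update are shorthand, not exact commands from my history):

kops edit cluster $CLUSTER_NAME
#   spec:
#     kubelet:
#       enableCustomMetrics: true
kops update cluster $CLUSTER_NAME --yes
# newly launched nodes pick up the extra kubelet flag from here on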
5. What happened after the commands executed?
Scale-up events were triggered, but the cluster autoscaler refused to add capacity, logging:
<pods> marked as unschedulable can be scheduled on node template-node-for-node-us-east-1a.cluster-7993510475406838098-upcoming-1. Ignoring in scale up.
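For context, that line comes from the cluster-autoscaler logs; assuming the kops addon's default deployment name in kube-system, they can be pulled with something like:

kubectl -n kube-system logs deployment/cluster-autoscaler --tail=200 | grep -i "scale up"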
6. What did you expect to happen?
I expected the cluster autoscaler to add a node. The node it claims is available, template-node-for-node-us-east-1a.cluster-7993510475406838098-upcoming-1, is not a real node; actual node names look like EC2 instance IDs, e.g. i-083ff5a9g48d940a5.
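For comparison, the nodes that actually register on this cluster use EC2 instance IDs as their names; a quick check (output shape is illustrative, not captured from the cluster):

kubectl get nodes -o name
# entries look like node/i-083ff5a9g48d940a5
# the template-node-for-...-upcoming-* name never appears here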
7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2018-05-09T22:23:26Z"
  generation: 89
  name: cluster
spec:
  additionalPolicies:
    master: |
      [
        {
          "Effect": "Allow",
          "Action": [
            "autoscaling:DescribeAutoScalingGroups",
            "autoscaling:DescribeAutoScalingInstances",
            "autoscaling:DescribeLaunchConfigurations",
            "autoscaling:SetDesiredCapacity",
            "autoscaling:TerminateInstanceInAutoScalingGroup",
            "autoscaling:DescribeTags"
          ],
          "Resource": "*"
        }
      ]
    node: |
      [
        {
          "Effect": "Allow",
          "Action": [
            "acm:ListCertificates",
            "acm:DescribeCertificate",
            "autoscaling:DescribeAutoScalingGroups",
            "autoscaling:DescribeLoadBalancerTargetGroups",
            "autoscaling:AttachLoadBalancers",
            "autoscaling:DetachLoadBalancers",
            "autoscaling:DetachLoadBalancerTargetGroups",
            "autoscaling:AttachLoadBalancerTargetGroups",
            "cloudformation:*",
            "elasticloadbalancing:*",
            "elasticloadbalancingv2:*",
            "ec2:DescribeInstances",
            "ec2:DescribeSubnets",
            "ec2:DescribeSecurityGroups",
            "ec2:DescribeRouteTables",
            "ec2:DescribeVpcs",
            "iam:GetServerCertificate",
            "iam:ListServerCertificates"
          ],
          "Resource": ["*"]
        }
      ]
  api:
    loadBalancer:
      class: Classic
      crossZoneLoadBalancing: true
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  clusterAutoscaler:
    balanceSimilarNodeGroups: true
    enabled: true
    expander: least-waste
    newPodScaleUpDelay: 0s
    scaleDownUtilizationThreshold: "0.7"
    skipNodesWithLocalStorage: false
    skipNodesWithSystemPods: true
  configBase: s3://cluster/cluster
  etcdClusters:
  - etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-us-east-1a
      kmsKeyId: <arn>
      name: a
    - encryptedVolume: true
      instanceGroup: master-us-east-1b
      kmsKeyId: <arn>
      name: b
    - encryptedVolume: true
      instanceGroup: master-us-east-1c
      kmsKeyId: <arn>
      name: c
    name: main
  - etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-us-east-1a
      kmsKeyId: <arn>
      name: a
    - encryptedVolume: true
      instanceGroup: master-us-east-1b
      kmsKeyId: <arn>
      name: b
    - encryptedVolume: true
      instanceGroup: master-us-east-1c
      kmsKeyId: <arn>
      name: c
    name: events
  hooks:
  - manifest: |
      [Unit]
      Description=install crictl cli
      Before=kubelet.service
      [Service]
      Type=oneshot
      RemainAfterExit=true
      TimeoutStopSec=120s
      ExecStart=/bin/sh -c "wget https://github.com/kubernetes-sigs/cri-tools/releases/download/v1.19.0/crictl-v1.19.0-linux-amd64.tar.gz && sudo tar zxvf crictl-v1.19.0-linux-amd64.tar.gz -C /usr/local/bin && rm -f crictl-v1.19.0-linux-amd64.tar.gz"
      [Install]
      WantedBy=multi-user.target
    name: crictl-cli.service
    roles:
    - Node
    useRawManifest: true
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    enableAdmissionPlugins:
    - PodNodeSelector
    - PodTolerationRestriction
    - <some oidc config>
  kubelet:
    enableCustomMetrics: true
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
    maxPods: 352
    readOnlyPort: 0
    registryPullQPS: 0
    serializeImagePulls: false
  kubernetesApiAccess:
    <some ips>
  kubernetesVersion: 1.25.6
  masterInternalName: api.internal.cluster
  masterPublicName: api.cluster
  metricsServer:
    enabled: true
    insecure: true
  networkCIDR: <ip>
  networking:
    calico:
      wireguardEnabled: true
  nodeTerminationHandler:
    enabled: true
  nonMasqueradeCIDR: <ip>
  sshAccess:
    <some ips>
  subnets:
    <subnets>
  topology:
    bastion:
      bastionPublicName: bastion.cluster
    dns:
      type: Public
    masters: private
    nodes: private
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2018-05-09T22:23:27Z"
  generation: 12
  labels:
    kops.k8s.io/cluster: cluster
  name: bastions
spec:
  autoscale: false
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20230112
  machineType: t2.micro
  maxSize: 1
  minSize: 1
  nodeLabels:
    ddagentapm: "no"
    kops.k8s.io/instancegroup: bastions
  role: Bastion
  subnets:
  - utility-us-east-1a
  - utility-us-east-1b
  - utility-us-east-1c
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2018-05-09T22:23:26Z"
  generation: 21
  labels:
    kops.k8s.io/cluster: cluster
  name: master-us-east-1a
spec:
  autoscale: false
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20230112
  machineType: r5a.xlarge
  maxSize: 1
  minSize: 1
  nodeLabels:
    ddagentapm: "no"
    kops.k8s.io/instancegroup: master-us-east-1a
  role: Master
  subnets:
  - us-east-1a
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2018-05-09T22:23:27Z"
  generation: 20
  labels:
    kops.k8s.io/cluster: cluster
  name: master-us-east-1b
spec:
  autoscale: false
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20230112
  machineType: r5a.xlarge
  maxSize: 1
  minSize: 1
  nodeLabels:
    ddagentapm: "no"
    kops.k8s.io/instancegroup: master-us-east-1b
  role: Master
  subnets:
  - us-east-1b
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2018-05-09T22:23:27Z"
  generation: 20
  labels:
    kops.k8s.io/cluster: cluster
  name: master-us-east-1c
spec:
  autoscale: false
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20230112
  machineType: r5a.xlarge
  maxSize: 1
  minSize: 1
  nodeLabels:
    ddagentapm: "no"
    kops.k8s.io/instancegroup: master-us-east-1c
  role: Master
  subnets:
  - us-east-1c
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2020-10-20T01:04:43Z"
  generation: 66
  labels:
    kops.k8s.io/cluster: cluster
  name: node-us-east-1a
spec:
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: ""
    k8s.io/cluster-autoscaler/cluster: ""
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20230112
  machineType: r5a.4xlarge
  maxSize: 20
  minSize: 1
  mixedInstancesPolicy:
    instances:
    - r5a.4xlarge
    - r5n.xlarge
    onDemandAboveBase: 0
    onDemandAllocationStrategy: prioritized
    onDemandBase: 2
    spotAllocationStrategy: capacity-optimized
  nodeLabels:
    ddagentapm: "no"
    kops.k8s.io/instancegroup: node-us-east-1a
  role: Node
  rootVolumeSize: 256
  subnets:
  - us-east-1a
Setting enableCustomMetrics: true stops my cluster from adding new nodes. The cause is that --enable-custom-metrics is no longer a recognized kubelet flag (it was removed after a long deprecation), yet kops still renders it onto the kubelet command line. Presumably the kubelet on newly launched instances then refuses to start, the instance never registers as a node, and the cluster autoscaler keeps counting the stuck "upcoming" node as capacity instead of scaling up further.
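A minimal way to confirm this on an instance that launches but never joins (assumes SSH access via the bastion and a systemd-managed kubelet; commands are a sketch, not captured output):

ps -ef | grep -- --enable-custom-metrics    # is the removed flag on the kubelet command line?
systemctl status kubelet                    # is the kubelet failing to start because of it?
journalctl -u kubelet --no-pager | tail -n 50

In the meantime, dropping enableCustomMetrics from spec.kubelet and rolling the affected instance groups should restore scale-up.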
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.