kops Kubelet no longer serving metrics from 10255

/kind bug

1. What kops version are you running? The command kops version, will display this information. Client version: 1.25.3 (git-v1.25.3)

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag. v1.22.17

3. What cloud provider are you using? AWS

4. What commands did you run? What is the simplest way to reproduce this issue? kops update

5. What happened after the commands executed? Cluster updates, but the kubelet no longer runs metrics on port 10255

6. What did you expect to happen? kubelet to run metrics on port 10255

7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.

kind: Cluster
metadata:
  creationTimestamp: "2019-01-18T20:02:03Z"
  generation: 1
  name: mycluster.example.com
spec:
  additionalPolicies:
    ....
  api:
    loadBalancer:
      class: Classic
      idleTimeoutSeconds: 4000
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudLabels:
    Environment: Dev
    EnvironmentLevel: dev
    OS: Ubuntu
    Platform: Linux
    Project: PMI
    Type: Private
  cloudProvider: aws
  configBase: s3://kops-state-mycluster.example.com/mycluster.example.com
  containerRuntime: docker
  etcdClusters:
  - etcdMembers:
   ...
  kubeAPIServer:
    auditLogMaxAge: 10
    auditLogMaxBackups: 1
    auditLogMaxSize: 100
    auditLogPath: /var/log/kube-apiserver-audit.log
    auditPolicyFile: /srv/kubernetes/kube-apiserver/audit.yaml
    enableAdmissionPlugins:
    - NamespaceLifecycle
    - LimitRanger
    - ServiceAccount
    - PersistentVolumeLabel
    - DefaultStorageClass
    - DefaultTolerationSeconds
    - MutatingAdmissionWebhook
    - ValidatingAdmissionWebhook
    - NodeRestriction
    - ResourceQuota
    oidcClientID: dev-cluster
    oidcGroupsClaim: groups
    oidcGroupsPrefix: 'someplace:'
    oidcIssuerURL: https://login.mycluster.example.com/dex
    oidcUsernameClaim: email
  kubeDNS:
    coreDNSImage: coredns/coredns:1.6.9
    provider: CoreDNS
  kubeProxy:
    conntrackMaxPerCore: 1310720
    metricsBindAddress: 0.0.0.0
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
    resolvConf: /run/systemd/resolve/resolv.conf
  kubernetesApiAccess:
  - 24.126.31.95/32
  - 174.64.55.14/32
  - 34.230.102.25/32
  - 10.200.0.0/20
  - 27.107.30.170/32
  - 203.109.100.186/32
  - 54.164.170.205/32
  - 18.209.243.138/32
  kubernetesVersion: 1.22.17
  masterInternalName: api.internal.mycluster.example.com
  masterPublicName: api.mycluster.example.com
  networkCIDR: 10.40.16.0/20
  networking:
    cni: {}
  ...
  topology:
    bastion:
      bastionPublicName: bastion.mycluster.example.com
    dns:
      type: Public
    masters: private
    nodes: private

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2018-06-30T20:19:34Z"
  labels:
    kops.k8s.io/cluster: mycluster.example.com
    spotinst.io/hybrid: "true"
    spotinst.io/ocean-default-launchspec: "true"
    spotinst.io/restrict-scale-down: "true"
  name: app-worker-a
spec:
  cloudLabels:
    Department: TechOps
    Environment: mycluster.example.com
    EnvironmentLevel: dev
    Group: MULTI
    Service: app
    Tier: app
    k8s.io/cluster-autoscaler/mycluster.example.com: "true"
    k8s.io/cluster-autoscaler/enabled: "true"
    k8s.io/cluster-autoscaler/node-template/label/tier: app
  image: ami-09bcccc9d180cf61d
  machineType: r6a.2xlarge
  maxSize: 20
  minSize: 0
  nodeLabels:
    kops.k8s.io/instancegroup: app-worker-a
    tier: app
  role: Node
  rootVolumeEncryption: true
  subnets:
  - app-worker-a

...

8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else do we need to know?

I believe the kubelet configuration now defaults to disabling the kubelet metrics port.

The --read-only-port flag is not set. I think the kubelet changed its default behaviour at some point: the read-only port used to default to 10255 and now defaults to disabled.

This is breaking all the pod and container metrics on our cluster.

Sep 20 '23 01:09 myles-vibrent

Hi @myles-vibrent, I think https://github.com/kubernetes/kubernetes/pull/100335 is related. Did you try to set readOnlyPort: 10255 ? https://github.com/kubernetes/kops/blob/6dd35e2561140ab0ab1ad88ea719eeff34d6dc09/pkg/apis/kops/componentconfig.go#L85-L86
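For illustration, a minimal sketch of what that setting might look like in the cluster spec, assuming the field belongs under spec.kubelet alongside the kubelet options already in the manifest above. The placement is inferred from the linked componentconfig.go and has not been verified against this cluster.

```yaml
# Sketch only: merge into the existing cluster spec (for example via `kops edit cluster`).
spec:
  kubelet:
    # Existing settings from the manifest in this report, unchanged.
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
    resolvConf: /run/systemd/resolve/resolv.conf
    # Assumption: explicitly re-enables the kubelet read-only port
    # instead of relying on the default.
    readOnlyPort: 10255
```

A change like this would typically need kops update cluster --yes followed by a rolling update before the nodes pick it up. Keep in mind that the read-only port is served without authentication.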

Sep 20 '23 03:09 hakman

Yes, I can manually go to the node and set the read-only port to 10255, and it will start up.

And it's definitely related to that. Let me try adding it to the kops config. I didn't see it in the documentation, so I hadn't tried it.

Sep 20 '23 15:09 myles-vibrent

Let me know how it works 😄

Sep 21 '23 08:09 hakman

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Jan 28 '24 19:01 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

Feb 27 '24 20:02 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Mar 28 '24 21:03 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

Mar 28 '24 21:03 k8s-ci-robot