
Kubelet no longer serving metrics from 10255

myles-vibrent opened this issue 2 years ago • 5 comments

/kind bug

1. What kops version are you running? The command kops version, will display this information. Client version: 1.25.3 (git-v1.25.3)

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag. v1.22.17

3. What cloud provider are you using? AWS

4. What commands did you run? What is the simplest way to reproduce this issue? kops update

5. What happened after the commands executed? The cluster updates, but the kubelet no longer serves metrics on port 10255

6. What did you expect to happen? kubelet to run metrics on port 10255

7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.

kind: Cluster
metadata:
  creationTimestamp: "2019-01-18T20:02:03Z"
  generation: 1
  name: mycluster.example.com
spec:
  additionalPolicies:
    ....
  api:
    loadBalancer:
      class: Classic
      idleTimeoutSeconds: 4000
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudLabels:
    Environment: Dev
    EnvironmentLevel: dev
    OS: Ubuntu
    Platform: Linux
    Project: PMI
    Type: Private
  cloudProvider: aws
  configBase: s3://kops-state-mycluster.example.com/mycluster.example.com
  containerRuntime: docker
  etcdClusters:
  - etcdMembers:
   ...
  kubeAPIServer:
    auditLogMaxAge: 10
    auditLogMaxBackups: 1
    auditLogMaxSize: 100
    auditLogPath: /var/log/kube-apiserver-audit.log
    auditPolicyFile: /srv/kubernetes/kube-apiserver/audit.yaml
    enableAdmissionPlugins:
    - NamespaceLifecycle
    - LimitRanger
    - ServiceAccount
    - PersistentVolumeLabel
    - DefaultStorageClass
    - DefaultTolerationSeconds
    - MutatingAdmissionWebhook
    - ValidatingAdmissionWebhook
    - NodeRestriction
    - ResourceQuota
    oidcClientID: dev-cluster
    oidcGroupsClaim: groups
    oidcGroupsPrefix: 'someplace:'
    oidcIssuerURL: https://login.mycluster.example.com/dex
    oidcUsernameClaim: email
  kubeDNS:
    coreDNSImage: coredns/coredns:1.6.9
    provider: CoreDNS
  kubeProxy:
    conntrackMaxPerCore: 1310720
    metricsBindAddress: 0.0.0.0
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
    resolvConf: /run/systemd/resolve/resolv.conf
  kubernetesApiAccess:
  - 24.126.31.95/32
  - 174.64.55.14/32
  - 34.230.102.25/32
  - 10.200.0.0/20
  - 27.107.30.170/32
  - 203.109.100.186/32
  - 54.164.170.205/32
  - 18.209.243.138/32
  kubernetesVersion: 1.22.17
  masterInternalName: api.internal.mycluster.example.com
  masterPublicName: api.mycluster.example.com
  networkCIDR: 10.40.16.0/20
  networking:
    cni: {}
  ...
  topology:
    bastion:
      bastionPublicName: bastion.mycluster.example.com
    dns:
      type: Public
    masters: private
    nodes: private

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2018-06-30T20:19:34Z"
  labels:
    kops.k8s.io/cluster: mycluster.example.com
    spotinst.io/hybrid: "true"
    spotinst.io/ocean-default-launchspec: "true"
    spotinst.io/restrict-scale-down: "true"
  name: app-worker-a
spec:
  cloudLabels:
    Department: TechOps
    Environment: mycluster.example.com
    EnvironmentLevel: dev
    Group: MULTI
    Service: app
    Tier: app
    k8s.io/cluster-autoscaler/mycluster.example.com: "true"
    k8s.io/cluster-autoscaler/enabled: "true"
    k8s.io/cluster-autoscaler/node-template/label/tier: app
  image: ami-09bcccc9d180cf61d
  machineType: r6a.2xlarge
  maxSize: 20
  minSize: 0
  nodeLabels:
    kops.k8s.io/instancegroup: app-worker-a
    tier: app
  role: Node
  rootVolumeEncryption: true
  subnets:
  - app-worker-a

...

8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else do we need to know?

I believe the kubelet configuration now defaults to disabling kubelet metrics.

The flag --read-only-port is not set. I think the kubelet's default behaviour changed at some point: the read-only port used to default to 10255 and now defaults to disabled.

This is breaking all pod and container metrics on our cluster.

myles-vibrent avatar Sep 20 '23 01:09 myles-vibrent

Hi @myles-vibrent, I think https://github.com/kubernetes/kubernetes/pull/100335 is related. Did you try to set readOnlyPort: 10255 ? https://github.com/kubernetes/kops/blob/6dd35e2561140ab0ab1ad88ea719eeff34d6dc09/pkg/apis/kops/componentconfig.go#L85-L86
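
For illustration, a minimal sketch of how that suggestion might look in the kubelet section of the cluster manifest above. The only added line is readOnlyPort; the other fields are copied from the reporter's spec. The field name comes from the linked componentconfig.go, and whether this restores metrics on this cluster is not confirmed in this thread.

```yaml
# Fragment of the Cluster spec; only readOnlyPort is new.
spec:
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
    # Assumption: re-enables the kubelet read-only port (maps to --read-only-port).
    readOnlyPort: 10255
    resolvConf: /run/systemd/resolve/resolv.conf
```

After editing the spec (for example with kops edit cluster), the change would presumably need kops update cluster --yes followed by a rolling update of the nodes before the kubelets pick it up. Note that the read-only port serves its endpoints without authentication.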

hakman avatar Sep 20 '23 03:09 hakman

Yes, I can manually go to the node and set the read-only port to 10255, and the kubelet will start serving on it.

It's definitely related to that. Let me try adding it to the kops config. I didn't see it in the documentation, so I had not tried it.

myles-vibrent avatar Sep 20 '23 15:09 myles-vibrent

Let me know how it works 😄

hakman avatar Sep 21 '23 08:09 hakman

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 28 '24 19:01 k8s-triage-robot

/lifecycle rotten

k8s-triage-robot avatar Feb 27 '24 20:02 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Mar 28 '24 21:03 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".


k8s-ci-robot avatar Mar 28 '24 21:03 k8s-ci-robot