
Need help getting my nodes to join my cluster using kops

maxime202400 opened this issue on Jul 10, 2024 • 4 comments

I recently wanted to upgrade my cluster from version 1.28.10, and during the upgrade some of the nodes are not joining the cluster.

This is the error that I am seeing when I run kops validate:

Validating cluster

INSTANCE GROUPS
NAME               ROLE          MACHINETYPE  MIN  MAX  SUBNETS
master-us-east-2a  ControlPlane  m7a.large    1    1    us-east-2a
master-us-east-2b  ControlPlane  m7a.large    1    1    us-east-2b
master-us-east-2c  ControlPlane  m7a.large    1    1    us-east-2c
nodes              Node          m6a.large    3    18   us-east-2a,us-east-2b,us-east-2c

NODE STATUS
NAME  ROLE  READY
      node  True
      node  True
      node  True

VALIDATION ERRORS
KIND     NAME  MESSAGE
Machine        machine "" has not yet joined cluster
Machine        machine "" has not yet joined cluster
Machine        machine "" has not yet joined cluster

Validation Failed
Error: validation failed: cluster not yet healthy

And when I check the kubelet log on the problematic node, I see:

Failed to contact API server when waiting for CSINode publishing: Get "https://127.0.0.1/apis/storage.k8s.io/v1/csinodes/i-": dial tcp 127.0.0.1:443: connect: connection refused
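
For reference, this is roughly how I re-check validation and pull the kubelet logs while debugging (a minimal sketch; it assumes a kops-built Ubuntu node where kubelet runs as a systemd unit, and that kops is already pointed at the cluster):

# Re-run validation with a wait so transient startup delays are not reported as failures
kops validate cluster --wait 10m

# On the node that has not joined: kubelet logs from systemd
journalctl -u kubelet --no-pager | tail -n 100

# "connection refused" on 127.0.0.1:443 usually means nothing is serving the API
# at the address the kubelet was configured with; a quick probe:
curl -k https://127.0.0.1/healthz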

maxime202400 avatar Jul 10 '24 00:07 maxime202400

/kind support

kundan2707 avatar Jul 13 '24 18:07 kundan2707

Hi, I am facing this exact problem.

kops version:

Client version: 1.29.2 (git-v1.29.2)

k8s version:

1.24.16

The error that I am seeing in /var/log/syslog of the master node is this:

Aug  8 19:26:20 ip-a-b-c-d kubelet[3103]: I0808 19:26:20.157295    3103 csi_plugin.go:1021] Failed to contact API server when waiting for CSINode publishing: Get "https://127.0.0.1/apis/storage.k8s.io/v1/csinodes/i-xxxx": dial tcp 127.0.0.1:443: connect: connection refused
Aug  8 19:26:20 ip-a-b-c-d kubelet[3103]: E0808 19:26:20.220919    3103 kubelet.go:2427] "Error getting node" err="node \"i-xxxx\" not found"
Aug  8 19:26:20 ip-a-b-c-d kubelet[3103]: I0808 19:26:20.228123    3103 kubelet_node_status.go:352] "Setting node annotation to enable volume controller attach/detach"
Aug  8 19:26:20 ip-a-b-c-d kubelet[3103]: I0808 19:26:20.234155    3103 kubelet_node_status.go:563] "Recording event message for node" node="i-xxxx" event="NodeHasSufficientMemory"
Aug  8 19:26:20 ip-a-b-c-d kubelet[3103]: I0808 19:26:20.234212    3103 kubelet_node_status.go:563] "Recording event message for node" node="i-xxxx" event="NodeHasNoDiskPressure"
Aug  8 19:26:20 ip-a-b-c-d kubelet[3103]: I0808 19:26:20.234233    3103 kubelet_node_status.go:563] "Recording event message for node" node="i-xxxx" event="NodeHasSufficientPID"
Aug  8 19:26:20 ip-a-b-c-d kubelet[3103]: I0808 19:26:20.234266    3103 kubelet_node_status.go:70] "Attempting to register node" node="i-xxxx"
Aug  8 19:26:20 ip-a-b-c-d kubelet[3103]: E0808 19:26:20.235047    3103 kubelet_node_status.go:92] "Unable to register node with API server" err="Post \"https://127.0.0.1/api/v1/nodes\": dial tcp 127.0.0.1:443: connect: connection refused" node="i-xxxx"
Aug  8 19:26:20 ip-a-b-c-d kubelet[3103]: E0808 19:26:20.321805    3103 kubelet.go:2427] "Error getting node" err="node \"i-xxxx\" not found"
Aug  8 19:26:20 ip-a-b-c-d kubelet[3103]: E0808 19:26:20.422894    3103 kubelet.go:2427] "Error getting node" err="node \"i-xxxx\" not found"
Aug  8 19:26:20 ip-a-b-c-d kubelet[3103]: E0808 19:26:20.523918    3103 kubelet.go:2427] "Error getting node" err="node \"i-xxxx\" not found"
Aug  8 19:26:20 ip-a-b-c-d kubelet[3103]: E0808 19:26:20.624965    3103 kubelet.go:2427] "Error getting node" err="node \"i-xxxx\" not found"
Aug  8 19:26:20 ip-a-b-c-d kubelet[3103]: E0808 19:26:20.725940    3103 kubelet.go:2427] "Error getting node" err="node \"i-xxxx\" not found"
Aug  8 19:26:20 ip-a-b-c-d kubelet[3103]: E0808 19:26:20.826678    3103 kubelet.go:2427] "Error getting node" err="node \"i-xxxx\" not found"
Aug  8 19:26:20 ip-a-b-c-d kubelet[3103]: E0808 19:26:20.927813    3103 kubelet.go:2427] "Error getting node" err="node \"i-xxxx\" not found"
Aug  8 19:26:21 ip-a-b-c-d kubelet[3103]: E0808 19:26:21.028995    3103 kubelet.go:2427] "Error getting node" err="node \"i-xxxx\" not found"

I am using kops version 1.29.2 because I need to use the wildcard namespace feature for IRSA.

The cluster spec is here:

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: null
  generation: 1
  name: k8s-124.foo.bar.com
spec:
  additionalPolicies:
    master: |
      [
        {
          "Effect": "Allow",
          "Action": ["ec2:ModifyInstanceAttribute"],
          "Resource": ["*"]
        }
      ]
  api:
    loadBalancer:
      class: Network
      type: Public
  authorization:
    rbac: {}
  certManager:
    enabled: true
  channel: stable
  cloudLabels:
    App: k8s-124
    Env: foo
    Region: eu-west-1
  cloudProvider: aws
  clusterAutoscaler:
    awsUseStaticInstanceList: false
    balanceSimilarNodeGroups: false
    cpuRequest: 100m
    enabled: true
    expander: least-waste
    memoryRequest: 300Mi
    newPodScaleUpDelay: 0s
    scaleDownDelayAfterAdd: 10m0s
    scaleDownUnneededTime: 5m0s
    scaleDownUnreadyTime: 10m0s
    scaleDownUtilizationThreshold: "0.6"
    skipNodesWithLocalStorage: true
    skipNodesWithSystemPods: true
  configBase: s3://my-bucket/prefix
  dnsZone: xxxx
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-eu-west-1a
      name: a
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-eu-west-1a
      name: a
    memoryRequest: 100Mi
    name: events
  externalPolicies:
    master:
    - arn:aws:iam::aws:policy/AmazonEC2ReadOnlyAccess
    node:
    - arn:aws:iam::aws:policy/AmazonEC2ReadOnlyAccess
  fileAssets:
  - content: |
      apiVersion: audit.k8s.io/v1
      kind: Policy
      rules:
      - level: Metadata
    name: audit-policy-config
    path: /srv/kubernetes/kube-apiserver/audit/policy-config.yaml
    roles:
    - Master
  - content: |
      apiVersion: v1
      kind: Config
      clusters:
      - name: bar
        cluster:
          server: https://audit-logs-receiver-endpoint/some-token
      contexts:
      - context:
          cluster: bar
          user: ""
        name: default-context
      current-context: default-context
      preferences: {}
      users: []
    name: audit-webhook-config
    path: /var/log/audit/webhook-config.yaml
    roles:
    - Master
  iam:
    allowContainerRegistry: true
    legacy: false
    serviceAccountExternalPermissions:
    - aws:
        inlinePolicy: |-
          [
            {
              "Effect": "Allow",
              "Action": [
                "S3:*"
              ],
              "Resource": [
                "*"
              ]
            }
          ]
      name: s3perm
      namespace: '*'
  kubeAPIServer:
    auditLogMaxAge: 10
    auditLogMaxBackups: 1
    auditLogMaxSize: 100
    auditLogPath: /var/log/kube-apiserver-audit.log
    auditPolicyFile: /srv/kubernetes/kube-apiserver/audit/policy-config.yaml
    auditWebhookBatchMaxWait: 5s
    auditWebhookConfigFile: /srv/kubernetes/kube-apiserver/audit/webhook-config.yaml
  kubeDNS:
    provider: CoreDNS
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
    maxPods: 150
    shutdownGracePeriod: 1m0s
    shutdownGracePeriodCriticalPods: 30s
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.24.16
  masterPublicName: api.k8s-124.foo.bar.com
  networkCIDR: 10.8.0.0/16
  networkID: vpc-xxxx
  networking:
    cilium:
      hubble:
        enabled: true
  nonMasqueradeCIDR: 100.64.0.0/10
  podIdentityWebhook:
    enabled: true
  rollingUpdate:
    maxSurge: 4
  serviceAccountIssuerDiscovery:
    discoveryStore: s3://oidc-bucket/k8s-1-24-2
    enableAWSOIDCProvider: true
  sshAccess:
  - 0.0.0.0/0
  sshKeyName: kops
  subnets:
  - cidr: 1.2.3.4/19
    id: subnet-xx
    name: eu-west-1a
    type: Private
    zone: eu-west-1a
  - cidr: 4.3.2.1/22
    id: subnet-yy
    name: utility-eu-west-1a
    type: Utility
    zone: eu-west-1a
  topology:
    dns:
      type: Private

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2024-08-08T19:50:22Z"
  labels:
    kops.k8s.io/cluster: k8s-124.foo.bar.com
  name: master-eu-west-1a
spec:
  image: ubuntu/ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20240411
  instanceMetadata:
    httpPutResponseHopLimit: 2
    httpTokens: required
  machineType: t3a.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-eu-west-1a
  role: Master
  rootVolumeEncryption: true
  rootVolumeSize: 30
  subnets:
  - eu-west-1a

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2024-08-08T19:50:23Z"
  labels:
    kops.k8s.io/cluster: k8s-124.foo.bar.com
  name: nodes-eu-west-1a
spec:
  additionalUserData:
  - content: |
      apt-get update
      apt-get install -y qemu-user-static
    name: 0prereqs.sh
    type: text/x-shellscript
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: ""
    k8s.io/cluster-autoscaler/k8s-124.foo.bar.com: ""
  image: ubuntu/ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20240411
  instanceMetadata:
    httpPutResponseHopLimit: 2
    httpTokens: required
  machineType: t3a.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-eu-west-1a
  role: Node
  rootVolumeEncryption: true
  rootVolumeSize: 200
  subnets:
  - eu-west-1a

SohamChakraborty avatar Aug 08 '24 19:08 SohamChakraborty

@SohamChakraborty could you check the kube-apiserver.log file for hints on the issue?
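
On a kops control-plane node that file is usually at /var/log/kube-apiserver.log, since the API server runs there as a static pod. Roughly (the crictl commands assume the default containerd runtime; the container id is a placeholder):

# On the control-plane node
sudo tail -n 200 /var/log/kube-apiserver.log

# Or inspect the container directly
sudo crictl ps -a | grep kube-apiserver
sudo crictl logs <container-id>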

hakman avatar Aug 09 '24 16:08 hakman

Hi @hakman, I have identified my issue. It was a problem with the audit policy and audit webhook config files.
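
For anyone landing here later: it is worth double-checking that the paths under fileAssets match what kubeAPIServer is told to load. In the spec I posted above, the webhook config fileAsset is written to /var/log/audit/webhook-config.yaml, while auditWebhookConfigFile points at /srv/kubernetes/kube-apiserver/audit/webhook-config.yaml. A quick sanity check on the control-plane node (assuming those default paths):

# Confirm the files kube-apiserver is configured to read exist where it expects them
ls -l /srv/kubernetes/kube-apiserver/audit/policy-config.yaml
ls -l /srv/kubernetes/kube-apiserver/audit/webhook-config.yaml

# If either file is missing, kube-apiserver may fail to start, which shows up
# on the kubelet side as "connection refused" on 127.0.0.1:443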

SohamChakraborty avatar Aug 19 '24 17:08 SohamChakraborty

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Nov 17 '24 17:11 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Dec 17 '24 18:12 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Jan 16 '25 18:01 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Jan 16 '25 18:01 k8s-ci-robot