
Kops cluster upgrade from 1.28.7 to 1.29.2 - warm pool instances join the cluster and remain in NotReady state

Open denihot opened this issue 1 year ago • 18 comments

/kind bug

1. What kops version are you running? The command kops version will display this information.

1.29.2 (git-v1.29.2)

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.

v1.29.6

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

After editing the kops config with the new k8s version I ran the following commands:

kops get assets --copy --state $KOPS_REMOTE_STATE
kops update cluster $CLUSTER_NAME --state $KOPS_REMOTE_STATE --allow-kops-downgrade
kops update cluster $CLUSTER_NAME --yes --state $KOPS_REMOTE_STATE
kops rolling-update cluster $CLUSTER_NAME --state $KOPS_REMOTE_STATE
kops rolling-update cluster $CLUSTER_NAME --yes --state $KOPS_REMOTE_STATE --post-drain-delay 75s --drain-timeout 30m

5. What happened after the commands executed?

The upgrade started smoothly and the master nodes were updated successfully; however, an issue arose while updating the autoscaling groups that have warm pools enabled. The rolling update became stuck because warm pool instances were joining the cluster instead of simply warming up and then powering off.

The following error kept appearing in the kops rolling-update logs:

I1002 12:02:19.415658 31 instancegroups.go:565] Cluster did not pass validation, will retry in "30s": node "i-04b854ec78e845f96" of role "node" is not ready, system-node-critical pod "aws-node-4chll" is pending, system-node-critical pod "ebs-csi-node-wcz74" is pending, system-node-critical pod "efs-csi-node-7q2j8" is pending, system-node-critical pod "kube-proxy-i-04b854ec78e845f96" is pending, system-node-critical pod "node-local-dns-mdvq7" is pending.

Those nodes were displayed as 'NotReady,SchedulingDisabled' in the 'kubectl get nodes' output. I waited 10 minutes with no progress, then resorted to manually deleting the problematic nodes. That resolved the issue and allowed the cluster upgrade to resume smoothly.

After completing the upgrade, I ran another test by manually removing warmed-up instances from the AWS console. New warm pool instances were created and again joined the k8s cluster, remaining in a 'NotReady,SchedulingDisabled' state until I removed them manually.
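
For anyone hitting the same state, the manual cleanup can be scripted. This is a minimal sketch assuming the standard `kubectl get nodes` column layout; `filter_stuck_nodes` is a hypothetical helper name of mine, not part of kops:

```shell
#!/bin/bash
# Hypothetical helper (my naming, not part of kops): pick out node names
# stuck in NotReady,SchedulingDisabled from `kubectl get nodes --no-headers`
# output, where column 1 is the node name and column 2 the status.
filter_stuck_nodes() {
  awk '$2 == "NotReady,SchedulingDisabled" { print $1 }'
}

# Against a live cluster this would be (left commented out here):
# kubectl get nodes --no-headers | filter_stuck_nodes | xargs -r kubectl delete node
```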

Cluster autoscaler logs for one of those nodes:

I1002 13:02:34.149584 1 pre_filtering_processor.go:57] Node i-0cfcda3548f955e05 should not be processed by cluster autoscaler (no node group config)

And the relevant log line from the kops-controller:

E1002 13:02:10.796429 1 controller.go:329] "msg"="Reconciler error" "error"="error identifying node \"i-0cfcda3548f955e05\": found instance \"i-0cfcda3548f955e05\", but state is \"stopped\"" "Node"={"name":"i-0cfcda3548f955e05"} "controller"="node" "controllerGroup"="" "controllerKind"="Node" "name"="i-0cfcda3548f955e05" "namespace"="" "reconcileID"="b532008b-db8f-4273-90ad-f0bf9d40858c"

Also, the kube-system pods scheduled on those nodes are stuck waiting to be created for some reason:

NAMESPACE     NAME                                        READY   STATUS              RESTARTS   AGE
kube-system   aws-node-2dflq                              0/2     Init:0/1            0          52m
kube-system   aws-node-58x6z                              0/2     Init:0/1            0          46m
kube-system   aws-node-cmdrr                              0/2     Init:0/1            0          54m
kube-system   aws-node-sw7dv                              0/2     Init:0/1            0          50m
kube-system   ebs-csi-node-fbg7j                          0/3     ContainerCreating   0          50m
kube-system   ebs-csi-node-k5nx5                          0/3     ContainerCreating   0          52m
kube-system   ebs-csi-node-l82xf                          0/3     ContainerCreating   0          48m
kube-system   ebs-csi-node-qfg4w                          0/3     ContainerCreating   0          54m
kube-system   ebs-csi-node-ws7j2                          0/3     ContainerCreating   0          46m
kube-system   efs-csi-node-dwk4s                          0/3     ContainerCreating   0          46m
kube-system   efs-csi-node-g5bq8                          0/3     ContainerCreating   0          52m
kube-system   efs-csi-node-qg5qb                          0/3     ContainerCreating   0          54m
kube-system   efs-csi-node-tgcxj                          0/3     ContainerCreating   0          50m
kube-system   kube-proxy-i-0480ae46ad3230afc              0/1     Terminating         0          52m
kube-system   kube-proxy-i-04bb59a89abc8b937              0/1     Terminating         0          50m
kube-system   kube-proxy-i-0742a7e208af5b1ac              0/1     Terminating         0          46m
kube-system   kube-proxy-i-0ae3c43b10efef605              0/1     Terminating         0          54m
kube-system   node-local-dns-77r8p                        0/1     ContainerCreating   0          52m
kube-system   node-local-dns-tlcwg                        0/1     ContainerCreating   0          54m
kube-system   node-local-dns-vc4z2                        0/1     ContainerCreating   0          50m

6. What did you expect to happen? I expected the warm pool nodes to warm up and subsequently shut down without being joined to the cluster.

7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: null
  generation: 4
  name: develop.company.com
spec:
  api:
    loadBalancer:
      class: Network
      sslCertificate: arn:aws:acm:eu-west-1:1234:certificate/1111
      type: Internal
  assets:
    containerProxy: public.ecr.aws/12344
    fileRepository: https://bucket.s3.eu-west-1.amazonaws.com/
  authentication:
    aws: {}
  authorization:
    rbac: {}
  certManager:
    defaultIssuer: selfsigned
    enabled: true
  channel: stable
  cloudLabels:
    Prometheus: "true"
    aws-region: eu-west-1
  cloudProvider: aws
  configBase: s3://tf-remotestate-eu-west-1-123456/kops/develop.company.com
  dnsZone: ###
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - instanceGroup: master-eu-west-1a
      name: eu-west-1a
    - instanceGroup: master-eu-west-1b
      name: eu-west-1b
    - instanceGroup: master-eu-west-1c
      name: eu-west-1c
    manager:
      env:
      - name: ETCD_LISTEN_METRICS_URLS
        value: http://0.0.0.0:8081
      - name: ETCD_METRICS
        value: basic
    memoryRequest: 100Mi
    name: main
    version: 3.4.13
  - cpuRequest: 100m
    etcdMembers:
    - instanceGroup: master-eu-west-1a
      name: eu-west-1a
    - instanceGroup: master-eu-west-1b
      name: eu-west-1b
    - instanceGroup: master-eu-west-1c
      name: eu-west-1c
    memoryRequest: 100Mi
    name: events
    version: 3.4.13
  externalPolicies:
    master:
    - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
    node:
    - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
    - arn:aws:iam::1234:policy/nodes-extra.develop.company.com
  fileAssets:
  - content: |
      # https://raw.githubusercontent.com/kubernetes/website/master/content/en/examples/audit/audit-policy.yaml
      apiVersion: audit.k8s.io/v1 # This is required.
      kind: Policy
      # Don't generate audit events for all requests in RequestReceived stage.
      omitStages:
        - "RequestReceived"
      rules:
        # Log pod changes at RequestResponse level
        - level: RequestResponse
          resources:
          - group: ""
            # Resource "pods" doesn't match requests to any subresource of pods,
            # which is consistent with the RBAC policy.
            resources: ["pods"]
        # Log "pods/log", "pods/status" at Metadata level
        - level: Metadata
          resources:
          - group: ""
            resources: ["pods/log", "pods/status"]
        # Don't log requests to a configmap called "controller-leader"
        - level: None
          resources:
          - group: ""
            resources: ["configmaps"]
            resourceNames: ["controller-leader"]
        # Don't log watch requests by the "system:kube-proxy" on endpoints or services
        - level: None
          users: ["system:kube-proxy"]
          verbs: ["watch"]
          resources:
          - group: "" # core API group
            resources: ["endpoints", "services"]
        # Don't log authenticated requests to certain non-resource URL paths.
        - level: None
          userGroups: ["system:authenticated"]
          nonResourceURLs:
          - "/api*" # Wildcard matching.
          - "/version"
        # Log the request body of configmap changes in kube-system.
        - level: Request
          resources:
          - group: "" # core API group
            resources: ["configmaps"]
          # This rule only applies to resources in the "kube-system" namespace.
          # The empty string "" can be used to select non-namespaced resources.
          namespaces: ["kube-system"]
        # Log configmap and secret changes in all other namespaces at the Metadata level.
        - level: Metadata
          resources:
          - group: "" # core API group
            resources: ["secrets", "configmaps"]
        # Log all other resources in core and extensions at the Request level.
        - level: Request
          resources:
          - group: "" # core API group
          - group: "extensions" # Version of group should NOT be included.
        # A catch-all rule to log all other requests at the Metadata level.
        - level: Metadata
          # Long-running requests like watches that fall under this rule will not
          # generate an audit event in RequestReceived.
          omitStages:
            - "RequestReceived"
    name: kubernetes-audit.yaml
    path: /srv/kubernetes/assets/audit.yaml
    roles:
    - Master
  iam:
    allowContainerRegistry: true
    legacy: false
    serviceAccountExternalPermissions:
    - aws:
        policyARNs:
        - arn:aws:iam::1234:policy/dub-company-aws-efs-csi-driver
      name: efs-csi-controller-sa
      namespace: kube-system
    - aws:
        policyARNs:
        - arn:aws:iam::1234:policy/dub-company-aws-lb-controller
      name: aws-lb-controller-aws-load-balancer-controller
      namespace: kube-system
    - aws:
        policyARNs:
        - arn:aws:iam::1234:policy/dub-company-cluster-autoscaler
      name: cluster-autoscaler-aws-cluster-autoscaler
      namespace: kube-system
  kubeAPIServer:
    authenticationTokenWebhookConfigFile: /srv/kubernetes/aws-iam-authenticator/kubeconfig.yaml
    runtimeConfig:
      autoscaling/v2beta1: "true"
  kubeControllerManager:
    horizontalPodAutoscalerCpuInitializationPeriod: 20s
    horizontalPodAutoscalerDownscaleDelay: 5m0s
    horizontalPodAutoscalerDownscaleStabilization: 5m0s
    horizontalPodAutoscalerInitialReadinessDelay: 20s
    horizontalPodAutoscalerSyncPeriod: 5s
    horizontalPodAutoscalerTolerance: 100m
    horizontalPodAutoscalerUpscaleDelay: 3m0s
  kubeDNS:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: kops.k8s.io/instancegroup
              operator: In
              values:
              - workers-misc
    externalCoreFile: |
      amazonaws.com:53 {
            errors
            log . {
                class denial error
            }
            health :8084
            prometheus :9153
            forward . 169.254.169.253 {
            }
            cache 30
        }
        .:53 {
            errors
            health :8080
            ready :8181
            autopath @kubernetes
            kubernetes cluster.local {
                pods verified
                fallthrough in-addr.arpa ip6.arpa
            }
            prometheus :9153
            forward . 169.254.169.253
            cache 300
        }
    nodeLocalDNS:
      cpuRequest: 25m
      enabled: true
      memoryRequest: 5Mi
    provider: CoreDNS
    tolerations:
    - effect: NoSchedule
      operator: Exists
  kubeProxy:
    metricsBindAddress: 0.0.0.0
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
    maxPods: 35
    resolvConf: /etc/resolv.conf
  kubernetesApiAccess:
  - 10.0.0.0/8
  kubernetesVersion: 1.29.6
  masterPublicName: api.develop.company.com
  networkCIDR: 10.0.128.0/20
  networkID: vpc-1234
  networking:
    amazonvpc:
      env:
      - name: WARM_IP_TARGET
        value: "5"
      - name: MINIMUM_IP_TARGET
        value: "8"
      - name: DISABLE_METRICS
        value: "true"
  nonMasqueradeCIDR: 100.64.0.0/10
  podIdentityWebhook:
    enabled: true
  rollingUpdate:
    maxSurge: 100%
  serviceAccountIssuerDiscovery:
    discoveryStore: s3://infra-eu-west-1-discovery
    enableAWSOIDCProvider: true
  sshAccess:
  - 10.0.0.0/8
  subnets:
  - cidr: 10.0.128.0/22
    id: subnet-123
    name: eu-west-1a
    type: Private
    zone: eu-west-1a
  - cidr: 10.0.132.0/22
    id: subnet-123
    name: eu-west-1b
    type: Private
    zone: eu-west-1b
  - cidr: 10.0.136.0/22
    id: subnet-132
    name: eu-west-1c
    type: Private
    zone: eu-west-1c
  - cidr: 10.0.140.0/24
    id: subnet-1123
    name: utility-eu-west-1a
    type: Utility
    zone: eu-west-1a
  - cidr: 10.0.141.0/24
    id: subnet-132
    name: utility-eu-west-1b
    type: Utility
    zone: eu-west-1b
  - cidr: 10.0.142.0/24
    id: subnet-123
    name: utility-eu-west-1c
    type: Utility
    zone: eu-west-1c
  topology:
    dns:
      type: Public

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2024-10-02T10:12:50Z"
  labels:
    kops.k8s.io/cluster: develop.company.com
  name: master-eu-west-1a
spec:
  additionalSecurityGroups:
  - sg-1234
  cloudLabels:
    k8s.io/cluster-autoscaler/develop.company.com: ""
    k8s.io/cluster-autoscaler/disabled: ""
    k8s.io/cluster-autoscaler/master-template/label: ""
  image: ami-09634b5569ee59efb
  machineType: t3.large
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: masters
    kops.k8s.io/spotinstance: "false"
    on-demand: "true"
  role: Master
  rootVolumeType: gp3
  subnets:
  - eu-west-1a

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2024-10-02T10:12:50Z"
  labels:
    kops.k8s.io/cluster: develop.company.com
  name: master-eu-west-1b
spec:
  additionalSecurityGroups:
  - sg-123
  cloudLabels:
    k8s.io/cluster-autoscaler/develop.company.com: ""
    k8s.io/cluster-autoscaler/disabled: ""
    k8s.io/cluster-autoscaler/master-template/label: ""
  image: ami-09634b5569ee59efb
  machineType: t3.large
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: masters
    kops.k8s.io/spotinstance: "false"
    on-demand: "true"
  role: Master
  rootVolumeType: gp3
  subnets:
  - eu-west-1b

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2024-10-02T10:12:51Z"
  labels:
    kops.k8s.io/cluster: develop.company.com
  name: master-eu-west-1c
spec:
  additionalSecurityGroups:
  - sg-123
  cloudLabels:
    k8s.io/cluster-autoscaler/develop.company.com: ""
    k8s.io/cluster-autoscaler/disabled: ""
    k8s.io/cluster-autoscaler/master-template/label: ""
  image: ami-09634b5569ee59efb
  machineType: t3.large
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: masters
    kops.k8s.io/spotinstance: "false"
    on-demand: "true"
  role: Master
  rootVolumeType: gp3
  subnets:
  - eu-west-1c

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2024-10-02T10:12:51Z"
  generation: 2
  labels:
    kops.k8s.io/cluster: develop.company.com
  name: workers-app
spec:
  additionalSecurityGroups:
  - sg-132
  - sg-3322
  additionalUserData:
  - content: |
      #!/bin/bash
      echo "Starting additionalUserData"
      echo "This script will execute before nodeup.sh because cloud-init executes scripts in alphabetic order by name"
      export DEBIAN_FRONTEND=noninteractive
      apt-get update
      # Install some tools
      apt install -y nfs-common   # Required to make EFS volume mount
      apt install -y containerd   # Required for nerdctl to work, container not installed until nodeup runs
      echo $(containerd --version)
      wget https://github.com/containerd/nerdctl/releases/download/v1.7.2/nerdctl-1.7.2-linux-amd64.tar.gz -O /tmp/nerdctl.tar.gz
      tar -C /usr/local/bin/ -xzf /tmp/nerdctl.tar.gz
      echo $(nerdctl version)
      apt install -y awscli
      echo $(aws --version)
      # Get some temporary aws ecr credentials
      DOCKER_PASSWORD=$(aws ecr get-login-password --region eu-west-1)
      DOCKER_USER=AWS
      DOCKER_REGISTRY=1234.dkr.ecr.eu-west-1.amazonaws.com
      PASSWD=$(echo "$DOCKER_USER:$DOCKER_PASSWORD" | tr -d '\n' | base64 -i -w 0)
      CONFIG="\
        {\n
            \"auths\": {\n
                \"$DOCKER_REGISTRY\": {\n
                    \"auth\": \"$PASSWD\"\n
                }\n
            }\n
        }\n"
      mkdir -p ~/.docker
      printf "$CONFIG" > ~/.docker/config.json
      echo "Pulling images from ECR"
      nerdctl pull --namespace k8s.io 1234.dkr.ecr.eu-west-1.amazonaws.com/fluent-bit:2.2.2
      nerdctl pull --namespace k8s.io 1234.dkr.ecr.eu-west-1.amazonaws.com/nginx-prometheus-exporter:0.9.0
      nerdctl pull --namespace k8s.io public.ecr.aws/1234545/dns/k8s-dns-node-cache:1.23.0
      nerdctl pull --namespace k8s.io public.ecr.aws/1234545/amazon-k8s-cni-init:v1.18.1
      nerdctl pull --namespace k8s.io public.ecr.aws/1234545/amazon-k8s-cni:v1.18.1
      nerdctl pull --namespace k8s.io public.ecr.aws/1234545/kube-proxy:v1.28.11
      nerdctl pull --namespace k8s.io public.ecr.aws/1234545/ebs-csi-driver/aws-ebs-csi-driver:v1.30.0
      nerdctl pull --namespace k8s.io public.ecr.aws/1234545/eks-distro/kubernetes-csi/node-driver-registrar:v2.10.0-eks-1-29-5
      nerdctl pull --namespace k8s.io public.ecr.aws/1234545/kubernetes-csi/livenessprobe:v2.12.0-eks-1-29-5
      echo "Remove and unmask containerd so it can be reinstalled by nodeup and configured how it wants it."
      apt remove -y containerd
      systemctl unmask containerd
      echo "Finishing additionalUserData"
    name: all-images.sh
    type: text/x-shellscript
  cloudLabels:
    k8s.io/cluster-autoscaler/develop.company.com: ""
    k8s.io/cluster-autoscaler/enabled: ""
    k8s.io/cluster-autoscaler/node-template/label: ""
  image: ami-09634b5569ee59efb
  instanceMetadata:
    httpPutResponseHopLimit: 1
    httpTokens: required
  machineType: c5.18xlarge
  maxSize: 10
  minSize: 1
  nodeLabels:
    Environment: company-develop
    Group: company-develop-app
    Name: company-develop-infra-app
    Service: company
    kops.k8s.io/instancegroup: workers-app
    kops.k8s.io/spotinstance: "false"
    on-demand: "true"
  role: Node
  rootVolumeType: gp3
  subnets:
  - eu-west-1a
  - eu-west-1b
  - eu-west-1c
  suspendProcesses:
  - AZRebalance
  warmPool:
    enableLifecycleHook: true
    maxSize: 10
    minSize: 5

8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else do we need to know?

denihot avatar Oct 02 '24 13:10 denihot

Hi,

I attempted to troubleshoot the issue by performing the following steps:

  • Disabling warmpools and then re-enabling them, but unfortunately, the issue persisted.
  • Upgrading Kops to version 1.30.1 and k8s to version 1.30.2, yet the problem persisted.
  • Removing the additionalUserData scripts did not resolve the issue either.

denihot avatar Oct 07 '24 13:10 denihot

We have the same issue!!!

aramhakobyan avatar Oct 21 '24 14:10 aramhakobyan

Hi @hakman, @johngmyers

sorry for the direct ping, but last time you helped solve an issue quickly :).

We rely heavily on kOps (40+ clusters) and use warm pools. In the recent 1.29 releases the warm pool behavior was changed by the following PRs, which introduced the issue described here.

  • https://github.com/kubernetes/kops/pull/16603
  • https://github.com/kubernetes/kops/pull/15848

We would appreciate it if you could take a look and fix this! If there is any way we can support you in making it happen quickly, please let us know.

aramhakobyan avatar Nov 04 '24 15:11 aramhakobyan

Any update?

aramhakobyan avatar Dec 03 '24 12:12 aramhakobyan

Can you SSH into an instance that is still Warming and dump the logs from journalctl -u kops-configuration?

It could be related to https://github.com/kubernetes/kops/pull/16213 or https://github.com/kubernetes/kops/pull/16460/files#diff-0e14cc1cc6d0d21dacab069a7fe628d8c3fc3287a0fb3ad4468194d613a88a5e

rifelpet avatar Dec 04 '24 02:12 rifelpet

Hi @rifelpet,

Thank you for the reply, you can find the log file in the attachment.

Best regards, Deni kops-configuration.log

denihot avatar Dec 05 '24 11:12 denihot

Based on your logs, nodeup is definitely skipping the warm pool logic.

Just to confirm, can you run this on an instance that is still Warming and paste its output here?

TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600") 
curl -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/autoscaling/target-lifecycle-state
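
For context on why this value matters, here is a rough sketch of the branch nodeup is expected to take on it. This is my assumption about the intended behavior, not the actual kops source; `decide_boot_mode` is a hypothetical name:

```shell
#!/bin/bash
# Sketch (assumption, not kops code): branch on the ASG target lifecycle
# state reported by IMDS. Warmed:* states should take the warm-pool path;
# anything else should bootstrap normally and join the cluster.
decide_boot_mode() {
  case "$1" in
    Warmed:*) echo "warm-pool" ;;     # prefetch images, complete hook, power off
    *)        echo "join-cluster" ;;  # normal nodeup bootstrap
  esac
}

# With the IMDSv2 query above (commented out; needs EC2 metadata access):
# STATE=$(curl -H "X-aws-ec2-metadata-token: $TOKEN" \
#   http://169.254.169.254/latest/meta-data/autoscaling/target-lifecycle-state)
# decide_boot_mode "$STATE"
```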

rifelpet avatar Dec 05 '24 23:12 rifelpet

Hi @rifelpet,

Here is the output from the command that you sent:

root@ip-10-22-216-163:~# TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
root@ip-10-22-216-163:~# curl -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/autoscaling/target-lifecycle-state
Warmed:Stopped

Also I am attaching the kops-configuration logs from that machine for a reference.

It does say this at the end of the kops-configuration log:

Dec 06 13:39:21 ip-10-22-216-163 nodeup[2247]: I1206 13:39:21.834281 2247 command.go:422] Found ASG lifecycle hook
Dec 06 13:39:21 ip-10-22-216-163 nodeup[2247]: I1206 13:39:21.986849 2247 command.go:432] Lifecycle action completed

After that the machine is powered off, but it still stays in the Kubernetes cluster:

kubectl get nodes -owide | grep i-067f5984f7c86246c
i-067f5984f7c86246c   NotReady,SchedulingDisabled   node            15m     v1.30.2   10.22.216.163   <none>        Ubuntu 20.04.6 LTS   5.15.0-1068-aws   containerd://1.7.16

Best regards, Deni

kops-configuration-10-22-216-163.log

denihot avatar Dec 06 '24 13:12 denihot

I believe I know what the issue is. Can you test a kops build from this PR?

If you can run the kops CLI on linux amd64, download the kops binary from here:

https://storage.googleapis.com/k8s-staging-kops/pulls/pull-kops-e2e-k8s-aws-amazonvpc/pull-a01e7b806b94881c0300d745349d3ee3254f72b6/1.31.0-beta.2+v1.31.0-beta.1-14-gec9fc7223a/linux/amd64/kops

Otherwise you'll need to checkout the branch, run make kops and use the kops binary built in .build/dist.

Set this environment variable:

export KOPS_BASE_URL="https://storage.googleapis.com/k8s-staging-kops/pulls/pull-kops-e2e-k8s-aws-amazonvpc/pull-a01e7b806b94881c0300d745349d3ee3254f72b6/1.31.0-beta.2+v1.31.0-beta.1-14-gec9fc7223a"

Then run your normal ./kops update cluster --yes and ./kops rolling-update cluster --yes commands using the custom kops CLI build. If this fixes the issue, then we can merge and backport it for the next patch releases.

rifelpet avatar Dec 19 '24 02:12 rifelpet

@rifelpet any chance you could rebase that PR on 1.31 stable? I'll give it a try.

jValdron avatar Feb 03 '25 12:02 jValdron

@jValdron sure thing, try this: https://github.com/kubernetes/kops/pull/17249

Here's the linux amd64 binary, or build your own from source:

https://storage.googleapis.com/k8s-staging-kops/pulls/pull-kops-e2e-k8s-aws-cilium-1-31/pull-d7454eb7cf8586042e5c36c19ce0fbb6de3629da/1.31.1+v1.31.0-4-gda60162a08/linux/amd64/kops

and set this env var:

KOPS_BASE_URL="https://storage.googleapis.com/k8s-staging-kops/pulls/pull-kops-e2e-k8s-aws-cilium-1-31/pull-d7454eb7cf8586042e5c36c19ce0fbb6de3629da/1.31.1+v1.31.0-4-gda60162a08"

rifelpet avatar Feb 04 '25 14:02 rifelpet

Alright, I might not have the best environment to try this in. We use an ECR pull-through cache, which rewrites the images being pulled by kOps, so the nodeup config ends up with this for the warm pool images:

warmPoolImages:
- <Account ID>.dkr.ecr.us-east-1.amazonaws.com/k8s/cilium/cilium:v1.16.5
- <Account ID>.dkr.ecr.us-east-1.amazonaws.com/k8s/cilium/hubble-relay:v1.16.5
- <Account ID>.dkr.ecr.us-east-1.amazonaws.com/k8s/cilium/operator:v1.16.5
- <Account ID>.dkr.ecr.us-east-1.amazonaws.com/k8s/kube-proxy:v1.30.8
- <Account ID>.dkr.ecr.us-east-1.amazonaws.com/k8s/provider-aws/cloud-controller-manager:v1.30.3

However, ECR requires authentication, so kops-configuration fails with:

Feb 07 15:11:36 <instance ID> nodeup[1515]: ctr: failed to resolve reference "<Account ID>.dkr.ecr.us-east-1.amazonaws.com/k8s/cilium/hubble-relay:v1.16.5": pull access denied, repository does not exist or may require authorization: authorization failed: no basic auth credentials
Feb 07 15:11:36 <instance ID> nodeup[1515]: W0207 15:11:36.630561    1515 executor.go:141] error running task "PullImageTask/<Account ID>.dkr.ecr.us-east-1.amazonaws.com/k8s/kube-proxy:v1.30.8" (3m5s remaining to succeed): error pulling docker image with 'ctr --namespace k8s.io images pull <Account ID>.dkr.ecr.us-east-1.amazonaws.com/k8s/kube-proxy:v1.30.8': exit status 1: time="2025-02-07T15:11:36Z" level=info msg="trying next host" error="pull access denied, repository does not exist or may require authorization: authorization failed: no basic auth credentials" host=<Account ID>.dkr.ecr.us-east-1.amazonaws.com
Feb 07 15:11:36 <instance ID> nodeup[1515]: ctr: failed to resolve reference "<Account ID>.dkr.ecr.us-east-1.amazonaws.com/k8s/kube-proxy:v1.30.8": pull access denied, repository does not exist or may require authorization: authorization failed: no basic auth credentials

Seems like warmPoolImages might be a new addition? Didn't run into this issue before. Is there a way to disable that functionality?

We currently pull images in a custom user data script using something similar to:

content: |
  #!/bin/bash
  set -o errexit
  set -o nounset
  set -o pipefail
  echo "installing aws cli"
  apt install awscli -y
  echo "gathering credentials"
  PASSWORD=$(aws ecr get-login-password --region us-east-1)
  echo "pulling images"
  ctr -n k8s.io image pull <Whatever ECR image>
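
A possible variant of the script above: `ctr images pull` accepts a `--user` flag, so the ECR credential could be passed inline instead of relying on a docker config file. This is an untested sketch on my part, not a confirmed kops feature; `build_pull_cmd` is a hypothetical helper name:

```shell
#!/bin/bash
# Hypothetical variant of the workaround above: build an authenticated
# `ctr` pull command, passing ECR credentials via ctr's --user flag
# (ECR uses the fixed username "AWS" with a token as the password).
build_pull_cmd() {
  local password="$1" image="$2"
  printf 'ctr -n k8s.io images pull --user AWS:%s %s' "$password" "$image"
}

# On a node whose instance role is authorized for ECR (commented out here):
# eval "$(build_pull_cmd "$(aws ecr get-login-password --region us-east-1)" \
#   "<Whatever ECR image>")"
```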

jValdron avatar Feb 07 '25 15:02 jValdron

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar May 08 '25 15:05 k8s-triage-robot

@rifelpet - do you still need a reproduction of the issue, or do you have everything for the fix?

aramhakobyan avatar May 09 '25 09:05 aramhakobyan

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Jun 08 '25 09:06 k8s-triage-robot

@jValdron - how can we help you debug/troubleshoot the issue so you can provide a fix?

aramhakobyan avatar Jun 08 '25 20:06 aramhakobyan

@jValdron - how can we help you debug/troubleshoot the issue so you can provide a fix?

@aramhakobyan in order to test this in my clusters, I would need warmPoolImages to authenticate to ECR (probably through the IAM instance role in my case). Probably https://github.com/kubernetes/kops/issues/12916.

jValdron avatar Jun 10 '25 10:06 jValdron

@aramhakobyan yes, if you can confirm whether the custom kops build fixes the problem, that would be appreciated. Here are the updated URLs:

# kops linux/amd64 CLI
https://storage.googleapis.com/k8s-staging-kops/pulls/pull-kops-e2e-k8s-aws-amazonvpc/pull-42d12644a9c36d3fa3c2c62aaa633ba2eb6e7532/1.33.0-alpha.2+v1.33.0-alpha.1-45-g98a527d703/linux/amd64/kops

# set this for the kops commands
export KOPS_BASE_URL=https://storage.googleapis.com/k8s-staging-kops/pulls/pull-kops-e2e-k8s-aws-amazonvpc/pull-42d12644a9c36d3fa3c2c62aaa633ba2eb6e7532/1.33.0-alpha.2+v1.33.0-alpha.1-45-g98a527d703

rifelpet avatar Jun 12 '25 03:06 rifelpet

@rifelpet - thanks, we will try it within a week and get back to you!

aramhakobyan avatar Jul 11 '25 08:07 aramhakobyan

@rifelpet

We cannot create a cluster with the fixed warm pool kops build. kops validate does not pass on the control-plane nodes, and the journalctl logs complain about networking and the CSI driver.

12:26:01  VALIDATION ERRORS
12:26:01  KIND	NAME			MESSAGE
12:26:01  Node	i-031e5312a64bf4587	node "i-031e5312a64bf4587" of role "control-plane" is not ready
12:26:01  Node	i-068d438e0a4019ba8	node "i-068d438e0a4019ba8" of role "control-plane" is not ready
12:26:01  Node	i-0cdcc80aa6f48df8a	node "i-0cdcc80aa6f48df8a" of role "control-plane" is not ready
12:26:01  
12:26:01  Validation Failed
12:26:01  W0711 12:25:59.304786    2344 validate_cluster.go:238] (will retry): cluster not yet healthy

Jul 11 10:22:44 ip-10-151-29-63 kubelet[3274]: E0711 10:22:44.736898 3274 kubelet.go:2902] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"

aramhakobyan avatar Jul 11 '25 13:07 aramhakobyan

@rifelpet - could you please let us know whether you closed the issue because it was solved via #17144?

aramhakobyan avatar Jul 17 '25 12:07 aramhakobyan

Yes, I believe that fixed the issue.

rifelpet avatar Jul 18 '25 21:07 rifelpet