
etcd-manager fails to start after kOps upgrade 1.33.1 → 1.34.1

Open gustav-b opened this issue 4 weeks ago • 7 comments

/kind bug

1. What kops version are you running? The command kops version will display this information.

Before upgrade:

Client version: 1.34.1
Last applied server version: 1.33.1

After failed upgrade:

Client version: 1.34.1
Last applied server version: 1.34.1

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.

Component    Version
Kubernetes   1.34.2
OS           Flatcar Stable 4230.2.4
containerd   v1.7.23
etcd         3.5.21

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

kops update cluster --yes
kops rolling-update cluster --instance-group master --control-plane-interval 1s --cloudonly --yes
kops validate cluster --wait 15m

5. What happened after the commands executed?

The control plane never comes up because etcd-manager fails to start.

6. What did you expect to happen?

The control plane should come up again.

7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.

Cluster manifest
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  generation: 1
  name: dev.k8s.local
spec:
  DisableSubnetTags: true
  api:
    loadBalancer:
      class: Network
      idleTimeoutSeconds: 3600
      subnets:
      - name: dev-internal
      type: Internal
  authentication:
    aws:
      image: public.ecr.aws/eks-distro/kubernetes-sigs/aws-iam-authenticator:v0.7.7-eks-1-32-28
  authorization:
    rbac: {}
  certManager:
    enabled: true
    hostedZoneIDs:
    - <redacted>
    - <redacted>
  channel: stable
  cloudProvider: aws
  clusterAutoscaler:
    cpuRequest: 100m
    enabled: true
    expander: least-waste
    memoryRequest: 300Mi
  configBase: s3://<redacted>-dev-kops-state/dev.k8s.local
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master
      name: a
    name: main
    version: 3.5.21
  - etcdMembers:
    - instanceGroup: master
      name: a
    name: events
    version: 3.5.21
  externalPolicies:
    master:
    - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
    node:
    - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
  fileAssets:
  - content: |
      [Socket]
      ListenStream=
      ListenStream=127.0.0.1:22
      FreeBind=true
    mode: "0644"
    name: sshd-restrict
    path: /etc/systemd/system/sshd.socket.d/10-sshd-restrict.conf
  - content: |
      apiVersion: v1
      kind: Config
      clusters:
        - name: audit.dev.k8s.local
          cluster:
            server: http://:30009/k8s-audit
      contexts:
        - name: webhook
          context:
            cluster: audit.dev.k8s.local
            user: ""
      current-context: webhook
      preferences: {}
      users: []
    name: audit-webhook-config
    path: /etc/kubernetes/audit/webhook-config.yaml
    roles:
    - ControlPlane
  - content: |
      apiVersion: audit.k8s.io/v1 # This is required.
      kind: Policy
      # Don't generate audit events for all requests in RequestReceived stage.
      omitStages:
        - "RequestReceived"
      rules:
        # Log pod changes at RequestResponse level
        - level: RequestResponse
          resources:
          - group: ""
            # Resource "pods" doesn't match requests to any subresource of pods,
            # which is consistent with the RBAC policy.
            resources: ["pods", "deployments"]
        - level: RequestResponse
          resources:
          - group: "rbac.authorization.k8s.io"
            # Resource "pods" doesn't match requests to any subresource of pods,
            # which is consistent with the RBAC policy.
            resources: ["clusterroles", "clusterrolebindings"]
        # Log "pods/log", "pods/status" at Metadata level
        - level: Metadata
          resources:
          - group: ""
            resources: ["pods/log", "pods/status"]
        # Don't log requests to a configmap called "controller-leader"
        - level: None
          resources:
          - group: ""
            resources: ["configmaps"]
            resourceNames: ["controller-leader"]
        # Don't log watch requests by the "system:kube-proxy" on endpoints or services
        - level: None
          users: ["system:kube-proxy"]
          verbs: ["watch"]
          resources:
          - group: "" # core API group
            resources: ["endpoints", "services"]
        # Don't log authenticated requests to certain non-resource URL paths.
        - level: None
          userGroups: ["system:authenticated"]
          nonResourceURLs:
          - "/api*" # Wildcard matching.
          - "/version"
        # Log the request body of configmap changes in kube-system.
        - level: Request
          resources:
          - group: "" # core API group
            resources: ["configmaps"]
          # This rule only applies to resources in the "kube-system" namespace.
          # The empty string "" can be used to select non-namespaced resources.
          namespaces: ["kube-system"]
        # Log configmap changes in all other namespaces at the RequestResponse level.
        - level: RequestResponse
          resources:
          - group: "" # core API group
            resources: ["configmaps"]
        # Log secret changes in all other namespaces at the Metadata level.
        - level: Metadata
          resources:
          - group: "" # core API group
            resources: ["secrets"]
        # Log all other resources in core and extensions at the Request level.
        - level: Request
          resources:
          - group: "" # core API group
          - group: "extensions" # Version of group should NOT be included.
        # A catch-all rule to log all other requests at the Metadata level.
        - level: Metadata
          # Long-running requests like watches that fall under this rule will not
          # generate an audit event in RequestReceived.
          omitStages:
            - "RequestReceived"
    name: audit-policy-config
    path: /etc/kubernetes/audit/policy-config.yaml
    roles:
    - ControlPlane
  hooks:
  - before:
    - update-engine.service
    manifest: |
      Type=oneshot
      ExecStartPre=/usr/bin/systemctl mask --now update-engine.service
      ExecStartPre=/usr/bin/systemctl mask --now locksmithd.service
      ExecStart=/usr/bin/systemctl reset-failed update-engine.service
    name: disable-automatic-updates.service
  - manifest: |
      Type=oneshot
      # Prune all unused docker images older than 7 days
      ExecStart=/usr/bin/docker system prune -af --filter "until=168h"
    name: docker-prune.service
    requires:
    - docker.service
  - manifest: |
      [Unit]
      Description=Prune docker daily

      [Timer]
      OnCalendar=daily
      Persistent=true

      [Install]
      WantedBy=timers.target
    name: docker-prune.timer
    useRawManifest: true
  - before:
    - protokube.service
    manifest: |-
      Type=oneshot
      ExecStart=/usr/bin/systemctl restart sshd.socket
    name: sshd-socket-restart.service
  iam:
    allowContainerRegistry: true
    legacy: false
    serviceAccountExternalPermissions:
      <redacted>
    useServiceAccountExternalPermissions: true
  kubeAPIServer:
    auditPolicyFile: /etc/kubernetes/audit/policy-config.yaml
    auditWebhookBatchMaxWait: 5s
    auditWebhookConfigFile: /etc/kubernetes/audit/webhook-config.yaml
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
    hostnameOverride: '@hostname'
  kubernetesApiAccess:
  - <redacted>
  kubernetesVersion: 1.34.2
  masterPublicName: api.dev.k8s.local
  networkCIDR: <redacted>
  networkID: vpc-3d03f75b
  networking:
    cilium:
      enableL7Proxy: true
      hubble:
        enabled: true
  nonMasqueradeCIDR: 100.64.0.0/10
  podIdentityWebhook:
    enabled: true
  rollingUpdate:
    maxSurge: 2
  serviceAccountIssuerDiscovery:
    discoveryStore: <redacted>
    enableAWSOIDCProvider: true
  snapshotController:
    enabled: true
  sshKeyName: dev
  subnets:
  - egress: <redacted>
    id: <redacted>
    name: dev-internal
    type: Private
    zone: eu-west-1a
  - egress: <redacted>
    id: <redacted>
    name: dev-private
    type: Private
    zone: eu-west-1a
  - id: <redacted>
    name: dev-public
    type: Utility
    zone: eu-west-1a
  topology:
    dns:
      type: None

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2025-11-22T21:21:09Z"
  labels:
    kops.k8s.io/cluster: dev.k8s.local
  name: master
spec:
  autoscale: false
  image: ami-02d94ae5d4360b407
  instanceMetadata:
    httpPutResponseHopLimit: 1
    httpTokens: required
  machineType: t4g.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master
    tier: master
  role: Master
  rootVolumeSize: 16
  rootVolumeType: gp3
  subnets:
  - dev-internal

---

<redacted>

8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.

Here are the logs of kubelet and containerd on the master node (journalctl -u kubelet -u containerd > fail.log): fail.log

I notice that the container never actually starts (it does not even appear in crictl ps -a). Here are the logs from when it tries to start the container:

Nov 28 22:01:40 ip-10-1-45-152 containerd[2429]: time="2025-11-28T22:01:40.740709278Z" level=info msg="ImageCreate event name:\"registry.k8s.io/etcd-manager/etcd-manager-slim@sha256:3b8eeeef21b10f32ccd5ca58394d5f48fd3cacc40b6db7160e52fd7f0af5290d\" labels:{key:\"io.cri-containerd.image\" value:\"managed\"}"
Nov 28 22:01:40 ip-10-1-45-152 containerd[2429]: time="2025-11-28T22:01:40.789435941Z" level=info msg="stop pulling image registry.k8s.io/etcd-manager/etcd-manager-slim@sha256:3b8eeeef21b10f32ccd5ca58394d5f48fd3cacc40b6db7160e52fd7f0af5290d: active requests=0, bytes read=85783858"
Nov 28 22:01:40 ip-10-1-45-152 containerd[2429]: time="2025-11-28T22:01:40.833518870Z" level=info msg="ImageCreate event name:\"sha256:d3799cccd18379e55a8bd60d26c78eef5112fa3aba4dd4d5660dbf3b4e853e83\" labels:{key:\"io.cri-containerd.image\" value:\"managed\"}"
Nov 28 22:01:40 ip-10-1-45-152 containerd[2429]: time="2025-11-28T22:01:40.835087313Z" level=info msg="Pulled image \"registry.k8s.io/etcd-manager/etcd-manager-slim:v3.0.20250917@sha256:3b8eeeef21b10f32ccd5ca58394d5f48fd3cacc40b6db7160e52fd7f0af5290d\" with image id \"sha256:d3799cccd18379e55a8bd60d26c78eef5112fa3aba4dd4d5660dbf3b4e853e83\", repo tag \"\", repo digest \"registry.k8s.io/etcd-manager/etcd-manager-slim@sha256:3b8eeeef21b10f32ccd5ca58394d5f48fd3cacc40b6db7160e52fd7f0af5290d\", size \"85782882\" in 4.969761125s"
Nov 28 22:01:40 ip-10-1-45-152 containerd[2429]: time="2025-11-28T22:01:40.835133016Z" level=info msg="PullImage \"registry.k8s.io/etcd-manager/etcd-manager-slim:v3.0.20250917@sha256:3b8eeeef21b10f32ccd5ca58394d5f48fd3cacc40b6db7160e52fd7f0af5290d\" returns image reference \"sha256:d3799cccd18379e55a8bd60d26c78eef5112fa3aba4dd4d5660dbf3b4e853e83\""
Nov 28 22:01:40 ip-10-1-45-152 containerd[2429]: time="2025-11-28T22:01:40.836774296Z" level=info msg="PullImage \"registry.k8s.io/etcd-manager/etcd-manager-slim:v3.0.20250917@sha256:3b8eeeef21b10f32ccd5ca58394d5f48fd3cacc40b6db7160e52fd7f0af5290d\""
Nov 28 22:01:40 ip-10-1-45-152 containerd[2429]: time="2025-11-28T22:01:40.873277155Z" level=info msg="CreateContainer within sandbox \"d73a73fd99645a1880cfdf879e0d36f1059f479dc45a5afc441c5790a81ba775\" for container &ContainerMetadata{Name:etcd-manager,Attempt:0,}"
Nov 28 22:01:40 ip-10-1-45-152 containerd[2429]: time="2025-11-28T22:01:40.874584322Z" level=error msg="CreateContainer within sandbox \"d73a73fd99645a1880cfdf879e0d36f1059f479dc45a5afc441c5790a81ba775\" for &ContainerMetadata{Name:etcd-manager,Attempt:0,} failed" error="failed to generate container \"023161db502985850e15feb58f6a56eeeee2876ee57cb593eaee43d2a3583f8b\" spec: failed to generate spec: failed to mkdir \"\": mkdir : no such file or directory"
Nov 28 22:01:40 ip-10-1-45-152 kubelet[3936]: E1128 22:01:40.875156    3936 log.go:32] "CreateContainer in sandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to generate container \"023161db502985850e15feb58f6a56eeeee2876ee57cb593eaee43d2a3583f8b\" spec: failed to generate spec: failed to mkdir \"\": mkdir : no such file or directory" podSandboxID="d73a73fd99645a1880cfdf879e0d36f1059f479dc45a5afc441c5790a81ba775"
Nov 28 22:01:40 ip-10-1-45-152 kubelet[3936]: E1128 22:01:40.875294    3936 kuberuntime_manager.go:1449] "Unhandled Error" err="container etcd-manager start failed in pod etcd-manager-events-i-06f1a3baa9bed8ccd_kube-system(abe485e9c356c6883d5536e2e3788153): CreateContainerError: failed to generate container \"023161db502985850e15feb58f6a56eeeee2876ee57cb593eaee43d2a3583f8b\" spec: failed to generate spec: failed to mkdir \"\": mkdir : no such file or directory" logger="UnhandledError"
Nov 28 22:01:40 ip-10-1-45-152 kubelet[3936]: E1128 22:01:40.875335    3936 pod_workers.go:1324] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"etcd-manager\" with CreateContainerError: \"failed to generate container \\\"023161db502985850e15feb58f6a56eeeee2876ee57cb593eaee43d2a3583f8b\\\" spec: failed to generate spec: failed to mkdir \\\"\\\": mkdir : no such file or directory\"" pod="kube-system/etcd-manager-events-i-06f1a3baa9bed8ccd" podUID="abe485e9c356c6883d5536e2e3788153"

And this is the error from the first containerd error line above:

error="failed to generate container \"023161db502985850e15feb58f6a56eeeee2876ee57cb593eaee43d2a3583f8b\"
spec: failed to generate spec: failed to mkdir \"\": mkdir : no such file or directory"

I've compared this with a log of a successful startup (kOps 1.33.1), and it contains no such error lines.
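
For anyone who wants to reproduce the checks on the node, this is roughly what I ran (the sandbox ID is the one from the containerd log above):

# The pod sandbox is created, but the etcd-manager container never is:
crictl pods --name etcd-manager
crictl ps -a | grep etcd-manager    # no etcd-manager container listed

# Dump the sandbox spec for comparison with a healthy node:
crictl inspectp d73a73fd99645a1880cfdf879e0d36f1059f479dc45a5afc441c5790a81ba775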

gustav-b · Nov 28 '25 23:11

I should note that the same issue also appears when upgrading to 1.34.0.

gustav-b · Dec 01 '25 06:12

Same here.

kforsthoevel · Dec 01 '25 09:12

I think the issue is that Flatcar is using an old version of containerd and runc. Is there any chance you can try with the latest alpha image? It should contain containerd 2.1.5.
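
Roughly, something like this should do it (the AMI ID below is just a placeholder for the current Alpha release in your region):

kops edit instancegroup master    # set spec.image to the Flatcar Alpha AMI, e.g. ami-0123456789abcdef0 (placeholder)
kops update cluster --yes
kops rolling-update cluster --instance-group master --cloudonly --yes
kops validate cluster --wait 15m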

hakman · Dec 01 '25 16:12

Sure – I've now tested the latest stable Flatcar and the latest alpha Flatcar. It works with the latest alpha. Here's a summary of the results:

kOps     Flatcar version     containerd version   runc version   etcd-manager starts
1.33.1   Stable 4230.2.4     1.7.23               1.1.14         Yes
1.34.1   Stable 4230.2.4     1.7.23               1.1.14         No
1.34.1   Stable 4459.2.1     2.0.7                1.3.3          No
1.34.1   Alpha 4515.0.1      2.1.5                1.3.3          Yes

I also did a crictl inspectp of the etcd-manager pod on stable 4459.2.1 where it starts to fail. Not sure if it's of any value, but attaching it here: etcd-manager-pod.json.
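
(For reference, the versions in the table can be read directly off a node, roughly like this:)

grep -E '^(NAME|VERSION)=' /etc/os-release   # Flatcar release version
containerd --version
runc --version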

gustav-b · Dec 01 '25 22:12

Thanks @gustav-b, I could confirm it on our end as well in the periodic tests. The alpha works fine: https://testgrid.k8s.io/kops-distros#kops-aws-distro-flatcar. The reason it fails with beta and stable is https://github.com/kubernetes/kops/pull/17539, and it should be fixed once containerd 2.1+ reaches the stable release channel. I am unsure when this is planned.

hakman · Dec 02 '25 01:12

Thanks!

I see, but I don't quite understand what change made image volume support in containerd a requirement? I thought the kOps code handled both cases?

gustav-b · Dec 02 '25 10:12

The kOps code doesn't handle the case where containerd is installed as part of the distro. The expectation is that distros keep up with package updates; in this case, containerd 2.1 has been released for 6 months. The Flatcar team expects the containerd update to be part of the beta channel in December.
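
A quick way to see this on a node is to look at the static pod manifest kOps generates for etcd-manager (the exact path and file name may differ between versions):

grep -n -A6 'volumes:' /etc/kubernetes/manifests/etcd-manager-events.manifest
# an image-type volume (the containerd 2.1+ feature discussed above) appears
# under volumes: with an image: source, which older containerd cannot mount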

hakman · Dec 02 '25 10:12

I've now tested with the latest beta channel release of Flatcar (4515.1.0), released December 18. It has containerd 2.1.5 and everything is working as expected. 👍

gustav-b · Dec 23 '25 12:12

Awesome, thanks for confirming!

hakman · Dec 23 '25 13:12