etcd-manager fails to start after kOps upgrade 1.33.1 → 1.34.1
/kind bug
1. What kops version are you running? The command kops version will display this information.
Before upgrade:
Client version: 1.34.1
Last applied server version: 1.33.1
After failed upgrade:
Client version: 1.34.1
Last applied server version: 1.34.1
2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.
| Component | Version |
|---|---|
| Kubernetes | 1.34.2 |
| OS (Flatcar Stable) | 4230.2.4 |
| containerd | v1.7.23 |
| etcd | 3.5.21 |
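For completeness, these versions can be confirmed on a control-plane node roughly like this (a sketch; assumes shell access to the node, e.g. via SSM, and kubectl on the PATH):

```sh
kubectl version          # Kubernetes client/server versions
cat /etc/os-release      # Flatcar channel and release (VERSION= line)
containerd --version     # containerd version
runc --version           # runc version
```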
3. What cloud provider are you using?
AWS
4. What commands did you run? What is the simplest way to reproduce this issue?
kops update cluster --yes
kops rolling-update cluster --instance-group master --control-plane-interval 1s --cloudonly --yes
kops validate cluster --wait 15m
5. What happened after the commands executed?
The control plane never comes up because etcd-manager fails to start.
6. What did you expect to happen?
The control plane should come up again.
7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.
Cluster manifest
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  generation: 1
  name: dev.k8s.local
spec:
  DisableSubnetTags: true
  api:
    loadBalancer:
      class: Network
      idleTimeoutSeconds: 3600
      subnets:
      - name: dev-internal
        type: Internal
  authentication:
    aws:
      image: public.ecr.aws/eks-distro/kubernetes-sigs/aws-iam-authenticator:v0.7.7-eks-1-32-28
  authorization:
    rbac: {}
  certManager:
    enabled: true
    hostedZoneIDs:
    - <redacted>
    - <redacted>
  channel: stable
  cloudProvider: aws
  clusterAutoscaler:
    cpuRequest: 100m
    enabled: true
    expander: least-waste
    memoryRequest: 300Mi
  configBase: s3://<redacted>-dev-kops-state/dev.k8s.local
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master
      name: a
    name: main
    version: 3.5.21
  - etcdMembers:
    - instanceGroup: master
      name: a
    name: events
    version: 3.5.21
  externalPolicies:
    master:
    - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
    node:
    - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
  fileAssets:
  - content: |
      [Socket]
      ListenStream=
      ListenStream=127.0.0.1:22
      FreeBind=true
    mode: "0644"
    name: sshd-restrict
    path: /etc/systemd/system/sshd.socket.d/10-sshd-restrict.conf
  - content: |
      apiVersion: v1
      kind: Config
      clusters:
      - name: audit.dev.k8s.local
        cluster:
          server: http://:30009/k8s-audit
      contexts:
      - name: webhook
        context:
          cluster: audit.dev.k8s.local
          user: ""
      current-context: webhook
      preferences: {}
      users: []
    name: audit-webhook-config
    path: /etc/kubernetes/audit/webhook-config.yaml
    roles:
    - ControlPlane
  - content: |
      apiVersion: audit.k8s.io/v1 # This is required.
      kind: Policy
      # Don't generate audit events for all requests in RequestReceived stage.
      omitStages:
        - "RequestReceived"
      rules:
        # Log pod changes at RequestResponse level
        - level: RequestResponse
          resources:
          - group: ""
            # Resource "pods" doesn't match requests to any subresource of pods,
            # which is consistent with the RBAC policy.
            resources: ["pods", "deployments"]
        - level: RequestResponse
          resources:
          - group: "rbac.authorization.k8s.io"
            # Resource "pods" doesn't match requests to any subresource of pods,
            # which is consistent with the RBAC policy.
            resources: ["clusterroles", "clusterrolebindings"]
        # Log "pods/log", "pods/status" at Metadata level
        - level: Metadata
          resources:
          - group: ""
            resources: ["pods/log", "pods/status"]
        # Don't log requests to a configmap called "controller-leader"
        - level: None
          resources:
          - group: ""
            resources: ["configmaps"]
            resourceNames: ["controller-leader"]
        # Don't log watch requests by the "system:kube-proxy" on endpoints or services
        - level: None
          users: ["system:kube-proxy"]
          verbs: ["watch"]
          resources:
          - group: "" # core API group
            resources: ["endpoints", "services"]
        # Don't log authenticated requests to certain non-resource URL paths.
        - level: None
          userGroups: ["system:authenticated"]
          nonResourceURLs:
          - "/api*" # Wildcard matching.
          - "/version"
        # Log the request body of configmap changes in kube-system.
        - level: Request
          resources:
          - group: "" # core API group
            resources: ["configmaps"]
          # This rule only applies to resources in the "kube-system" namespace.
          # The empty string "" can be used to select non-namespaced resources.
          namespaces: ["kube-system"]
        # Log configmap changes in all other namespaces at the RequestResponse level.
        - level: RequestResponse
          resources:
          - group: "" # core API group
            resources: ["configmaps"]
        # Log secret changes in all other namespaces at the Metadata level.
        - level: Metadata
          resources:
          - group: "" # core API group
            resources: ["secrets"]
        # Log all other resources in core and extensions at the Request level.
        - level: Request
          resources:
          - group: "" # core API group
          - group: "extensions" # Version of group should NOT be included.
        # A catch-all rule to log all other requests at the Metadata level.
        - level: Metadata
          # Long-running requests like watches that fall under this rule will not
          # generate an audit event in RequestReceived.
          omitStages:
            - "RequestReceived"
    name: audit-policy-config
    path: /etc/kubernetes/audit/policy-config.yaml
    roles:
    - ControlPlane
  hooks:
  - before:
    - update-engine.service
    manifest: |
      Type=oneshot
      ExecStartPre=/usr/bin/systemctl mask --now update-engine.service
      ExecStartPre=/usr/bin/systemctl mask --now locksmithd.service
      ExecStart=/usr/bin/systemctl reset-failed update-engine.service
    name: disable-automatic-updates.service
  - manifest: |
      Type=oneshot
      # Prune all unused docker images older than 7 days
      ExecStart=/usr/bin/docker system prune -af --filter "until=168h"
    name: docker-prune.service
    requires:
    - docker.service
  - manifest: |
      [Unit]
      Description=Prune docker daily
      [Timer]
      OnCalendar=daily
      Persistent=true
      [Install]
      WantedBy=timers.target
    name: docker-prune.timer
    useRawManifest: true
  - before:
    - protokube.service
    manifest: |-
      Type=oneshot
      ExecStart=/usr/bin/systemctl restart sshd.socket
    name: sshd-socket-restart.service
  iam:
    allowContainerRegistry: true
    legacy: false
    serviceAccountExternalPermissions:
      <redacted>
    useServiceAccountExternalPermissions: true
  kubeAPIServer:
    auditPolicyFile: /etc/kubernetes/audit/policy-config.yaml
    auditWebhookBatchMaxWait: 5s
    auditWebhookConfigFile: /etc/kubernetes/audit/webhook-config.yaml
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
    hostnameOverride: '@hostname'
  kubernetesApiAccess:
  - <redacted>
  kubernetesVersion: 1.34.2
  masterPublicName: api.dev.k8s.local
  networkCIDR: <redacted>
  networkID: vpc-3d03f75b
  networking:
    cilium:
      enableL7Proxy: true
      hubble:
        enabled: true
  nonMasqueradeCIDR: 100.64.0.0/10
  podIdentityWebhook:
    enabled: true
  rollingUpdate:
    maxSurge: 2
  serviceAccountIssuerDiscovery:
    discoveryStore: <redacted>
    enableAWSOIDCProvider: true
  snapshotController:
    enabled: true
  sshKeyName: dev
  subnets:
  - egress: <redacted>
    id: <redacted>
    name: dev-internal
    type: Private
    zone: eu-west-1a
  - egress: <redacted>
    id: <redacted>
    name: dev-private
    type: Private
    zone: eu-west-1a
  - id: <redacted>
    name: dev-public
    type: Utility
    zone: eu-west-1a
  topology:
    dns:
      type: None
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2025-11-22T21:21:09Z"
  labels:
    kops.k8s.io/cluster: dev.k8s.local
  name: master
spec:
  autoscale: false
  image: ami-02d94ae5d4360b407
  instanceMetadata:
    httpPutResponseHopLimit: 1
    httpTokens: required
  machineType: t4g.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master
    tier: master
  role: Master
  rootVolumeSize: 16
  rootVolumeType: gp3
  subnets:
  - dev-internal
---
<redacted>
8. Please run the commands with the most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.
Here are the logs of kubelet and containerd on the master node (journalctl -u kubelet -u containerd > fail.log): fail.log
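For reference, the node state was gathered roughly along these lines (a sketch; crictl may need root and a configured CRI endpoint):

```sh
journalctl -u kubelet -u containerd > fail.log   # kubelet + containerd logs (fail.log above)
crictl ps -a                                     # list all containers, including exited ones
ls /etc/kubernetes/manifests/                    # static pod manifests written by nodeup
```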
I notice that the container never starts (it does not even appear in crictl ps -a). Here are the logs from when containerd tries to create it:
Nov 28 22:01:40 ip-10-1-45-152 containerd[2429]: time="2025-11-28T22:01:40.740709278Z" level=info msg="ImageCreate event name:\"registry.k8s.io/etcd-manager/etcd-manager-slim@sha256:3b8eeeef21b10f32ccd5ca58394d5f48fd3cacc40b6db7160e52fd7f0af5290d\" labels:{key:\"io.cri-containerd.image\" value:\"managed\"}"
Nov 28 22:01:40 ip-10-1-45-152 containerd[2429]: time="2025-11-28T22:01:40.789435941Z" level=info msg="stop pulling image registry.k8s.io/etcd-manager/etcd-manager-slim@sha256:3b8eeeef21b10f32ccd5ca58394d5f48fd3cacc40b6db7160e52fd7f0af5290d: active requests=0, bytes read=85783858"
Nov 28 22:01:40 ip-10-1-45-152 containerd[2429]: time="2025-11-28T22:01:40.833518870Z" level=info msg="ImageCreate event name:\"sha256:d3799cccd18379e55a8bd60d26c78eef5112fa3aba4dd4d5660dbf3b4e853e83\" labels:{key:\"io.cri-containerd.image\" value:\"managed\"}"
Nov 28 22:01:40 ip-10-1-45-152 containerd[2429]: time="2025-11-28T22:01:40.835087313Z" level=info msg="Pulled image \"registry.k8s.io/etcd-manager/etcd-manager-slim:v3.0.20250917@sha256:3b8eeeef21b10f32ccd5ca58394d5f48fd3cacc40b6db7160e52fd7f0af5290d\" with image id \"sha256:d3799cccd18379e55a8bd60d26c78eef5112fa3aba4dd4d5660dbf3b4e853e83\", repo tag \"\", repo digest \"registry.k8s.io/etcd-manager/etcd-manager-slim@sha256:3b8eeeef21b10f32ccd5ca58394d5f48fd3cacc40b6db7160e52fd7f0af5290d\", size \"85782882\" in 4.969761125s"
Nov 28 22:01:40 ip-10-1-45-152 containerd[2429]: time="2025-11-28T22:01:40.835133016Z" level=info msg="PullImage \"registry.k8s.io/etcd-manager/etcd-manager-slim:v3.0.20250917@sha256:3b8eeeef21b10f32ccd5ca58394d5f48fd3cacc40b6db7160e52fd7f0af5290d\" returns image reference \"sha256:d3799cccd18379e55a8bd60d26c78eef5112fa3aba4dd4d5660dbf3b4e853e83\""
Nov 28 22:01:40 ip-10-1-45-152 containerd[2429]: time="2025-11-28T22:01:40.836774296Z" level=info msg="PullImage \"registry.k8s.io/etcd-manager/etcd-manager-slim:v3.0.20250917@sha256:3b8eeeef21b10f32ccd5ca58394d5f48fd3cacc40b6db7160e52fd7f0af5290d\""
Nov 28 22:01:40 ip-10-1-45-152 containerd[2429]: time="2025-11-28T22:01:40.873277155Z" level=info msg="CreateContainer within sandbox \"d73a73fd99645a1880cfdf879e0d36f1059f479dc45a5afc441c5790a81ba775\" for container &ContainerMetadata{Name:etcd-manager,Attempt:0,}"
Nov 28 22:01:40 ip-10-1-45-152 containerd[2429]: time="2025-11-28T22:01:40.874584322Z" level=error msg="CreateContainer within sandbox \"d73a73fd99645a1880cfdf879e0d36f1059f479dc45a5afc441c5790a81ba775\" for &ContainerMetadata{Name:etcd-manager,Attempt:0,} failed" error="failed to generate container \"023161db502985850e15feb58f6a56eeeee2876ee57cb593eaee43d2a3583f8b\" spec: failed to generate spec: failed to mkdir \"\": mkdir : no such file or directory"
Nov 28 22:01:40 ip-10-1-45-152 kubelet[3936]: E1128 22:01:40.875156 3936 log.go:32] "CreateContainer in sandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to generate container \"023161db502985850e15feb58f6a56eeeee2876ee57cb593eaee43d2a3583f8b\" spec: failed to generate spec: failed to mkdir \"\": mkdir : no such file or directory" podSandboxID="d73a73fd99645a1880cfdf879e0d36f1059f479dc45a5afc441c5790a81ba775"
Nov 28 22:01:40 ip-10-1-45-152 kubelet[3936]: E1128 22:01:40.875294 3936 kuberuntime_manager.go:1449] "Unhandled Error" err="container etcd-manager start failed in pod etcd-manager-events-i-06f1a3baa9bed8ccd_kube-system(abe485e9c356c6883d5536e2e3788153): CreateContainerError: failed to generate container \"023161db502985850e15feb58f6a56eeeee2876ee57cb593eaee43d2a3583f8b\" spec: failed to generate spec: failed to mkdir \"\": mkdir : no such file or directory" logger="UnhandledError"
Nov 28 22:01:40 ip-10-1-45-152 kubelet[3936]: E1128 22:01:40.875335 3936 pod_workers.go:1324] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"etcd-manager\" with CreateContainerError: \"failed to generate container \\\"023161db502985850e15feb58f6a56eeeee2876ee57cb593eaee43d2a3583f8b\\\" spec: failed to generate spec: failed to mkdir \\\"\\\": mkdir : no such file or directory\"" pod="kube-system/etcd-manager-events-i-06f1a3baa9bed8ccd" podUID="abe485e9c356c6883d5536e2e3788153"
And this is the error from the first containerd error line above:
error="failed to generate container \"023161db502985850e15feb58f6a56eeeee2876ee57cb593eaee43d2a3583f8b\"
spec: failed to generate spec: failed to mkdir \"\": mkdir : no such file or directory"
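That `failed to mkdir ""` is the generic shape of an empty-path mkdir, which suggests some host path in the generated OCI spec (presumably a mount source) resolved to an empty string on this containerd version. For illustration only:

```sh
# Illustration: any empty path handed to mkdir fails with this error shape.
$ mkdir ""
mkdir: cannot create directory '': No such file or directory
```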
I've compared this with a log of a successful startup (kOps 1.33.1); that log contains no such error lines.
I should note that the same issue also appears when upgrading to 1.34.0.
Same here.
I think the issue is that Flatcar is using an old version of containerd and runc.
Is there any chance you can try the latest alpha image? It should contain containerd 2.1.5.
Sure – I've now tested the latest stable Flatcar and the latest alpha Flatcar. It works with the latest alpha. Here's a summary of the results:
| kOps | Flatcar version | containerd version | runc version | etcd-manager starts |
|---|---|---|---|---|
| 1.33.1 | Stable 4230.2.4 | 1.7.23 | 1.1.14 | ✅ |
| 1.34.1 | Stable 4230.2.4 | 1.7.23 | 1.1.14 | ❌ |
| 1.34.1 | Stable 4459.2.1 | 2.0.7 | 1.3.3 | ❌ |
| 1.34.1 | Alpha 4515.0.1 | 2.1.5 | 1.3.3 | ✅ |
I also ran crictl inspectp on the etcd-manager pod on stable 4459.2.1, where it fails to start. Not sure if it's of any value, but attaching it here: etcd-manager-pod.json.
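(For reference, a sketch of how such a dump can be produced; the pod name filter is an assumption:)

```sh
POD_ID=$(crictl pods --name etcd-manager-events -q | head -n1)  # sandbox ID of the failing pod
crictl inspectp "$POD_ID" > etcd-manager-pod.json               # dump sandbox status and config
```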
Thanks @gustav-b, I could also confirm it on our end in the periodic tests. The alpha works fine: https://testgrid.k8s.io/kops-distros#kops-aws-distro-flatcar.
The reason it fails with beta and stable is https://github.com/kubernetes/kops/pull/17539; this should be fixed once containerd 2.1+ reaches the stable release channel. I am unsure when this is planned.
Thanks!
I see, but I don't quite understand: what change made image volume support in containerd a requirement? I thought the kOps code handled both cases?
kOps code doesn't handle the case where containerd is installed as part of the distro; the expectation is that distros keep up with package updates. In this case, containerd 2.1 was released 6 months ago. The Flatcar team expects the containerd update to reach the beta channel in December.
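For anyone tracking this, a quick way to check whether a given node already meets the requirement (a sketch; /etc/flatcar/update.conf is where Flatcar records its release channel):

```sh
containerd --version            # needs to report 2.1 or newer
runc --version
cat /etc/flatcar/update.conf    # GROUP=stable|beta|alpha
```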
I've now tested with the latest beta channel release of Flatcar (4515.1.0), released December 18. It has containerd 2.1.5 and everything is working as expected. 👍
Awesome, thanks for confirming!