
Restarting dbus results in kubelet being unable to start pods

Open cbgbt opened this issue 2 years ago • 1 comments

Image I'm using: Any Kubernetes variant

What I expected to happen: If the dbus service is restarted, scheduling pods with Kubernetes should not be impacted.

What actually happened: Scheduling a pod does not complete after dbus has been restarted unless the kubelet is also restarted:

unable to ensure pod container exists: failed to create container for [kubepods besteffort ...] : dbus: connection closed by user

How to reproduce the problem:

  • Run systemctl restart dbus as an admin on a node
  • Attempt to schedule pods to this node

Additional Details: This is fixed in runc in https://github.com/opencontainers/runc/pull/3475 and backported to 1.1.x in https://github.com/opencontainers/runc/pull/3476. However, to use this fix, our Kubernetes packaging needs to stop using its vendored libct (libcontainer) and instead get a cached version from our build of runc.
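To illustrate the failure mode, here is a minimal sketch of the general "reconnect and retry" pattern for a cached connection that goes stale when the bus restarts. It is not the runc or kubelet code. FakeBus, FakeConnection, ConnManager, and call_with_retry are hypothetical names invented for this example, and the bus is simulated in memory, so no real dbus calls are made.

```python
class ConnectionClosedError(Exception):
    """Stand-in for the 'dbus: connection closed by user' error."""


class FakeBus:
    """Hypothetical in-memory bus; restart() invalidates existing connections."""

    def __init__(self):
        self.generation = 0

    def restart(self):
        # Simulates 'systemctl restart dbus': older connections become stale.
        self.generation += 1

    def connect(self):
        return FakeConnection(self, self.generation)


class FakeConnection:
    """Connection that only works while the bus has not been restarted."""

    def __init__(self, bus, generation):
        self.bus = bus
        self.generation = generation

    def call(self, request):
        if self.generation != self.bus.generation:
            raise ConnectionClosedError("dbus: connection closed by user")
        return f"ok: {request}"


class ConnManager:
    """Caches one connection and replaces it when it is found closed."""

    def __init__(self, bus):
        self.bus = bus
        self.conn = None

    def call_with_retry(self, request, max_attempts=2):
        last_error = None
        for _ in range(max_attempts):
            if self.conn is None:
                self.conn = self.bus.connect()
            try:
                return self.conn.call(request)
            except ConnectionClosedError as err:
                # Drop the stale connection so the next attempt reconnects.
                last_error = err
                self.conn = None
        raise last_error


if __name__ == "__main__":
    bus = FakeBus()
    manager = ConnManager(bus)
    print(manager.call_with_retry("create cgroup"))
    bus.restart()
    # Without the retry, this call would fail on the stale connection.
    print(manager.call_with_retry("create cgroup"))
```

A client that keeps using the first connection after the restart fails on every call, which matches the symptom above, where only restarting the kubelet clears the error.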

cbgbt avatar Jun 02 '22 19:06 cbgbt

Should be fixed upstream by https://github.com/kubernetes/kubernetes/pull/110496

kolyshkin avatar Jun 09 '22 23:06 kolyshkin

I tested this with k8s 1.23 on Bottlerocket 1.12, and the problem persists:

Jan 30 22:20:32 ip-192-168-63-75.us-west-2.compute.internal kubelet[2172]: E0130 22:20:32.587216    2172 qos_container_manager_linux.go:375] "Failed to update QoS cgroup configuration" err="dbus: connection closed by user"

arnaldo2792 avatar Jan 30 '23 22:01 arnaldo2792

The problem still persists on all k8s variants except 1.25, because upstream Kubernetes only bumped runc to the latest version in 1.25. The other k8s versions remain on an older runc, which causes this problem.

gthao313 avatar Jan 30 '23 22:01 gthao313

I think that rather than rebasing the vendored runc fix onto previous versions, we can close this issue, since the fix should be available in Kubernetes 1.25 variants and beyond.

cbgbt avatar Feb 07 '23 19:02 cbgbt