extensions
Gvisor pod cannot be terminated properly
The gVisor test pod used in the Talos e2e-extensions test never terminates successfully. This causes the reboot/shutdown sequence to hang and eventually time out, and the kubelet shows a "failed to delete pod sandbox" error. The gVisor test is going to be disabled until this is addressed.
Ref: https://github.com/siderolabs/talos/pull/8905
Upstream issue: https://github.com/google/gvisor/issues/9834#issuecomment-2186806979
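For context, a pod along these lines reproduces the hang on an affected node (a rough sketch, not the actual e2e test pod; it assumes a RuntimeClass named gvisor backed by the runsc handler already exists):
❯ kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gvisor-repro
spec:
  runtimeClassName: gvisor
  containers:
    - name: nginx
      image: nginx
EOF
❯ kubectl delete pod gvisor-repro
# deletion never completes; the kubelet eventually reports "failed to delete pod sandbox"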
Just hit this after upgrading to Talos 1.8.0
Also have been experiencing this
Gvisor is still broken with talos main
Warning FailedKillPod 17s kubelet error killing pod: failed to "KillPodSandbox" for "01ee1caf-9da0-40af-a663-5408d37d8a0e" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Can you try with gvisor-debug and debug containerd logs so that we can capture more?
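Roughly something like this (from memory, so treat it as a sketch): add the gvisor-debug extension to the node, re-run the test, then grab the CRI containerd logs and a support bundle:
❯ talosctl -n <node> logs cri      # logs from the CRI containerd instance (debug level comes from gvisor-debug)
❯ talosctl -n <node> support       # bundles node logs and resources into support.zip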
It seems that when adding gvisor-debug, it still uses the runsc.toml from gvisor instead of gvisor-debug:
❯ talosctl -n 10.5.0.3 read /etc/cri/conf.d/gvisor-debug.part
[debug]
level = "debug"
[plugins."io.containerd.runtime.v1.linux"]
shim_debug = true
❯ talosctl -n 10.5.0.3 read /etc/cri/conf.d/runsc.toml
[runsc_config]
❯ talosctl -n 10.5.0.3 get extensions
WARNING: 10.5.0.3: server version 1.8.0-alpha.2-70-ga9bff3a1d-dirty is older than client version 1.8.1
NODE NAMESPACE TYPE ID VERSION NAME VERSION
10.5.0.3 runtime ExtensionStatus 0 1 gvisor-debug v1.0.0
10.5.0.3 runtime ExtensionStatus 1 1 gvisor 20240826.0
not sure why
I wonder if that's the order of extensions?
I think we should integrate gvisor-debug into the general gvisor extension and just add the debug variants as additional runtimes.
They would remain unused unless someone configures a RuntimeClass for debugging, and that would help reduce the overhead we are seeing here right now (see the sketch below).
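Roughly something like this on the Kubernetes side (the runsc-debug handler name is just an assumption for how the extra runtime would be registered; only pods that opt into gvisor-debug would use it):
❯ kubectl apply -f - <<EOF
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor-debug
handler: runsc-debug
EOF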
Attaching the support zip and runsc logs: support.zip, runsc.tar.gz
I don't see any errors in the logs you posted so far.
Yeah, that's the thing: the pod just fails to terminate.
I'm quite sure it's a containerd vs gvisor-shim problem.
Given how many breaking changes containerd v2 introduced in that space:
https://github.com/containerd/containerd/blob/main/docs/containerd-2.0.md#whats-breaking
~~It's probably broken from here: https://github.com/google/gvisor/blob/abe38d82ac3634264608259d1c60003cdd53658a/shim/cli/cli.go#L27~~
~~As it's called out in containerd v2 as removed here: https://github.com/containerd/containerd/blob/main/docs/containerd-2.0.md#iocontainerdruntimev1linux-and-iocontainerdruncv1-have-been-removed~~
would you like to create an upstream issue then?
I think containerd removed its own runc.v1 shim, which is totally unrelated to gvisor, but there might still be some issue of course.
containerd issue: https://github.com/containerd/containerd/issues/10891
New gvisor issue here: https://github.com/google/gvisor/issues/11308
> I'm quite sure it's a containerd vs gvisor-shim problem. Given how many breaking changes containerd v2 introduced in that space:
@SISheogorath @smira I saw that containerd v2.0.1 was released just 5 days back: https://github.com/containerd/containerd/releases/tag/v2.0.1.
Have you been on containerd v2 from before that? From your investigation in https://github.com/google/gvisor/issues/11308#issuecomment-2552449875, your intuition does feel correct. Something at the shim level is misbehaving (i.e. the shim is not being invoked the way it expects to be).
@ayushr2 Yes, containerd v2 has been used since Talos Linux v1.8.0, first as RCs; with v1.8.3, containerd v2 became stable.
until a solution is available, is there a workaround to clean up those resources on talos's end?
> until a solution is available, is there a workaround to clean up those resources on talos's end?
Not really; triggering a reboot would clean them up, as Talos will forcefully remove the pods. Support for containerd v2 is coming from the gvisor side, though.
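i.e. something along the lines of:
❯ talosctl -n <node> reboot   # Talos force-removes the stuck pods as part of the shutdown sequence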
sadge
looks like this is the issue to watch re: gvisor supporting containerd v2: https://github.com/google/gvisor/issues/11319
The issue was fixed a long time ago via https://github.com/siderolabs/talos/issues/10681