extensions Gvisor pod cannot be terminated properly

The Gvisor test pod used in the Talos e2e-extensions test never terminates successfully. This causes the reboot/shutdown sequence to hang and eventually time out, and the kubelet reports a "failed to delete pod sandbox" error. The Gvisor test is going to be disabled until this is addressed.

Jun 18 '24 13:06 frezbo

Ref: https://github.com/siderolabs/talos/pull/8905

Jun 18 '24 13:06 frezbo

Upstream issue: https://github.com/google/gvisor/issues/9834#issuecomment-2186806979

Jun 24 '24 15:06 frezbo

Just hit this after upgrading to Talos 1.8.0

Sep 28 '24 15:09 SISheogorath

Also have been experiencing this

Oct 13 '24 20:10 BobyMCbobs

Gvisor is still broken with talos main

 Warning  FailedKillPod           17s    kubelet            error killing pod: failed to "KillPodSandbox" for "01ee1caf-9da0-40af-a663-5408d37d8a0e" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"

Oct 16 '24 07:10 frezbo

Can you try with gvisor-debug and debug containerd logs so that we can capture more?

Oct 16 '24 09:10 smira

Seems when adding gvisor debug it's still using the runsc.toml from gvisor instead of gvisor-debug:

❯ talosctl -n 10.5.0.3 read /etc/cri/conf.d/gvisor-debug.part
[debug]
  level = "debug"
[plugins."io.containerd.runtime.v1.linux"]
  shim_debug = true

❯ talosctl -n 10.5.0.3 read /etc/cri/conf.d/runsc.toml 
[runsc_config]

❯ talosctl -n 10.5.0.3 get extensions
WARNING: 10.5.0.3: server version 1.8.0-alpha.2-70-ga9bff3a1d-dirty is older than client version 1.8.1
NODE       NAMESPACE   TYPE              ID   VERSION   NAME           VERSION
10.5.0.3   runtime     ExtensionStatus   0    1         gvisor-debug   v1.0.0
10.5.0.3   runtime     ExtensionStatus   1    1         gvisor         20240826.0

Oct 16 '24 10:10 frezbo

not sure why

Oct 16 '24 10:10 frezbo

I wonder if that's the order of extensions?

Oct 16 '24 10:10 smira

I think we should integrate gvisor debug with the general gvisor extension and just add them as additional runtimes.

They would remain unusable unless someone configures a RuntimeClass for debugging, and this would help reduce the overhead we see here right now.
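
For illustration only, a minimal sketch of what opting in to such an additional runtime could look like. The `gvisor-debug` class name, the `runsc-debug` handler, and the pod below are hypothetical; the handler would need a matching runtime entry in the containerd CRI configuration shipped by the extension, which does not exist today.

```yaml
# Hypothetical RuntimeClass: nothing uses the debug runtime unless a pod asks for it.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor-debug        # assumed name, not part of the current extension
handler: runsc-debug        # assumed containerd runtime handler name
---
# A pod opts in explicitly through runtimeClassName.
apiVersion: v1
kind: Pod
metadata:
  name: gvisor-debug-test   # hypothetical test pod
spec:
  runtimeClassName: gvisor-debug
  containers:
    - name: app
      image: nginx          # placeholder image
```

Pods that do not set `runtimeClassName` would keep using the default runtime, so the debug variant would add no overhead unless requested.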

Oct 16 '24 10:10 SISheogorath

Attaching support zip and runsc logs: support.zip, runsc.tar.gz

Oct 16 '24 15:10 frezbo

I don't see any errors in the logs you posted so far.

Oct 17 '24 13:10 smira

yeh, that's the thing, it's just the pod fails to terminate

Oct 17 '24 15:10 frezbo

I'm quite sure it's a containerd vs gvisor-shim problem.

Given how many breaking changes containerd v2 introduced in that space:

https://github.com/containerd/containerd/blob/main/docs/containerd-2.0.md#whats-breaking

~~It's probably broken from here: https://github.com/google/gvisor/blob/abe38d82ac3634264608259d1c60003cdd53658a/shim/cli/cli.go#L27~~

~~As it's called out in containerd v2 as removed here: https://github.com/containerd/containerd/blob/main/docs/containerd-2.0.md#iocontainerdruntimev1linux-and-iocontainerdruncv1-have-been-removed~~

Oct 17 '24 17:10 SISheogorath

> I'm quite sure it's a containerd vs gvisor-shim problem.
>
> Given how many breaking changes containerd v2 introduced in that space:
>
> https://github.com/containerd/containerd/blob/main/docs/containerd-2.0.md#whats-breaking
>
> It's probably broken from here: https://github.com/google/gvisor/blob/abe38d82ac3634264608259d1c60003cdd53658a/shim/cli/cli.go#L27
>
> As it's called out in containerd v2 as removed here: https://github.com/containerd/containerd/blob/main/docs/containerd-2.0.md#iocontainerdruntimev1linux-and-iocontainerdruncv1-have-been-removed

would you like to create an upstream issue then?

Oct 18 '24 04:10 frezbo

I think containerd removed its own runc.v1 shim, which is totally unrelated to gvisor, but there might still be some issue, of course.

Oct 18 '24 11:10 smira

containerd issue: https://github.com/containerd/containerd/issues/10891

Oct 24 '24 14:10 frezbo

New gvisor issue here: https://github.com/google/gvisor/issues/11308

Dec 18 '24 14:12 frezbo

> I'm quite sure it's a containerd vs gvisor-shim problem. Given how many breaking changes containerd v2 introduced in that space:

@SISheogorath @smira I saw that containerd v2.0.1 was released just 5 days back: https://github.com/containerd/containerd/releases/tag/v2.0.1.

Have you been on containerd v2 since before that? From your investigation in https://github.com/google/gvisor/issues/11308#issuecomment-2552449875, your intuition does feel correct. Something at the shim level is misbehaving (i.e. the shim is not being invoked the way it's expecting to be).

Dec 18 '24 23:12 ayushr2

@ayushr2 Yes, containerd v2 has been used since Talos Linux v1.8.0, initially as release candidates; with v1.8.3, containerd v2 became stable.

Dec 19 '24 03:12 SISheogorath

until a solution is available, is there a workaround to clean up those resources on talos's end?

Jan 03 '25 11:01 xyhhx

> until a solution is available, is there a workaround to clean up those resources on talos's end?

Not really. Triggering a reboot would clean them up, as Talos will forcefully remove the pods. Support for containerd v2 is coming from the gvisor side.

Jan 03 '25 12:01 frezbo

sadge

Jan 03 '25 16:01 xyhhx

looks like this is the issue to watch re: gvisor supporting containerd v2: https://github.com/google/gvisor/issues/11319

Jan 10 '25 02:01 xyhhx

The issue was fixed a long time ago via https://github.com/siderolabs/talos/issues/10681

Jul 29 '25 18:07 smira
