Gvisor pod cannot be terminated properly

Open frezbo opened this issue 1 year ago • 24 comments

The Gvisor test pod used in the talos e2e-extensions test never terminates successfully. This causes the reboot/shutdown sequence to hang and eventually time out, and the kubelet shows a "failed to delete pod sandbox" error. The Gvisor test is going to be disabled until this is addressed.

frezbo avatar Jun 18 '24 13:06 frezbo

Ref: https://github.com/siderolabs/talos/pull/8905

frezbo avatar Jun 18 '24 13:06 frezbo

Upstream issue: https://github.com/google/gvisor/issues/9834#issuecomment-2186806979

frezbo avatar Jun 24 '24 15:06 frezbo

Just hit this after upgrading to Talos 1.8.0

SISheogorath avatar Sep 28 '24 15:09 SISheogorath

Also have been experiencing this

BobyMCbobs avatar Oct 13 '24 20:10 BobyMCbobs

Gvisor is still broken with talos main

 Warning  FailedKillPod           17s    kubelet            error killing pod: failed to "KillPodSandbox" for "01ee1caf-9da0-40af-a663-5408d37d8a0e" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"

frezbo avatar Oct 16 '24 07:10 frezbo

Can you try with gvisor-debug and debug containerd logs so that we can capture more?

smira avatar Oct 16 '24 09:10 smira

It seems that with gvisor-debug added, it's still using the runsc.toml from gvisor instead of the one from gvisor-debug:

❯ talosctl -n 10.5.0.3 read /etc/cri/conf.d/gvisor-debug.part
[debug]
  level = "debug"
[plugins."io.containerd.runtime.v1.linux"]
  shim_debug = true

❯ talosctl -n 10.5.0.3 read /etc/cri/conf.d/runsc.toml 
[runsc_config]

❯ talosctl -n 10.5.0.3 get extensions
WARNING: 10.5.0.3: server version 1.8.0-alpha.2-70-ga9bff3a1d-dirty is older than client version 1.8.1
NODE       NAMESPACE   TYPE              ID   VERSION   NAME           VERSION
10.5.0.3   runtime     ExtensionStatus   0    1         gvisor-debug   v1.0.0
10.5.0.3   runtime     ExtensionStatus   1    1         gvisor         20240826.0

frezbo avatar Oct 16 '24 10:10 frezbo

not sure why

frezbo avatar Oct 16 '24 10:10 frezbo

I wonder if that's the order of extensions?

smira avatar Oct 16 '24 10:10 smira

I think we should integrate gvisor debug with the general gvisor extension and just add them as additional runtimes.

They would remain unusable unless someone configures a RuntimeClass for debugging, and this would help reduce the overhead we see here right now.
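
As a rough sketch of the Kubernetes side of this proposal: a RuntimeClass maps a name to a runtime handler configured in containerd, and only pods that set `runtimeClassName` run under it. The handler name `runsc-debug` and the object names below are hypothetical. They assume the extension would register a debug runtime under that name in the containerd CRI configuration, which nothing in this thread confirms.

```yaml
# Hypothetical RuntimeClass for a debug gVisor runtime.
# Assumption: containerd has a CRI runtime registered as "runsc-debug".
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor-debug
handler: runsc-debug
---
# Only pods that opt in via runtimeClassName use the debug runtime;
# all other pods are unaffected.
apiVersion: v1
kind: Pod
metadata:
  name: gvisor-debug-test
spec:
  runtimeClassName: gvisor-debug
  containers:
    - name: test
      image: nginx
```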

SISheogorath avatar Oct 16 '24 10:10 SISheogorath

attaching support zip and runsc logs: support.zip, runsc.tar.gz

frezbo avatar Oct 16 '24 15:10 frezbo

I don't see any errors in the logs you posted so far.

smira avatar Oct 17 '24 13:10 smira

yeh, that's the thing, it's just the pod fails to terminate

frezbo avatar Oct 17 '24 15:10 frezbo

I'm quite sure it's a containerd vs gvisor-shim problem.

Given how many breaking changes containerd v2 introduced in that space:

https://github.com/containerd/containerd/blob/main/docs/containerd-2.0.md#whats-breaking

~~It's probably broken from here: https://github.com/google/gvisor/blob/abe38d82ac3634264608259d1c60003cdd53658a/shim/cli/cli.go#L27~~

~~As it's called out in containerd v2 as removed here: https://github.com/containerd/containerd/blob/main/docs/containerd-2.0.md#iocontainerdruntimev1linux-and-iocontainerdruncv1-have-been-removed~~

SISheogorath avatar Oct 17 '24 17:10 SISheogorath

> I'm quite sure it's a containerd vs gvisor-shim problem.
>
> Given how many breaking changes containerd v2 introduced in that space:
>
> https://github.com/containerd/containerd/blob/main/docs/containerd-2.0.md#whats-breaking
>
> It's probably broken from here: https://github.com/google/gvisor/blob/abe38d82ac3634264608259d1c60003cdd53658a/shim/cli/cli.go#L27
>
> As it's called out in containerd v2 as removed here: https://github.com/containerd/containerd/blob/main/docs/containerd-2.0.md#iocontainerdruntimev1linux-and-iocontainerdruncv1-have-been-removed

would you like to create an upstream issue then?

frezbo avatar Oct 18 '24 04:10 frezbo

I think containerd removed its own runc.v1 shim, which is totally unrelated to gvisor, but there might still be some issue of course.

smira avatar Oct 18 '24 11:10 smira

containerd issue: https://github.com/containerd/containerd/issues/10891

frezbo avatar Oct 24 '24 14:10 frezbo

New gvisor issue here: https://github.com/google/gvisor/issues/11308

frezbo avatar Dec 18 '24 14:12 frezbo

> I'm quite sure it's a containerd vs gvisor-shim problem. Given how many breaking changes containerd v2 introduced in that space:

@SISheogorath @smira I saw that containerd v2.0.1 was released just 5 days back: https://github.com/containerd/containerd/releases/tag/v2.0.1.

Have you been on containerd v2 from before that? From your investigation in https://github.com/google/gvisor/issues/11308#issuecomment-2552449875, your intuition does feel correct. Something at the shim level is misbehaving (i.e. the shim is not being invoked the way it's expecting to be).

ayushr2 avatar Dec 18 '24 23:12 ayushr2

@ayushr2 Yes, containerd v2 has been used starting from Talos Linux v1.8.0, initially as release candidates; with v1.8.3, containerd v2 became stable.

SISheogorath avatar Dec 19 '24 03:12 SISheogorath

until a solution is available, is there a workaround to clean up those resources on talos's end?

xyhhx avatar Jan 03 '25 11:01 xyhhx

> until a solution is available, is there a workaround to clean up those resources on talos's end?

not really. Triggering a reboot would clean them up, as talos will forcefully remove the pods. Support for containerd v2 is coming from the gvisor side.

frezbo avatar Jan 03 '25 12:01 frezbo

sadge

xyhhx avatar Jan 03 '25 16:01 xyhhx

looks like this is the issue to watch re: gvisor supporting containerd v2: https://github.com/google/gvisor/issues/11319

xyhhx avatar Jan 10 '25 02:01 xyhhx

The issue was fixed a long time ago via https://github.com/siderolabs/talos/issues/10681

smira avatar Jul 29 '25 18:07 smira