amazon-eks-ami
Pods are stuck in a Terminating state on specific nodes
What happened: We have several EKS clusters that autoscale throughout the day - they handle burst workloads, and on a given day the underlying ASGs may scale down to 0 nodes or up to tens of nodes.
We've noticed that every once in a while a node ends up with pods stuck in a 'Terminating' state for days on end, until we manually intervene and force-delete them.
In our case this is especially problematic when it happens to the EBS CSI driver pod: it stops handling volume attachment requests and holds on to its bound volumes until we intervene manually.
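(For reference, the manual intervention mentioned above is just a forced deletion of the stuck pod object; something along these lines, with the pod name taken from the kubelet logs further down:)

```sh
# Force-remove the stuck pod object from the API server. This only clears the
# API object - it does not clean up whatever is actually wedged on the node.
kubectl delete pod csi-driver-aws-csi-driver-node-2qssj -n default \
  --grace-period=0 --force
```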
I've SSH-ed into one of the nodes that exhibited this behavior and tried to introspect it layer by layer. Here's what I've gathered, hopefully covering most abstraction layers, but unfortunately I could not find the root cause - I'd love to know what I can do to debug this further.
Quick summary:
- Kubelet is trying to kill an already dead container indefinitely
- Docker thinks the container is running, although it is not
- Containerd shows the task is stopped
- The containerd-shim process wasn't reaped and is still alive
The above leads me to conclude that something is fishy in the distributed orchestration of killing a container. I'd assume it's somewhere between containerd and runc, but I'm not entirely sure - and I'd love to know how I can better pinpoint the exact cause.
Deeper dive per abstraction layer -
Kubernetes / kubelet:
- The pod is stuck in a 'Terminating state'
- There are no finalizers on the pod
- When I describe the pod, I see something similar to 'Normal Killing 2m49s (x1715 over 2d6h) kubelet Stopping container csi-node'
- In journalctl -u kubelet, I see the following near the time the pod started to terminate:
Jul 07 03:40:25 ip-10-0-73-87.ec2.internal kubelet[4811]: I0707 03:40:20.474218 4811 kubelet.go:1848] SyncLoop (DELETE, "api"): "csi-driver-aws-csi-driver-node-2qssj_default(6f2f36c4-06f0-406d-9681-b92fa0106441)"
Jul 07 03:40:32 ip-10-0-73-87.ec2.internal kubelet[4811]: I0707 03:40:32.052286 4811 kubelet.go:1870] SyncLoop (PLEG): "csi-driver-aws-csi-driver-node-2qssj_default(6f2f36c4-06f0-406d-9681-b92fa0106441)", event: &pleg.PodLifecycleEvent{ID:"6f2f36c4-06f0-406d-9681-b92fa0106441", Type:"ContainerDied", Data:"789845b4d5dc620ce62b36ff5a6d2ef380a725226264b0170d33ea645bb837f1"}
Jul 07 03:41:18 ip-10-0-73-87.ec2.internal kubelet[4811]: I0707 03:41:18.375163 4811 kubelet.go:1870] SyncLoop (PLEG): "csi-driver-aws-csi-driver-node-2qssj_default(6f2f36c4-06f0-406d-9681-b92fa0106441)", event: &pleg.PodLifecycleEvent{ID:"6f2f36c4-06f0-406d-9681-b92fa0106441", Type:"ContainerDied", Data:"650c0009c27bfd04f4578e3a5fe2ce0eea300088acb5e669f1d71c5d187139ff"}
- Then, I see indefinite "killing container" messages:
Jul 09 10:32:23 ip-10-0-73-87.ec2.internal kubelet[4811]: I0709 10:32:23.591613 4811 kuberuntime_container.go:635] Killing container "docker://cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a" with a 30 second grace period
Jul 09 10:34:24 ip-10-0-73-87.ec2.internal kubelet[4811]: I0709 10:34:24.591670 4811 kuberuntime_container.go:635] Killing container "docker://cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a" with a 30 second grace period
Jul 09 10:36:09 ip-10-0-73-87.ec2.internal kubelet[4811]: I0709 10:36:09.591654 4811 kuberuntime_container.go:635] Killing container "docker://cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a" with a 30 second grace period
- I've uploaded the kubelet stacktrace here - kubelet-stacktrace.txt
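For completeness, the kubelet-side evidence above was gathered roughly like this (pod name, pod UID and container ID are the ones that appear in the logs; the --since window is arbitrary):

```sh
# Pod events - shows the repeating "Killing / Stopping container csi-node" event
kubectl describe pod csi-driver-aws-csi-driver-node-2qssj -n default

# Kubelet journal entries mentioning the pod UID or the stuck container ID
journalctl -u kubelet --since "2021-07-07 03:30:00" \
  | grep -E '6f2f36c4-06f0-406d-9681-b92fa0106441|cd7ed93ae2d1'
```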
Docker:
- When I run docker container ls / docker inspect, the container status is 'Running':
{
"Id": "cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a",
"Created": "2021-07-07T00:11:27.14530101Z",
"Path": "/entrypoint.sh",
"Args": [
"--endpoint=unix:///csi/csi.sock",
"--v=4",
"--volume-attach-limit=5"
],
"State": {
"Status": "running",
"Running": true,
"Paused": false,
"Restarting": false,
"OOMKilled": false,
"Dead": false,
"Pid": 7064,
"ExitCode": 0,
"Error": "",
"StartedAt": "2021-07-07T00:11:36.641265855Z",
"FinishedAt": "0001-01-01T00:00:00Z"
},
...
}
- In journalctl, looking near the time the pod started terminating, I see the following:
-- Logs begin at Wed 2021-07-07 00:10:20 UTC, end at Fri 2021-07-09 10:46:56 UTC. --
Jul 07 03:30:43 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:30:43.252002398Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jul 07 03:30:44 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:30:44.542537801Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jul 07 03:39:32 ip-10-0-73-87.ec2.internal dockerd[4325]: http: superfluous response.WriteHeader call from github.com/docker/docker/api/server/httputils.WriteJSON (httputils_write_json.go:11)
on returned error: write unix /var/run/docker.sock->@: write: broken pipe"
Jul 07 03:40:35 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:40:35.735692652Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jul 07 03:40:44 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:40:44.165635762Z" level=error msg="stream copy error: reading from a closed fifo"
Jul 07 03:40:47 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:40:47.504174793Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jul 07 03:40:56 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:40:55.687736472Z" level=info msg="Container cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a failed to exit within 30 seconds of signal 15 - using the force"
Jul 07 03:41:07 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:41:06.802768003Z" level=info msg="Container cd7ed93ae2d1 failed to exit within 10 seconds of kill - trying direct SIGKILL"
Jul 07 03:41:17 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:41:17.058149021Z" level=warning msg="Published ports are discarded when using host network mode"
Jul 07 03:41:17 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:41:17.066107174Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jul 07 03:41:48 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:41:48.397339761Z" level=info msg="Container cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a failed to exit within 30 seconds of signal 15 - using the force"
Jul 07 03:41:58 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:41:58.415920505Z" level=info msg="Container cd7ed93ae2d1 failed to exit within 10 seconds of kill - trying direct SIGKILL"
Jul 07 03:42:28 ip-10-0-73-87.ec2.internal dockerd[4325]: time="2021-07-07T03:42:28.634636074Z" level=info msg="Container cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a failed to exit within 30 seconds of signal 15 - using the force"
- The SIGKILL / 'using the force' messages continue indefinitely
- I've uploaded the dockerd stacktrace here - docker-stacktrace.txt
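For anyone retracing this, the Docker-side state and the dockerd stack dump can be collected with something like the following. The goroutine-dump behavior (dockerd writes its stacks to a file and logs the location when it receives SIGUSR1) is what I relied on; the exact log message and file path may differ between Docker versions, so treat them as assumptions:

```sh
# Docker's view of the container: still "running" with pid 7064
docker inspect -f '{{.State.Status}} pid={{.State.Pid}}' cd7ed93ae2d1

# Ask dockerd for a goroutine dump and find out where it wrote it
sudo kill -SIGUSR1 "$(pidof dockerd)"
journalctl -u docker -n 20 | grep -i 'goroutine stacks written'
```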
Containerd:
- In containerd, the task of the container is 'STOPPED' -
cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a 7064 STOPPED
- The task's metrics prints the following:
ID TIMESTAMP
cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a 2021-07-09 10:52:16.172595382 +0000 UTC
METRIC VALUE
memory.usage_in_bytes 20029440
memory.limit_in_bytes 9223372036854771712
memory.stat.cache 11411456
cpuacct.usage 69532053147
cpuacct.usage_percpu [14916245050 21098017259 14389165228 19128625610]
pids.current 0
pids.limit 0
- The container's info doesn't show anything particularly helpful:
{
"ID": "cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a",
"Labels": {
"com.docker/engine.bundle.path": "/var/run/docker/containerd/cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a"
},
"Image": "",
"Runtime": {
"Name": "io.containerd.runtime.v1.linux",
"Options": {
"type_url": "containerd.linux.runc.RuncOptions",
"value": "CgRydW5jEhwvdmFyL3J1bi9kb2NrZXIvcnVudGltZS1ydW5j"
}
},
"SnapshotKey": "",
"Snapshotter": "",
"CreatedAt": "2021-07-07T00:11:36.498772067Z",
"UpdatedAt": "2021-07-07T00:11:36.498772067Z",
"Extensions": null,
"Spec": {
"ociVersion": "1.0.1-dev",
...
},
...
}
- journalctl -u containerd has no log entries mentioning the given container id
- I've uploaded the containerd stack trace here - containerd-goroutine-stacktrace.txt
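Roughly the commands behind the containerd-level output above - dockerd drives containerd under the moby namespace, so everything goes through ctr with that namespace (the subcommand names below are from containerd 1.4's ctr; adjust if your version differs):

```sh
# Task status as containerd sees it (prints: <id> <pid> STOPPED)
sudo ctr --namespace moby tasks ls | grep cd7ed93

# Cgroup metrics and container metadata for the stuck container
sudo ctr --namespace moby tasks metrics cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a
sudo ctr --namespace moby containers info cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a
```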
containerd-shim:
- The containerd-shim of the container is still running:
root 7023 0.0 0.0 710748 6268 ? Sl Jul07 0:12 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a -address /run/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc
- I tried to peek at the stdout / stderr of the shim, but either the process hadn't flushed any data or they are simply empty
- To find the shim id, I looked at the process's fds and correlated the open sockets with ss:
[root@ip-10-0-73-87 ~]# ls -al /proc/$pid/fd | awk '/socket/ { print $NF }' | grep -o '[0-9]*' | xargs -I{} sh -c "ss -apn | grep {}"
u_str LISTEN 0 4096 /run/containerd/s/6e99a634bfa5b915cbeade50e47384f60874a9358e5e96cb59523a46339c138b 25584
* 0 users:(("containerd-shim",pid=7023,fd=12),("containerd",pid=3538,fd=80))
u_str ESTAB 0 0 /run/containerd/s/6e99a634bfa5b915cbeade50e47384f60874a9358e5e96cb59523a46339c138b 24369
* 25595 users:(("containerd-shim",pid=7023,fd=3))
u_str ESTAB 0 0 * 25595
* 24369 users:(("containerd",pid=3538,fd=87))
- Then I looked at the containerd journal entries for the shim id (6e99a6...) and the only thing that came up was:
Jul 07 00:11:36 ip-10-0-73-87.ec2.internal containerd[3538]: time="2021-07-07T00:11:36.521930693Z" level=info msg="shim containerd-shim started" address="unix:///run/containerd/s/6e99a634bfa5b915cbeade50e47384f60874a9358e5e96cb59523a46339c138b" debug=false pid=7023
- I've managed to retrieve the shim's go stack trace by strace-ing it to a file and sending kill -USR1 to it, but I don't see anything of particular interest there:
shim stacktrace:
write(8</var/lib/containerd/io.containerd.runtime.v1.linux/moby/cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a/shim.stdout.log>, "time="2021-07-09T11:56:32Z" level=info msg="=== BEGIN goroutine stack dump ===
goroutine 9 [running]:
main.dumpStacks(0xc000064150)
/builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:276 +0x74
main.executeShim.func1(0xc00004a1e0, 0xc000064150)
/builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:186 +0x3d
created by main.executeShim
/builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:184 +0x5e9
goroutine 1 [select]:
main.handleSignals(0xc000064150, 0xc0000a8720, 0xc00005c090, 0xc00006e000, 0xc0000d9e50, 0x0)
/builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:239 +0x119
main.executeShim(0xc0000d2540, 0x7fe26c883088)
/builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:189 +0x625
main.main()
/builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:118 +0x20a
goroutine 18 [chan receive]:
main.main.func1()
/builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:89 +0x85
created by main.main
/builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:88 +0x74
goroutine 6 [syscall]:
syscall.Syscall6(0xe8, 0xb, 0xc00003b9b8, 0x80, 0xffffffffffffffff, 0x0, 0x0, 0xffffffffffffffff, 0x0, 0x4)
/usr/lib/golang/src/syscall/asm_linux_amd64.s:41 +0x5
github.com/containerd/containerd/vendor/golang.org/x/sys/unix.EpollWait(0xb, 0xc00003b9b8, 0x80, 0x80, 0xffffffffffffffff, 0xffffffffffffffff, 0x85a8e0, 0xa5cb40)
/builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/golang.org/x/sys/unix/zsyscall_linux_amd64.go:76 +0x72
github.com/containerd/containerd/vendor/github.com/containerd/console.(*Epoller).Wait(0xc00005c2d0, 0xc0000d9a78, 0x8)
/builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/console/console_linux.go:111 +0x7a
created by github.com/containerd/containerd/runtime/v1/shim.(*Service).initPlatform
/builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/runtime/v1/shim/service_linux.go:113 +0xbb
goroutine 5 [chan receive]:
github.com/containerd/containerd/runtime/v1/shim.(*Service).processExits(0xc00006e000)
/builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/runtime/v1/shim/service.go:501 +0xd6
created by github.com/containerd/containerd/runtime/v1/shim.NewService
/builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/runtime/v1/shim/service.go:92 +0x40d
goroutine 24 [runnable]:
os/signal.process(0x85ec60, 0xa5cb70)
/usr/lib/golang/src/os/signal/signal.go:240 +0x10a
os/signal.loop()
/usr/lib/golang/src/os/signal/signal_unix.go:23 +0x45
created by os/signal.Notify.func1.1
/usr/lib/golang/src/os/signal/signal.go:150 +0x45
goroutine 7 [chan receive, 3375 minutes]:
github.com/containerd/containerd/runtime/v1/shim.(*Service).forward(0xc00006e000, 0x85a3e0, 0xc00001a030)
/builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/runtime/v1/shim/service.go:579 +0x71
created by github.com/containerd/containerd/runtime/v1/shim.NewService
/builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/runtime/v1/shim/service.go:96 +0x4bf
goroutine 8 [IO wait, 23 minutes]:
internal/poll.runtime_pollWait(0x7fe26c882de8, 0x72, 0x0)
/usr/lib/golang/src/runtime/netpoll.go:220 +0x55
internal/poll.(*pollDesc).wait(0xc000074098, 0x72, 0x0, 0x0, 0x7fee5f)
/usr/lib/golang/src/internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitRead(...)
/usr/lib/golang/src/internal/poll/fd_poll_runtime.go:92
internal/poll.(*FD).Accept(0xc000074080, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
/usr/lib/golang/src/internal/poll/fd_unix.go:394 +0x1fc
net.(*netFD).accept(0xc000074080, 0xc00009e458, 0xc0000121b8, 0x79ccc0)
/usr/lib/golang/src/net/fd_unix.go:172 +0x45
net.(*UnixListener).accept(0xc00005c330, 0xc00008fe50, 0xc00008fe58, 0x18)
/usr/lib/golang/src/net/unixsock_posix.go:162 +0x32
net.(*UnixListener).Accept(0xc00005c330, 0x812878, 0xc000012190, 0x862f20, 0xc0000a0000)
/usr/lib/golang/src/net/unixsock.go:260 +0x65
github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*Server).Serve(0xc00005c090, 0x862f20, 0xc0000a0000, 0x861720, 0xc00005c330, 0x0, 0x0)
/builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/server.go:87 +0x107
main.serve.func1(0x861720, 0xc00005c330, 0xc00005c090, 0x862f20, 0xc0000a0000)
/builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:224 +0x88
created by main.serve
/builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:222 +0x1fe
goroutine 25 [select]:
github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*serverConn).run(0xc00009e3c0, 0x862f20, 0xc0000a0000)
/builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/server.go:431 +0x433
created by github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*Server).Serve
/builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/server.go:127 +0x28d
goroutine 10 [IO wait]:
internal/poll.runtime_pollWait(0x7fe26c882d08, 0x72, 0x85a8e0)
/usr/lib/golang/src/runtime/netpoll.go:220 +0x55
internal/poll.(*pollDesc).wait(0xc0000e2318, 0x72, 0x85a800, 0xa4b248, 0x0)
/usr/lib/golang/src/internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitRead(...)
/usr/lib/golang/src/internal/poll/fd_poll_runtime.go:92
internal/poll.(*FD).Read(0xc0000e2300, 0xc000077000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
/usr/lib/golang/src/internal/poll/fd_unix.go:159 +0x1a5
net.(*netFD).Read(0xc0000e2300, 0xc000077000, 0x1000, 0x1000, 0xc000056270, 0xc00018fc28, 0x20)
/usr/lib/golang/src/net/fd_posix.go:55 +0x4f
net.(*conn).Read(0xc0000a6078, 0xc000077000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
/usr/lib/golang/src/net/net.go:182 +0x8e
bufio.(*Reader).Read(0xc00004a420, 0xc000078060, 0xa, 0xa, 0xc00018fd98, 0x447f74, 0xc00018ff18)
/usr/lib/golang/src/bufio/bufio.go:227 +0x222
io.ReadAtLeast(0x859fa0, 0xc00004a420, 0xc000078060, 0xa, 0xa, 0xa, 0xa, 0x2, 0x789f60)
/usr/lib/golang/src/io/io.go:314 +0x87
io.ReadFull(...)
/usr/lib/golang/src/io/io.go:333
github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.readMessageHeader(0xc000078060, 0xa, 0xa, 0x859fa0, 0xc00004a420, 0x73, 0xc00002ec00, 0x0, 0x6b0aae)
/builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/channel.go:53 +0x69
github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*channel).recv(0xc000078040, 0xc00018fe2c, 0x3, 0x2, 0xc00014a400, 0x0, 0x0, 0x0)
/builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/channel.go:101 +0x6b
github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*serverConn).run.func1(0xc00005e120, 0xc00009e3c0, 0xc00005e1e0, 0xc000078040, 0xc00005e180, 0xc00004a480)
/builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/server.go:362 +0x1b0
created by github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*serverConn).run
/builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/server.go:332 +0x2c5
goroutine 17533 [select, 23 minutes]:
github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*serverConn).run(0xc00009e1e0, 0x862f20, 0xc0000a0000)
/builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/server.go:431 +0x433
created by github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*Server).Serve
/builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/server.go:127 +0x28d
goroutine 17545 [select, 23 minutes]:
github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*serverConn).run(0xc000012190, 0x862f20, 0xc0000a0000)
/builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/server.go:431 +0x433
created by github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*Server).Serve
/builddir/build/BUILD/containerd-1.4.1-2.amzn2/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/server.go:127 +0x28d
=== END goroutine stack dump ===" namespace=moby path=/run/containerd/io.containerd.runtime.v1.linux/moby/cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a pid=7023"
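For reference, the strace + USR1 trick mentioned above amounts to something like this (shim pid 7023 from the ps output; the shim writes its goroutine dump to its shim.stdout.log, which is why strace-ing its write() calls captures it):

```sh
# Terminal 1: capture the shim's write() calls to a file
sudo strace -f -p 7023 -e trace=write -s 100000 -o /tmp/shim-strace.txt

# Terminal 2: ask the shim for a goroutine dump, then Ctrl-C the strace above
sudo kill -USR1 7023
```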
runc:
- There are no hanging runc processes that I could observe, nor any related journal entries; I'm not sure how to introspect this layer after the fact.
- runc --root /var/run/docker/runtime-runc/moby events cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a shows "container with id ... is not running"
- runc list shows the container as stopped
- runc state shows the following:
{
"ociVersion": "1.0.2-dev",
"id": "cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a",
"pid": 0,
"status": "stopped",
"bundle": "/run/containerd/io.containerd.runtime.v1.linux/moby/cd7ed93ae2d106564609055e17b24679860bc6cfbfdb5c845f3644815387a37a",
"rootfs": "/var/lib/docker/overlay2/91052c9a423d3bee16e65973b097e7d0eb35baa5200e2c8f05f3484850e4c1c1/merged",
"created": "2021-07-07T00:11:36.60742938Z",
"owner": ""
}
os:
- The container's process is gone, meaning it was actually killed
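A quick sketch of that OS-level check, using the pids seen above (7064 is the container pid reported by docker inspect / ctr tasks ls, 7023 is the shim):

```sh
# The container's init process is gone - ps finds nothing for pid 7064
ps -p 7064 -o pid,ppid,stat,args || echo "pid 7064 no longer exists"

# ...while its containerd-shim is still alive
ps -p 7023 -o pid,ppid,stat,args
```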
What you expected to happen: Pods should terminate once their underlying container has died.
How to reproduce it (as minimally and precisely as possible): I'm not sure how to reproduce it consistently - I'd assume it happens when nodes are created and destroyed rapidly.
Environment
- AWS Region: us-east-1 (N. Virginia)
- Instance Type(s): t3.xlarge
- EKS Platform version: eks.4
- Kubernetes version: 1.19
- AMI Version: 1.19.6-20210628
- Kernel: 5.4.117-58.216.amzn2.x86_64
- Release information:
BASE_AMI_ID="ami-0136f7c838fded2f6"
BUILD_TIME="Mon Jun 28 16:39:26 UTC 2021"
BUILD_KERNEL="5.4.117-58.216.amzn2.x86_64"
ARCH="x86_64"
- Docker version output:
Client:
Version: 19.03.13-ce
API version: 1.40
Go version: go1.13.15
Git commit: 4484c46
Built: Mon Oct 12 18:51:20 2020
OS/Arch: linux/amd64
Experimental: false
Server:
Engine:
Version: 19.03.13-ce
API version: 1.40 (minimum version 1.12)
Go version: go1.13.15
Git commit: 4484c46
Built: Mon Oct 12 18:51:50 2020
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.4.1
GitCommit: c623d1b36f09f8ef6536a057bd658b3aa8632828
runc:
Version: 1.0.0-rc93
GitCommit: 12644e614e25b05da6fd08a38ffa0cfe1903fdec
docker-init:
Version: 0.19.0
GitCommit: de40ad0
- Docker info output:
Client:
Debug Mode: false
Server:
Containers: 14
Running: 9
Paused: 0
Stopped: 5
Images: 8
Server Version: 19.03.13-ce
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: c623d1b36f09f8ef6536a057bd658b3aa8632828
runc version: 12644e614e25b05da6fd08a38ffa0cfe1903fdec
init version: de40ad0 (expected: fec3683)
Security Options:
seccomp
Profile: default
Kernel Version: 5.4.117-58.216.amzn2.x86_64
Operating System: Amazon Linux 2
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 15.46GiB
Name: ip-10-0-73-87.ec2.internal
ID: BFFR:6SUN:2BSZ:4MB4:K5NO:OBN2:6VHK:Z2YQ:LS3U:KPEW:5TUV:AEBW
Docker Root Dir: /var/lib/docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: true
Are you still seeing this on recent AMIs? This looks like an issue in containerd, to me.