gvisor
EINVAL writing cgroup.procs when creating container
Description
We're seeing a rare issue where creating a gVisor container fails when it tries to join the cgroup it just created: the kernel returns EINVAL when runsc writes its process ID into the cgroup hierarchy. The only related report we've come across is opencontainers/runc#1884, but we don't have enough occurrences to reproduce it or verify that it's related.
Is there another situation in which the kernel would return EINVAL when a process tries to join a cgroup? I doubt the fix runc implemented (a C library shim that runs before the Go runtime starts) is an acceptable approach here, but maybe a simple retry would be enough?
This is on runsc release-20220905; the host kernel is 5.15.
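For illustration, here's a minimal sketch of the kind of retry I have in mind. This is hypothetical, not runsc's actual code; the package/function names, retry count, and backoff are made up.

// Hypothetical sketch: retry the cgroup.procs write a few times when the
// kernel transiently returns EINVAL. Not runsc code; for illustration only.
package cgroupretry

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"syscall"
	"time"
)

// joinCgroupWithRetry writes pid to <cgroupDir>/cgroup.procs, which moves the
// process into that cgroup, retrying briefly on EINVAL.
func joinCgroupWithRetry(cgroupDir string, pid int) error {
	procsFile := filepath.Join(cgroupDir, "cgroup.procs")
	var err error
	for attempt := 0; attempt < 5; attempt++ {
		err = os.WriteFile(procsFile, []byte(strconv.Itoa(pid)), 0o644)
		if err == nil {
			return nil
		}
		// Only retry the transient EINVAL seen in this issue; fail fast otherwise.
		if !errors.Is(err, syscall.EINVAL) {
			return err
		}
		time.Sleep(10 * time.Millisecond)
	}
	return fmt.Errorf("joining cgroup %q: %w", cgroupDir, err)
}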
Steps to reproduce
No response
runsc version
runsc version release-20220905.0
docker version (if using docker)
No response
uname
No response
kubectl (if using Kubernetes)
No response
repo state (if built from source)
No response
runsc debug logs (if available)
No response
Hi @jseba, are you able to get any runsc logs from the failing containers? That could help in diagnosing the exact issue. Also, does the issue happen more frequently under high exec load, as in the other bug?
If there's a way to get debug logs only when the container is created and not while it executes, I might be able to get something. Otherwise we simply don't have the storage space for these logs, since this issue hits only once or twice a month even with as many containers as we launch.
@jseba, please apply the patch from #8076 and add --debug-command=create (or --debug-command=!boot,gofer,events to drop the more verbose logs) to the runtime options to restrict which commands are logged.
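For reference, assuming the runtime is configured through Docker (the thread doesn't say; the same flags would go wherever your runtime manager passes runsc options), the runtime entry in /etc/docker/daemon.json could look roughly like this, with the binary path adjusted for your install:

{
  "runtimes": {
    "runsc": {
      "path": "/usr/local/bin/runsc",
      "runtimeArgs": [
        "--debug",
        "--debug-log=/var/log/runsc/",
        "--debug-command=create"
      ]
    }
  }
}

With --debug-log pointing at a directory (trailing slash), runsc should write a separate log file per command, so combined with --debug-command=create only the create logs accumulate.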
Okay, I finally managed to get some debug logs from the create command; nothing obvious stands out to me. This is from version release-20230214.0.
D0223 13:14:52.814844 3874296 container.go:185] Create container, cid: service_7c5c93f7-9cc9-3811-e398-ef3d2374ce89_e75e9298, rootDir: "/var/run/runsc"
D0223 13:14:52.814995 3874296 container.go:244] Creating new sandbox for container, cid: service_7c5c93f7-9cc9-3811-e398-ef3d2374ce89_e75e9298
D0223 13:14:52.815636 3874296 cgroup.go:410] New cgroup for pid: self, *cgroup.cgroupV2: &{Mountpoint:/sys/fs/cgroup Path:containers/service_group/service_7c5c93f7-9cc9-3811-e398-ef3d2374ce89_e75e9298/gv Controllers:[cpuset cpu io memory hugetlb pids misc] Own:[]}
D0223 13:14:52.815685 3874296 cgroup_v2.go:129] Installing cgroup path "/sys/fs/cgroup/containers/service_group/service_7c5c93f7-9cc9-3811-e398-ef3d2374ce89_e75e9298/gv"
D0223 13:14:52.835990 3874296 cgroup.go:116] Setting "/sys/fs/cgroup/containers/service_group/service_7c5c93f7-9cc9-3811-e398-ef3d2374ce89_e75e9298/gv/cpu.max" to "43478 10000"
D0223 13:14:52.836277 3874296 cgroup.go:116] Setting "/sys/fs/cgroup/containers/service_group/service_7c5c93f7-9cc9-3811-e398-ef3d2374ce89_e75e9298/gv/memory.max" to "1887436800"
D0223 13:14:52.836337 3874296 cgroup.go:116] Setting "/sys/fs/cgroup/containers/service_group/service_7c5c93f7-9cc9-3811-e398-ef3d2374ce89_e75e9298/gv/pids.max" to "4096"
D0223 13:14:52.837098 3874296 cgroup.go:116] Setting "/sys/fs/cgroup/containers/service_group/service_7c5c93f7-9cc9-3811-e398-ef3d2374ce89_e75e9298/gv/cgroup.procs" to "0"
D0223 13:14:52.837291 3874296 cgroup_v2.go:215] Restoring cgroup "/sys/fs/cgroup/containers/service_group/service_7c5c93f7-9cc9-3811-e398-ef3d2374ce89_e75e9298/cc"
D0223 13:14:52.837332 3874296 cgroup.go:116] Setting "/sys/fs/cgroup/containers/service_group/service_7c5c93f7-9cc9-3811-e398-ef3d2374ce89_e75e9298/mgmt/cgroup.procs" to "0"
D0223 13:14:52.837945 3874296 container.go:723] Destroy container, cid: service_7c5c93f7-9cc9-3811-e398-ef3d2374ce89_e75e9298
W0223 13:14:52.838090 3874296 util.go:64] FATAL ERROR: creating container: write /sys/fs/cgroup/containers/service_group/service_7c5c93f7-9cc9-3811-e398-ef3d2374ce89_e75e9298/gv/cgroup.procs: invalid argument
W0223 13:14:52.838198 3874296 main.go:261] Failure to execute command, err: 1
Some context on our cgroup layout: we have a top-level cgroup for all containers, containers, which holds the resource constraints for every container running on the machine combined. Below that is a cgroup for all the containers that make up a service group, service_group, which limits the entire group relative to other groups. The actual containers go below this, and each container gets two cgroups because of the cgroup v2 "no internal processes" rule, which forbids processes in non-leaf nodes of the tree: one for the management processes/sidecars, mgmt, and one for the gVisor sentry and gofer, gv. We don't enable cgroup threading, so all accounting should be by process.
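Concretely, using the paths from the log above (container ID abbreviated), the hierarchy looks roughly like this:

/sys/fs/cgroup/
└── containers/                      limits for everything on the machine
    └── service_group/               limits for the whole service group
        └── service_<id>/            one directory per container
            ├── mgmt/                management processes / sidecars
            └── gv/                  gVisor sentry and gofer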
Let me know if there's anything more I can provide or if there's a patch you'd like me to apply and try.
A friendly reminder that this issue had no activity for 120 days.
The only related report we've come across is opencontainers/runc#1884, but we don't have enough occurrences to reproduce it or verify that it's related.
That could be the root cause of this issue. BTW, it was fixed in the kernel: https://github.com/torvalds/linux/commit/df16b71c686cb096774e30153c9ce6756450796c
A friendly reminder that this issue had no activity for 120 days.
This issue has been closed due to lack of activity.