
setting cgroup config for procHooks process caused: failed to write "100000"

Open raxod502 opened this issue 3 years ago • 0 comments

The incident at https://riju.statuspage.io/incidents/xc559lskkttw was caused by an error that, for reasons I have not determined, did not show up in the container logs. It was visible when I connected to the EC2 instance and tried to start a session manually:

admin@ip-172-31-1-13:~$ sudo docker exec -it riju-app-green bash
riju@93ea824572b0:/src$ make sandbox L=python
L=python node backend/sandbox.js
Starting session with UUID 3f13a0f56a4844d1b8972c0a2aed3102
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: process_linux.go:508: setting cgroup config for procHooks process caused: failed to write "100000": write /sys/fs/cgroup/cpu,cpuacct/riju.slice/docker-fb919378f50b91e7e4e6e070b853342a3b4dbbb468dc5bfc4487264f0286050f.scope/cpu.cfs_quota_us: invalid argument: unknown.
ERRO[0000] error waiting for container: context canceled 
container did not come up within 10 seconds (errno 17)

When I applied https://github.com/raxod502/riju/commit/0d92a7792235e48e1b84a095eec5e1bd1febc2b9 to the production server, the error above started occurring, and when I reverted that commit, the error went away. However, additional testing left me uncertain whether the commit actually triggered the problem.

The issue may be due to https://github.com/kubernetes/kubernetes/issues/72878, which points to a kernel bug that was patched some time ago. We would need to verify that the patch is included in the kernel version we are running on EC2.
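
As a starting point for that verification, here is a rough diagnostic sketch, not a confirmed fix. It assumes cgroup v1 mounted at /sys/fs/cgroup/cpu,cpuacct (as in the error path above), and the helper names show_cfs_limits and quota_fits_parent are hypothetical. Separately from the kernel bug, cgroup v1 is documented to reject a child CPU limit whose quota/period ratio exceeds its parent's limit, and that also surfaces as "invalid argument", so the sketch checks whether writing 100000 under riju.slice would violate that rule. Whether this is what happened here is an assumption.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting CFS bandwidth limits on cgroup v1.

# Print cpu.cfs_period_us and cpu.cfs_quota_us for the cgroup directory in $1.
show_cfs_limits() {
    dir=$1
    for f in cpu.cfs_period_us cpu.cfs_quota_us; do
        if [ -r "$dir/$f" ]; then
            printf '%s=%s\n' "$f" "$(cat "$dir/$f")"
        else
            printf '%s=unreadable\n' "$f"
        fi
    done
}

# Return 0 if a child limit of $2 (quota, us) per $3 (period, us) fits within
# the limit of the parent cgroup directory $1, 1 if it does not, 2 on read errors.
quota_fits_parent() {
    parent_dir=$1
    child_quota=$2
    child_period=$3
    parent_quota=$(cat "$parent_dir/cpu.cfs_quota_us") || return 2
    parent_period=$(cat "$parent_dir/cpu.cfs_period_us") || return 2
    # A parent quota of -1 means the parent is unlimited.
    if [ "$parent_quota" -lt 0 ]; then
        return 0
    fi
    # Compare child_quota/child_period <= parent_quota/parent_period
    # by cross-multiplying, to stay in integer arithmetic.
    [ $((child_quota * parent_period)) -le $((parent_quota * child_period)) ]
}
```

On the EC2 host, one could run `uname -r` to record the kernel version, then `show_cfs_limits /sys/fs/cgroup/cpu,cpuacct/riju.slice`, then `quota_fits_parent /sys/fs/cgroup/cpu,cpuacct/riju.slice 100000 100000` (the second 100000 assumes the default CFS period). A non-zero result from the last command would suggest the slice's own limit is a candidate cause. A zero result would leave the kernel bug as the more likely explanation. Note that the check only looks at the immediate parent, not limits inherited from higher ancestors.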

raxod502, Oct 24 '21 20:10