riju
setting cgroup config for procHooks process caused: failed to write "100000"
The incident https://riju.statuspage.io/incidents/xc559lskkttw was caused by an error that, for reasons I have not determined, did not show up in the container logs, but was visible when I connected to the EC2 instance and tried to start a session manually:
admin@ip-172-31-1-13:~$ sudo docker exec -it riju-app-green bash
riju@93ea824572b0:/src$ make sandbox L=python
L=python node backend/sandbox.js
Starting session with UUID 3f13a0f56a4844d1b8972c0a2aed3102
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: process_linux.go:508: setting cgroup config for procHooks process caused: failed to write "100000": write /sys/fs/cgroup/cpu,cpuacct/riju.slice/docker-fb919378f50b91e7e4e6e070b853342a3b4dbbb468dc5bfc4487264f0286050f.scope/cpu.cfs_quota_us: invalid argument: unknown.
ERRO[0000] error waiting for container: context canceled
container did not come up within 10 seconds (errno 17)
When I applied https://github.com/raxod502/riju/commit/0d92a7792235e48e1b84a095eec5e1bd1febc2b9 to the production server, the above issue started occurring, and when I reverted those changes, the issue went away. However, additional testing left me uncertain whether those changes actually triggered the problem.
The issue may be due to https://github.com/kubernetes/kubernetes/issues/72878, which points to a kernel bug that was patched some time ago. We would need to verify that the patch is included in the kernel version we are running on EC2.
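Separately from the kernel bug, one thing that might be worth ruling out: on cgroup v1, the kernel rejects a write to cpu.cfs_quota_us with "invalid argument" when the child's quota/period ratio would exceed its parent's. Below is a minimal sketch of that check, assuming cgroup v1 and the riju.slice path taken from the error message above. The function name quota_fits is hypothetical, and the default period of 100000 is an assumption; this is a diagnostic aid, not a confirmed explanation of the incident.

```shell
# Hypothetical helper: report whether a requested CPU quota would fit within
# the limit configured on a parent cgroup (cgroup v1 layout assumed).
#   $1 = parent cgroup directory containing cpu.cfs_quota_us / cpu.cfs_period_us
#   $2 = requested quota in microseconds
#   $3 = requested period in microseconds (assumed default: 100000)
quota_fits() {
  qf_dir=$1
  qf_req_quota=$2
  qf_req_period=${3:-100000}
  qf_parent_quota=$(cat "$qf_dir/cpu.cfs_quota_us")
  qf_parent_period=$(cat "$qf_dir/cpu.cfs_period_us")

  # A parent quota of -1 means no limit, so any child quota is acceptable.
  if [ "$qf_parent_quota" -lt 0 ]; then
    echo "parent is unlimited"
    return 0
  fi

  # Compare req_quota/req_period against parent_quota/parent_period by
  # cross-multiplying, which avoids fractional arithmetic in the shell.
  if [ $((qf_req_quota * qf_parent_period)) -le $((qf_parent_quota * qf_req_period)) ]; then
    echo "fits within parent limit"
  else
    echo "exceeds parent limit"
  fi
}

# Example invocation on the EC2 host (path taken from the error above):
#   quota_fits /sys/fs/cgroup/cpu,cpuacct/riju.slice 100000
```

If this reports that the parent is unlimited or that the quota fits, the parent limit is probably not the culprit, and checking the running kernel release with `uname -r` against the fix referenced in the Kubernetes issue would be the next step.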