Flaky test `test_delete_after_create` - failed to remove cgroup #422
same failure as on #422
2023-12-11T16:24:37.6263250Z [1m[32m Finished[0m test [unoptimized] target(s) in 28.73s
2023-12-11T16:24:38.0688247Z ././target/build/x86_64-unknown-linux-musl//x86_64-unknown-linux-musl/debug/deps/containerd_shim_wasmer-4cf9e40c15afbde7 --nocapture --test-threads=1
2023-12-11T16:24:38.1117914Z
2023-12-11T16:24:38.1118414Z running 7 tests
2023-12-11T16:24:38.6147511Z test wasmer_tests::test_custom_entrypoint ... ok
2023-12-11T16:24:38.7404433Z [2023-12-11T16:24:38Z ERROR libcontainer::container::container_delete] failed to remove cgroup due to: V2(WrappedIo(Other { err: Custom { kind: TimedOut, error: "could not delete" }, path: "/sys/fs/cgroup/test" })) cgroup_path="test"
2023-12-11T16:24:38.7408303Z Error: io error: at /sys/fs/cgroup/test: could not delete
2023-12-11T16:24:38.7409256Z
2023-12-11T16:24:38.7409508Z Caused by:
2023-12-11T16:24:38.7409998Z 0: io error: at /sys/fs/cgroup/test: could not delete
2023-12-11T16:24:38.7411140Z 1: at /sys/fs/cgroup/test: could not delete
2023-12-11T16:24:38.7412084Z test wasmer_tests::test_delete_after_create ... FAILED
2023-12-11T16:24:39.6997094Z test wasmer_tests::test_exit_code ... ok
2023-12-11T16:24:40.4872602Z test wasmer_tests::test_has_default_devices ... ok
2023-12-11T16:24:40.9817523Z test wasmer_tests::test_hello_world ... ok
2023-12-11T16:24:41.9711765Z test wasmer_tests::test_seccomp ... ok
2023-12-11T16:24:42.3465552Z [2023-12-11T16:24:42Z ERROR wasmer_wasix::state::env] wasi[1]::wasi initialize error (Missing export memory)
2023-12-11T16:24:42.8445822Z test wasmer_tests::test_unreachable ... ok
2023-12-11T16:24:42.8446223Z
2023-12-11T16:24:42.8446548Z failures:
2023-12-11T16:24:42.8446734Z
2023-12-11T16:24:42.8446917Z failures:
2023-12-11T16:24:42.8447238Z wasmer_tests::test_delete_after_create
2023-12-11T16:24:42.8447557Z
2023-12-11T16:24:42.8448359Z test result: FAILED. 6 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 4.73s
2023-12-11T16:24:42.8449129Z
2023-12-11T16:24:42.8470890Z make: *** [Makefile:126: test-wasmer] Error 123
2023-12-11T16:24:42.8487753Z ##[error]Process completed with exit code 2.
Originally posted by @jsturtevant in https://github.com/containerd/runwasi/issues/420#issuecomment-1850500767
Another instance:
https://github.com/containerd/runwasi/actions/runs/7332652774/job/19967109853?pr=428
https://github.com/containerd/runwasi/actions/runs/7450055138/job/20268090826?pr=444
AFAICS the failing test point is always test_delete_after_create.
Unlike the other test points, this one never calls start() nor wait().
I can imagine two options:
- Some setup is happening in
startthat this tespoint doesn't hit. But then I would expect this test point to always fail. - There's some race condition between
deleteand the cgroups path being created. As we usually wait for the child process to finish, this is not a proble, but in this test point we might be callingdeletetoo quickly? This seems the most likely to me.
I don't have the bandwidth to validate any of this at the moment, but I can take a look next week.
The failure is not specific to wasmer. I can reproduce with any runtime.
I'm running
while RUST_LOG=trace cargo test --quiet --package containerd-shim-wasmtime -- --nocapture --test-threads=1 test_delete_after_create; do echo .; done
That sporadically fails to remove the cgroup folder (that's why it's in a loop).
It seems that youki's libcgroups is already trying to work around this issue (see https://github.com/containers/youki/pull/63 and https://github.com/containers/youki/pull/333), where youki tries to delete the folder a few times with a small delay between attempts.
The underlying cause is that trying to remove the folder results in Err(Os { code: 16, kind: ResourceBusy, message: "Device or resource busy" }).
Locally, I usually see that the first attempt fails, and the second succeeds, but some times all 4 attempts fail.
When we delete the container, we send SIGKILL to the container init process. IIUC, it takes some time from when we send SIGKILL to the process, until the kernel allows us to delete the cgroup. My best guess is that we could increase the number of attempts that youki makes to remove the folder. But that doesn't explain why we only see the failure in this test and not others. I should try reproduce this with youki alone.
ping @utam0k
Other test cases in this file take 10 seconds. It may be related... I'm not sure... https://github.com/containerd/runwasi/blob/71f8df9cc576ea9564b3ea692432e80b454da7e5/crates/containerd-shim-wasmer/src/tests.rs#L12
It seems runc attempts 100 retries. We need to implement the same logic... 😭
https://github.com/opencontainers/runc/blob/02120488a4c0fc487d1ed2867e901eeed7ce8ecf/libcontainer/state_linux.go#L49-L51
https://github.com/opencontainers/runc/blob/02120488a4c0fc487d1ed2867e901eeed7ce8ecf/delete.go#L16-L25
That 100 retries is wild.