runwasi Flaky test `test_delete_after_create` - failed to remove cgroup #422

          same failure as on #422

2023-12-11T16:24:37.6263250Z [1m[32m    Finished[0m test [unoptimized] target(s) in 28.73s
2023-12-11T16:24:38.0688247Z ././target/build/x86_64-unknown-linux-musl//x86_64-unknown-linux-musl/debug/deps/containerd_shim_wasmer-4cf9e40c15afbde7 --nocapture --test-threads=1
2023-12-11T16:24:38.1117914Z 
2023-12-11T16:24:38.1118414Z running 7 tests
2023-12-11T16:24:38.6147511Z test wasmer_tests::test_custom_entrypoint ... ok
2023-12-11T16:24:38.7404433Z [2023-12-11T16:24:38Z ERROR libcontainer::container::container_delete] failed to remove cgroup due to: V2(WrappedIo(Other { err: Custom { kind: TimedOut, error: "could not delete" }, path: "/sys/fs/cgroup/test" })) cgroup_path="test"
2023-12-11T16:24:38.7408303Z Error: io error: at /sys/fs/cgroup/test: could not delete
2023-12-11T16:24:38.7409256Z 
2023-12-11T16:24:38.7409508Z Caused by:
2023-12-11T16:24:38.7409998Z     0: io error: at /sys/fs/cgroup/test: could not delete
2023-12-11T16:24:38.7411140Z     1: at /sys/fs/cgroup/test: could not delete
2023-12-11T16:24:38.7412084Z test wasmer_tests::test_delete_after_create ... FAILED
2023-12-11T16:24:39.6997094Z test wasmer_tests::test_exit_code ... ok
2023-12-11T16:24:40.4872602Z test wasmer_tests::test_has_default_devices ... ok
2023-12-11T16:24:40.9817523Z test wasmer_tests::test_hello_world ... ok
2023-12-11T16:24:41.9711765Z test wasmer_tests::test_seccomp ... ok
2023-12-11T16:24:42.3465552Z [2023-12-11T16:24:42Z ERROR wasmer_wasix::state::env] wasi[1]::wasi initialize error (Missing export memory)
2023-12-11T16:24:42.8445822Z test wasmer_tests::test_unreachable ... ok
2023-12-11T16:24:42.8446223Z 
2023-12-11T16:24:42.8446548Z failures:
2023-12-11T16:24:42.8446734Z 
2023-12-11T16:24:42.8446917Z failures:
2023-12-11T16:24:42.8447238Z     wasmer_tests::test_delete_after_create
2023-12-11T16:24:42.8447557Z 
2023-12-11T16:24:42.8448359Z test result: FAILED. 6 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 4.73s
2023-12-11T16:24:42.8449129Z 
2023-12-11T16:24:42.8470890Z make: *** [Makefile:126: test-wasmer] Error 123
2023-12-11T16:24:42.8487753Z ##[error]Process completed with exit code 2.

Originally posted by @jsturtevant in https://github.com/containerd/runwasi/issues/420#issuecomment-1850500767

Dec 11 '23 20:12 jsturtevant

Another instance:

https://github.com/containerd/runwasi/actions/runs/7332652774/job/19967109853?pr=428

Jan 04 '24 18:01 Mossaka

https://github.com/containerd/runwasi/actions/runs/7450055138/job/20268090826?pr=444

Jan 08 '24 16:01 Mossaka

AFAICS the failing test point is always test_delete_after_create. Unlike the other test points, this one never calls start() nor wait(). I can imagine two options:

Some setup is happening in start that this tespoint doesn't hit. But then I would expect this test point to always fail.
There's some race condition between delete and the cgroups path being created. As we usually wait for the child process to finish, this is not a proble, but in this test point we might be calling delete too quickly? This seems the most likely to me.

I don't have the bandwidth to validate any of this at the moment, but I can take a look next week.

Jan 26 '24 17:01 jprendes

The failure is not specific to wasmer. I can reproduce with any runtime.

I'm running

while RUST_LOG=trace cargo test --quiet --package containerd-shim-wasmtime -- --nocapture --test-threads=1 test_delete_after_create; do echo .; done

That sporadically fails to remove the cgroup folder (that's why it's in a loop).

It seems that youki's libcgroups is already trying to work around this issue (see https://github.com/containers/youki/pull/63 and https://github.com/containers/youki/pull/333), where youki tries to delete the folder a few times with a small delay between attempts.

The underlying cause is that trying to remove the folder results in Err(Os { code: 16, kind: ResourceBusy, message: "Device or resource busy" }). Locally, I usually see that the first attempt fails, and the second succeeds, but some times all 4 attempts fail.

Feb 06 '24 14:02 jprendes

When we delete the container, we send SIGKILL to the container init process. IIUC, it takes some time from when we send SIGKILL to the process, until the kernel allows us to delete the cgroup. My best guess is that we could increase the number of attempts that youki makes to remove the folder. But that doesn't explain why we only see the failure in this test and not others. I should try reproduce this with youki alone.

ping @utam0k

Feb 06 '24 14:02 jprendes

Other test cases in this file take 10 seconds. It may be related... I'm not sure... https://github.com/containerd/runwasi/blob/71f8df9cc576ea9564b3ea692432e80b454da7e5/crates/containerd-shim-wasmer/src/tests.rs#L12

It seems runc attempts 100 retries. We need to implement the same logic... 😭 https://github.com/opencontainers/runc/blob/02120488a4c0fc487d1ed2867e901eeed7ce8ecf/libcontainer/state_linux.go#L49-L51 https://github.com/opencontainers/runc/blob/02120488a4c0fc487d1ed2867e901eeed7ce8ecf/delete.go#L16-L25

Feb 07 '24 12:02 utam0k

That 100 retries is wild.

Feb 07 '24 19:02 Mossaka