OCPBUGS-28981: fix busyloop when killing container
This fixes the busy loop on a failed kill (D-state process) and reschedules the timer to sometime in the future. Since targetTime has expired, the loop re-enters the kill case statement without waiting. The pod is terminal at this point, so a 10ms pause seems ok.
Does this PR introduce a user-facing change?
Fix busy loop on killing a D-state container
@rphillips: This pull request references Jira Issue OCPBUGS-28981, which is invalid:
- expected the bug to target the "4.16.0" version, but no target version was set
Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.
The bug has been updated to refer to the pull request using the external bug tracker.
In response to this:
This fixes the busy loop on a failed kill (D-state process) and reschedules the timer to sometime in the future. Since targetTime has expired, the loop re-enters the kill case statement without waiting. The pod is terminal at this point, so a 10ms pause seems ok.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: rphillips. Once this PR has been reviewed and has the lgtm label, please assign giuseppe for approval. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
/jira refresh
@rphillips: This pull request references Jira Issue OCPBUGS-28981, which is valid. The bug has been moved to the POST state.
3 validation(s) were run on this bug
- bug is open, matching expected state (open)
- bug target version (4.16.0) matches configured target version for branch (4.16.0)
- bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)
No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.
In response to this:
/jira refresh
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Codecov Report
Merging #7736 (eeb0ad8) into main (8510499) will increase coverage by 0.00%. The diff coverage is 88.46%.
Additional details and impacted files
@@ Coverage Diff @@
## main #7736 +/- ##
=======================================
Coverage 47.95% 47.95%
=======================================
Files 146 146
Lines 16274 16283 +9
=======================================
+ Hits 7804 7809 +5
- Misses 7517 7521 +4
Partials 953 953
I think this would be better added to the code already in https://github.com/cri-o/cri-o/blob/main/internal/oci/container.go#L530-L540
or more precisely https://github.com/cri-o/cri-o/blob/main/internal/oci/container_linux.go#L53
@haircommander If I move it to container_linux, then the FreeBSD portion is not implemented.
Updated to not be implemented on bsd.
sorry I may not have been clear. Queries to /proc are not trivially cheap and I think we should consolidate reading from /proc/pid/status if possible. We're already reading the start time whenever we check Living(); we could extend the getPidStartTime function to be getPidStartAndState and read both in the same function. We could then treat processes in D state as not running
@rphillips: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/e2e-gcp-ovn | 5c111c94c50a633edf10e89a3c5969c5fe55edfe | link | true | /test e2e-gcp-ovn |
| ci/prow/ci-cgroupv2-integration | 5c111c94c50a633edf10e89a3c5969c5fe55edfe | link | true | /test ci-cgroupv2-integration |
| ci/prow/ci-crun-e2e | 5c111c94c50a633edf10e89a3c5969c5fe55edfe | link | true | /test ci-crun-e2e |
| ci/prow/ci-fedora-integration | 5c111c94c50a633edf10e89a3c5969c5fe55edfe | link | true | /test ci-fedora-integration |
| ci/prow/ci-e2e | 5c111c94c50a633edf10e89a3c5969c5fe55edfe | link | true | /test ci-e2e |
| ci/prow/ci-e2e-evented-pleg | 5c111c94c50a633edf10e89a3c5969c5fe55edfe | link | true | /test ci-e2e-evented-pleg |
Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/assign @rphillips
I'm wondering if we should just add an exponential backoff and call it a day. We can't easily communicate to the kubelet that the process is in "D" state without exiting StopLoopForContainer, which would require it to be reentrant, or potentially cause other transient failures (runc kill decides to take a while?) from killing the container. The safest and simplest is to just try less frequently and wait for the kernel to catch up. What do folks think?
[...] The safest and simplest is to just try less frequently and wait for the kernel to catch up. What do folks think?
Personally, I believe a simple solution might be the best served here. Otherwise, there are a lot of edge cases we might have to potentially address to make the stopping code robust.
To set the scene a little:
- Signals are reentrant, non-blocking and asynchronous
- Sending a signal to a process that is in uninterruptible sleep without TASK_KILLABLE set will leave the signal pending
- There is no way for us to correctly ascertain from userspace the type of the "D" process state
- That is, whether the TASK_UNINTERRUPTIBLE state has been set with or without the TASK_KILLABLE flag
- A process can be "blocked" directly, or become unresponsive should one of its children currently be in a blocking state
While testing the proposed changes I was able to identify a few potential issues:
(A small Linux kernel module used to aid testing can be found here).
- Testing a single process within a container that was in the blocked state revealed a potential race or ordering issue
panic: send on closed channel
goroutine 1810 [running]:
github.com/cri-o/cri-o/internal/oci.(*Container).WaitOnStopTimeout(0xc00089c000, {0x2e3beb0, 0xc0005d0150}, 0xa)
github.com/cri-o/cri-o/internal/oci/container.go:604 +0xc5
github.com/cri-o/cri-o/internal/oci.(*runtimeOCI).StopContainer(0xc000153620, {0x2e3beb0?, 0xc0005d0120?}, 0xc00089c000, 0xa)
github.com/cri-o/cri-o/internal/oci/runtime_oci.go:816 +0x352
github.com/cri-o/cri-o/internal/oci.(*Runtime).StopContainer(0xc0003cfb90, {0x2e3beb0?, 0xc0005d0090?}, 0xc00089c000, 0xa)
github.com/cri-o/cri-o/internal/oci/oci.go:309 +0x102
github.com/cri-o/cri-o/server.(*Server).stopContainer(0xc000452808, {0x2e3beb0?, 0xc001069bf0?}, 0xc00089c000, 0xa)
github.com/cri-o/cri-o/server/container_stop.go:60 +0x40b
github.com/cri-o/cri-o/server.(*Server).stopPodSandbox.func1()
github.com/cri-o/cri-o/server/sandbox_stop_linux.go:63 +0x38
golang.org/x/sync/errgroup.(*Group).Go.func1()
golang.org/x/[email protected]/errgroup/errgroup.go:78 +0x56
created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 1785
golang.org/x/[email protected]/errgroup/errgroup.go:75 +0x96
Related logs:
time="2024-02-20 02:36:26.182569470Z" level=warning msg="Stopping container 314423af59714f38f65ea8a23072248bccf3bacd43147fdc1db64527d515b791 with stop signal timed out. Killing" file="oci/runtime_oci.go:875"
time="2024-02-20 02:36:26.189833904Z" level=debug msg="Unmounted container 314423af59714f38f65ea8a23072248bccf3bacd43147fdc1db64527d515b791" file="storage/runtime.go:495" id=e026368c-a07e-478e-ac8e-bb9218b5fec3 name=/runtime.v1.RuntimeService/StopContainer
time="2024-02-20 02:36:26.195471499Z" level=info msg="Stopped container 314423af59714f38f65ea8a23072248bccf3bacd43147fdc1db64527d515b791: default/test/test" file="server/container_stop.go:29" id=e026368c-a07e-478e-ac8e-bb9218b5fec3 name=/runtime.v1.RuntimeService/StopContainer
time="2024-02-20 02:36:26.196017909Z" level=debug msg="Request: &StopPodSandboxRequest{PodSandboxId:07327609189e09d497a5647c3befaa3e6f3d64119591ba82d6821919039ec726,}" file="otel-collector/interceptors.go:62" id=08b90184-ad3d-4172-9f2e-9058d0d1b9ca name=/runtime.v1.RuntimeService/StopPodSandbox
time="2024-02-20 02:36:26.196048272Z" level=info msg="Stopping pod sandbox: 07327609189e09d497a5647c3befaa3e6f3d64119591ba82d6821919039ec726" file="server/sandbox_stop.go:18" id=08b90184-ad3d-4172-9f2e-9058d0d1b9ca name=/runtime.v1.RuntimeService/StopPodSandbox
The panic occurs because the SetAsDoneStopping() function closes the stopTimeoutChan channel while the StopContainer() function might still have a pending new timeout to send through it.
- The 10-millisecond delay is not set (renewed) as a new timeout
The following block is never run, as sending a signal to a blocked process (provided the process exists) does not elicit any error from the kernel (the syscall succeeds).
if _, err := r.runtimeCmd("kill", c.ID(), "KILL"); err != nil {
log.Errorf(ctx, "Killing container %v failed: %v", c.ID(), err)
targetTime = time.Now().Add(10 * time.Millisecond)
}
Thus, when runc or crun are asked to kill the containers, their internal code does not fail, and as such, no error is set, which means that we won't be updating the targetTime variable with the new value.
- The Living() function had to be taught to skip process state checks
A boolean had to be passed through Living() to pid() and then to verifyPid(), denoting whether the process state check should be performed. This allows an early, or initial, liveness check to ignore a potentially momentary "D" state of a given process (the process might have been sleeping only briefly).
- We can only check the process state of the main, or the "init" process, running within the container
If a single process runs within the container, or this process happens to be the first process, then we can easily access and check its state. Should it be "D" and we were going to terminate the process anyway, the undesired state will be correctly detected and the loop will be terminated early, as intended.
However, if the process inside spawned one or more child processes, and one of these non-detached children is currently blocked and might remain in such a state for a longer period of time, then the parent will also be unresponsive, and as such, sending termination signals to it wouldn't have the desired effect; the signals will simply remain pending.
This, in turn, means that we won't be able to detect the "D" state of the process, container, etc., while on the path to termination, so the loop will continue to spin, exhibiting the same behaviour as the original code currently does.
This is because we only pass the state.InitPid from the Container type each time we look at the process running within the container. Since the first or "init" process would birth the container into existence, it does make sense to check for this specific process only.
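For illustration, walking beyond the init process would mean reading the kernel's per-task children list. A hedged sketch, assuming a kernel built with CONFIG_PROC_CHILDREN (the helper names are hypothetical; cri-o itself only inspects state.InitPid):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseChildPids parses the space-separated pid list found in
// /proc/<pid>/task/<tid>/children (available when the kernel is built
// with CONFIG_PROC_CHILDREN).
func parseChildPids(data string) ([]int, error) {
	var pids []int
	for _, f := range strings.Fields(data) {
		n, err := strconv.Atoi(f)
		if err != nil {
			return nil, err
		}
		pids = append(pids, n)
	}
	return pids, nil
}

// childPids reads the direct children of pid's main thread; each child
// could then have its own state checked via /proc/<child>/stat.
func childPids(pid int) ([]int, error) {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/task/%d/children", pid, pid))
	if err != nil {
		return nil, err
	}
	return parseChildPids(string(data))
}

func main() {
	pids, err := childPids(os.Getpid())
	fmt.Println(pids, err)
}
```

Whether such a walk is worth the extra /proc traffic on every stop attempt is exactly the cost concern raised earlier in the thread.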
- Can container storage be unmounted while the process is still running?
When a single process running within the container is due to be terminated and it happens to be in the blocking "D" state, then we would terminate the busy loop early. Then, the following code would run:
https://github.com/cri-o/cri-o/blob/87ff09f9e045797ae43c1a7bde121b6cecb76954/internal/storage/runtime.go#L468-L497
The question here would be: is this safe to run for a process (container) that might still be running for a while?
The question here would be: is this safe to run for a process (container) that might still be running for a while?
hard to tell without knowing exactly the stage the oci runtime fails to kill the container, but my guess is we'd get EBUSY on attempting to unmount the storage. I think retrying with an exponential backoff until the process is gone from the process table is the only safe and sustainable solution.
@rphillips: This pull request references Jira Issue OCPBUGS-28981. The bug has been updated to no longer refer to the pull request using the external bug tracker. All external bug links have been closed. The bug has been moved to the NEW state.
In response to this:
This fixes the busy loop on a failed kill (D-state process) and reschedules the timer to sometime in the future. Since targetTime has expired, the loop re-enters the kill case statement without waiting. The pod is terminal at this point, so a 10ms pause seems ok.
Does this PR introduce a user-facing change?
Fix busy loop on killing a D-state container
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.