OCPBUGS-28981: fix busyloop when killing container
This fixes the busy loop on a failed kill (D-state process) and reschedules the timer to sometime in the future. Since targetTime has expired, the loop re-enters the kill case statement without waiting. The pod is terminal at this point, so a 10ms pause seems ok.
Does this PR introduce a user-facing change?
Fix busy loop on killing a D-state container
@rphillips: This pull request references Jira Issue OCPBUGS-28981, which is invalid:
- expected the bug to target the "4.16.0" version, but no target version was set
Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.
The bug has been updated to refer to the pull request using the external bug tracker.
In response to this:
This fixes the busy loop on a failed kill (D-state process) and reschedules the timer to sometime in the future. Since targetTime has expired, the loop re-enters the kill case statement without waiting. The pod is terminal at this point, so a 10ms pause seems ok.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: rphillips. Once this PR has been reviewed and has the lgtm label, please assign giuseppe for approval. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
/jira refresh
@rphillips: This pull request references Jira Issue OCPBUGS-28981, which is valid. The bug has been moved to the POST state.
3 validation(s) were run on this bug
- bug is open, matching expected state (open)
- bug target version (4.16.0) matches configured target version for branch (4.16.0)
- bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)
No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.
In response to this:
/jira refresh
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Codecov Report
Merging #7736 (eeb0ad8) into main (8510499) will increase coverage by 0.00%. The diff coverage is 88.46%.
Additional details and impacted files
@@ Coverage Diff @@
## main #7736 +/- ##
=======================================
Coverage 47.95% 47.95%
=======================================
Files 146 146
Lines 16274 16283 +9
=======================================
+ Hits 7804 7809 +5
- Misses 7517 7521 +4
Partials 953 953
I think this would be better added to the code already in https://github.com/cri-o/cri-o/blob/main/internal/oci/container.go#L530-L540
or more precisely https://github.com/cri-o/cri-o/blob/main/internal/oci/container_linux.go#L53
@haircommander If I move it to container_linux, then the FreeBSD portion is not implemented.
Updated to not be implemented on bsd.
sorry I may not have been clear. Queries to /proc are not trivially cheap and I think we should consolidate reading from /proc/pid/status if possible. We're already reading the start time whenever we check Living(); we could extend the getPidStartTime function to be getPidStartAndState and read both in the same function. We could then treat processes in D state as not running
@rphillips: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/e2e-gcp-ovn | 5c111c94c50a633edf10e89a3c5969c5fe55edfe | link | true | /test e2e-gcp-ovn |
| ci/prow/ci-cgroupv2-integration | 5c111c94c50a633edf10e89a3c5969c5fe55edfe | link | true | /test ci-cgroupv2-integration |
| ci/prow/ci-crun-e2e | 5c111c94c50a633edf10e89a3c5969c5fe55edfe | link | true | /test ci-crun-e2e |
| ci/prow/ci-fedora-integration | 5c111c94c50a633edf10e89a3c5969c5fe55edfe | link | true | /test ci-fedora-integration |
| ci/prow/ci-e2e | 5c111c94c50a633edf10e89a3c5969c5fe55edfe | link | true | /test ci-e2e |
| ci/prow/ci-e2e-evented-pleg | 5c111c94c50a633edf10e89a3c5969c5fe55edfe | link | true | /test ci-e2e-evented-pleg |
Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/assign @rphillips
I'm wondering if we should just add an exponential backoff and call it a day. We can't easily communicate to the kubelet that the process is in "D" state without exiting StopLoopForContainer, which would require it to be reentrant, or potentially cause other transient failures (runc kill decides to take a while?) from killing the container. The safest and simplest is to just try less frequently and wait for the kernel to catch up. What do folks think?
[...] The safest and simplest is to just try less frequently and wait for the kernel to catch up. What do folks think?
Personally, I believe a simple solution might be the best served here. Otherwise, there are a lot of edge cases we might have to potentially address to make the stopping code robust.
To set the scene a little:
- Signals are reentrant, non-blocking and asynchronous
- Sending a signal to a process that is in uninterruptible sleep without TASK_KILLABLE set will leave the signal pending
- There is no way for us to correctly ascertain from userspace the type of the "D" process state
- That is, whether the TASK_UNINTERRUPTIBLE state has been set with or without the TASK_KILLABLE flag
- A process can be "blocked" directly, or become unresponsive should one of its children currently be in a blocking state
While testing the proposed changes I was able to identify a few potential issues:
(A small Linux kernel module used to aid testing can be found here).
- Testing a single process within a container that was in the blocked state revealed a potential race or ordering issue
panic: send on closed channel
goroutine 1810 [running]:
github.com/cri-o/cri-o/internal/oci.(*Container).WaitOnStopTimeout(0xc00089c000, {0x2e3beb0, 0xc0005d0150}, 0xa)
github.com/cri-o/cri-o/internal/oci/container.go:604 +0xc5
github.com/cri-o/cri-o/internal/oci.(*runtimeOCI).StopContainer(0xc000153620, {0x2e3beb0?, 0xc0005d0120?}, 0xc00089c000, 0xa)
github.com/cri-o/cri-o/internal/oci/runtime_oci.go:816 +0x352
github.com/cri-o/cri-o/internal/oci.(*Runtime).StopContainer(0xc0003cfb90, {0x2e3beb0?, 0xc0005d0090?}, 0xc00089c000, 0xa)
github.com/cri-o/cri-o/internal/oci/oci.go:309 +0x102
github.com/cri-o/cri-o/server.(*Server).stopContainer(0xc000452808, {0x2e3beb0?, 0xc001069bf0?}, 0xc00089c000, 0xa)
github.com/cri-o/cri-o/server/container_stop.go:60 +0x40b
github.com/cri-o/cri-o/server.(*Server).stopPodSandbox.func1()
github.com/cri-o/cri-o/server/sandbox_stop_linux.go:63 +0x38
golang.org/x/sync/errgroup.(*Group).Go.func1()
golang.org/x/[email protected]/errgroup/errgroup.go:78 +0x56
created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 1785
golang.org/x/[email protected]/errgroup/errgroup.go:75 +0x96
Related logs:
time="2024-02-20 02:36:26.182569470Z" level=warning msg="Stopping container 314423af59714f38f65ea8a23072248bccf3bacd43147fdc1db64527d515b791 with stop signal timed out. Killing" file="oci/runtime_oci.go:875"
time="2024-02-20 02:36:26.189833904Z" level=debug msg="Unmounted container 314423af59714f38f65ea8a23072248bccf3bacd43147fdc1db64527d515b791" file="storage/runtime.go:495" id=e026368c-a07e-478e-ac8e-bb9218b5fec3 name=/runtime.v1.RuntimeService/StopContainer
time="2024-02-20 02:36:26.195471499Z" level=info msg="Stopped container 314423af59714f38f65ea8a23072248bccf3bacd43147fdc1db64527d515b791: default/test/test" file="server/container_stop.go:29" id=e026368c-a07e-478e-ac8e-bb9218b5fec3 name=/runtime.v1.RuntimeService/StopContainer
time="2024-02-20 02:36:26.196017909Z" level=debug msg="Request: &StopPodSandboxRequest{PodSandboxId:07327609189e09d497a5647c3befaa3e6f3d64119591ba82d6821919039ec726,}" file="otel-collector/interceptors.go:62" id=08b90184-ad3d-4172-9f2e-9058d0d1b9ca name=/runtime.v1.RuntimeService/StopPodSandbox
time="2024-02-20 02:36:26.196048272Z" level=info msg="Stopping pod sandbox: 07327609189e09d497a5647c3befaa3e6f3d64119591ba82d6821919039ec726" file="server/sandbox_stop.go:18" id=08b90184-ad3d-4172-9f2e-9058d0d1b9ca name=/runtime.v1.RuntimeService/StopPodSandbox
The panic occurs because the SetAsDoneStopping() function closes the stopTimeoutChan channel while the StopContainer() function might still have a pending new timeout to send through it.
- The 10-millisecond delay is not set (renewed) as a new timeout
The following block is never run, as sending a signal to a blocked process (provided the process exists) does not elicit any error from the kernel (the syscall succeeds).
if _, err := r.runtimeCmd("kill", c.ID(), "KILL"); err != nil {
log.Errorf(ctx, "Killing container %v failed: %v", c.ID(), err)
targetTime = time.Now().Add(10 * time.Millisecond)
}
Thus, when runc or crun are asked to kill the containers, their internal code does not fail, and as such, no error is set, which means that we won't be updating the targetTime variable with the new value.
- The Living() function had to be taught to skip process state checks
A boolean had to be passed through Living() to pid() and then to verifyPid(), denoting whether the process state check should be performed. This allows an early, or initial, liveness check to ignore a potentially momentary "D" state of a given process (the process might have been sleeping only briefly).
- We can only check the process state of the main, or the "init" process, running within the container
If a single process runs within the container, or this process happens to be the first process, then we can easily access and check its state. Should it be "D" and we were going to terminate the process anyway, the undesired state will be correctly detected and the loop will be terminated early, as intended.
However, if the process inside spawned one or more child processes, and one of these non-detached children is currently blocked and might remain in such a state for a longer period of time, then the parent will also be unresponsive, and as such, sending termination signals to it wouldn't have the desired effect; the signals will simply remain pending.
This, in turn, means that we won't be able to detect the "D" state of the process, container, etc., while on the path to termination, so the loop will continue to spin, exhibiting the same behaviour as the original code currently does.
This is because we only pass the state.InitPid from the Container type each time we look at the process running within the container. Since the first or "init" process would birth the container into existence, it does make sense to check for this specific process only.
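For illustration, walking beyond the init process would mean reading the kernel's per-task children list. A hedged sketch, assuming a kernel built with CONFIG_PROC_CHILDREN (the helper names are hypothetical; cri-o itself only inspects state.InitPid):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseChildPids parses the space-separated pid list found in
// /proc/<pid>/task/<tid>/children (available when the kernel is built
// with CONFIG_PROC_CHILDREN).
func parseChildPids(data string) ([]int, error) {
	var pids []int
	for _, f := range strings.Fields(data) {
		n, err := strconv.Atoi(f)
		if err != nil {
			return nil, err
		}
		pids = append(pids, n)
	}
	return pids, nil
}

// childPids reads the direct children of pid's main thread; each child
// could then have its own state checked via /proc/<child>/stat.
func childPids(pid int) ([]int, error) {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/task/%d/children", pid, pid))
	if err != nil {
		return nil, err
	}
	return parseChildPids(string(data))
}

func main() {
	pids, err := childPids(os.Getpid())
	fmt.Println(pids, err)
}
```

Whether such a walk is worth the extra /proc traffic on every stop attempt is exactly the cost concern raised earlier in the thread.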
- Can container storage be unmounted while the process is still running?
When a single process running within the container is due to be terminated and it happens to be in the blocking "D" state, then we would terminate the busy loop early. Then, the following code would run:
https://github.com/cri-o/cri-o/blob/87ff09f9e045797ae43c1a7bde121b6cecb76954/internal/storage/runtime.go#L468-L497
The question here would be: is this safe to run for a process (container) that might still be running for a while?
The question here would be: is this safe to run for a process (container) that might still be running for a while?
hard to tell without knowing exactly the stage the oci runtime fails to kill the container, but my guess is we'd get EBUSY on attempting to unmount the storage. I think retrying with an exponential backoff until the process is gone from the process table is the only safe and sustainable solution.
@rphillips: This pull request references Jira Issue OCPBUGS-28981. The bug has been updated to no longer refer to the pull request using the external bug tracker. All external bug links have been closed. The bug has been moved to the NEW state.
In response to this:
This fixes the busy loop on a failed kill (D-state process) and reschedules the timer to sometime in the future. Since targetTime has expired, the loop re-enters the kill case statement without waiting. The pod is terminal at this point, so a 10ms pause seems ok.
Does this PR introduce a user-facing change?
Fix busy loop on killing a D-state container
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.