
container: add a timeout for deletion

Open · tych0 opened this issue 1 year ago • 9 comments

On a heavily loaded host, we were experiencing long container(d) deletion times:

containerd[8907]: time="2024-03-25T13:47:45.938479195Z" level=debug msg="event forwarded" ns=moby topic=/tasks/exit type=containerd.events.TaskExit
# our control plane logic deletes the successfully exited container via
# the docker API, and...
containerd[8907]: time="2024-03-25T13:47:47.202055216Z" level=debug msg="failed to delete task" error="context deadline exceeded" id=a618057629b35e3bfea82d5ce4cbb057ba979498496428dfe6935a1322b94add

Before 4bafaa00aa81 ("Refactor libcontainerd to minimize c8d RPCs"), when this happened, the docker API reported a 255 exit code and no error:

0a7ddd027c0497d5a titus-executor-[900884]: Processing msg from a docker: main container exited with code 255

which is especially confusing. After 4bafaa00aa81, the behavior has changed to report the container's real exit code, although there is still a hard-coded timeout after which containerd will (try to) stop cleaning up. We would like to wait for this cleanup, so let's add a user-configurable DeleteTimeout here.
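
For illustration, here is a minimal sketch of the kind of knob originally proposed here: a configurable timeout wrapped around the task-delete call, where a zero value means "wait as long as it takes". All names below are made up for the example; this is not the moby or containerd API.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// deleteTask stands in for the containerd task-delete call.
type deleteTask func(ctx context.Context) (exitCode uint32, err error)

// deleteWithTimeout applies the configured timeout, if any; a zero
// timeout means "wait as long as it takes".
func deleteWithTimeout(ctx context.Context, timeout time.Duration, del deleteTask) (uint32, error) {
	if timeout > 0 {
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(ctx, timeout)
		defer cancel()
	}
	return del(ctx)
}

func main() {
	// Pretend cleanup is slow, as on a heavily loaded host.
	slowDelete := func(ctx context.Context) (uint32, error) {
		select {
		case <-time.After(3 * time.Second):
			return 0, nil
		case <-ctx.Done():
			// Roughly the confusing pre-4bafaa00aa81 outcome: a bogus
			// exit code plus "context deadline exceeded".
			return 255, ctx.Err()
		}
	}
	code, err := deleteWithTimeout(context.Background(), time.Second, slowDelete)
	fmt.Println(code, err) // 255 context deadline exceeded
}
```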

Reported-by: Hechao Li [email protected]

tych0 · Apr 12 '24 14:04

Hmm, I'm not sure I understand the failure here,

=== Failed
=== FAIL: amd64.integration-cli TestDockerAPISuite/TestExecStateCleanup (10.43s)
    docker_api_exec_test.go:225: assertion failed: expected an error, got nil
    --- FAIL: TestDockerAPISuite/TestExecStateCleanup (10.43s)

It looks like it tries to submit and kill an invalid exec request; I don't think this code should be invoked on that path?

tych0 · Apr 12 '24 16:04

Commenting from my phone, but 👋 👋👋 good seeing you! Hope you're doing well ☺️

thaJeztah · Apr 12 '24 16:04

Thanks, I am! This PR is not urgent, but it would be nice to understand the failures here.

tych0 · Apr 12 '24 17:04

This doesn't seem like the right place to me. Deletion issues are not really part of the makeup of a container. The terminology is also somewhat ambiguous at the API level: it's unclear what DeleteTimeout means there, because this is not about deleting the container, just the underlying containerd task.

You mention you want to wait for the delete; what exactly do you mean by that, and how are you looking at it today?

cpuguy83 · Apr 25 '24 17:04

I discussed this with other maintainers yesterday, and here is where the existing behavior can be problematic:

Right now we have a timeout on containerd task deletion, which could be hit on a loaded node. That's OK, but only if the same container is restarted again, because the old task will be deleted before the container is started... it must be, otherwise we can't start the container. I think that deletion is even done without a timeout.

The real problem is when the timeout is hit and you then remove the (dockerd) container: the task is leaked, and you effectively have a zombie task, shim process, and containerd container.

Some observations in the existing code:

  • We don't actually need to collect the exit code from the task delete, we get that from the containerd event.
  • We also should not be processing that delete in the event loop, as this blocks other events from being processed.
  • Container removal (in dockerd) should block until the task is deleted (which, as mentioned before, is not happening today).
  • Having a timeout on task delete seems (mostly) counter-productive since it must always be deleted, especially when the system is under high load (see the sketch after this list).
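
To make those observations concrete, here is a hedged sketch of the shape being suggested: take the exit code from the exit event itself, and delete the task with no deadline so cleanup always finishes. Everything below is illustrative; none of it is the actual moby/libcontainerd code.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// taskExit mimics the information carried by a containerd TaskExit event.
type taskExit struct {
	containerID string
	exitCode    uint32
}

// deleteTaskFn stands in for the containerd task-delete RPC.
type deleteTaskFn func(ctx context.Context, containerID string) error

// handleExit records the exit code straight from the exit event (no need
// to read it back from the delete call) and then deletes the task with no
// deadline, so a loaded host just takes longer instead of leaking a
// zombie task, shim process, and containerd container.
func handleExit(ev taskExit, del deleteTaskFn, setExitCode func(id string, code uint32)) error {
	setExitCode(ev.containerID, ev.exitCode)
	return del(context.Background(), ev.containerID)
}

func main() {
	del := func(ctx context.Context, id string) error {
		time.Sleep(500 * time.Millisecond) // pretend the shim is slow to go away on a loaded host
		fmt.Println("task deleted:", id)
		return nil
	}
	record := func(id string, code uint32) {
		fmt.Printf("container %s exited with code %d\n", id, code)
	}

	if err := handleExit(taskExit{containerID: "c1", exitCode: 0}, del, record); err != nil {
		fmt.Println("delete failed:", err)
	}
}
```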

cpuguy83 · Apr 26 '24 17:04

Do you want the daemon to wait indefinitely until the task has been deleted so as not to leak c8d tasks, or do you want the delete API call to block until the task has been deleted so that your orchestrator knows when cleanup is complete? If it's the latter, this PR is insufficient, as the API handler code paths have timeouts on their waits for the container to leave the Running state. I'm going to assume you want the former. (Note that the daemon does not transition a container from the Running state to Stopped until after the task has been deleted, or the delete times out.)

Leaking c8d tasks on heavily loaded systems (or with runtimes which are slow to delete tasks) is bad for everyone. I would ideally want to find a solution which does not require any configuration, by actually waiting indefinitely. The daemon has a separate event queue for each (c8d) container, so blocking indefinitely while handling one event will not block events for other containers from being handled.

The trouble is, the Docker container's state is only transitioned from Running to Stopped after the delete operation returns, and until then the container continues to be reported as Running in the Engine API. And a container can't be restarted until after the old c8d task has been deleted, so we can't just transition the container to Stopped immediately upon receiving the c8d TaskExit event, before deleting the task.

I think that for waiting indefinitely to be workable in practice, the Docker container state machine will need a new "Exited" (Stopping?) state for the time when the task has exited but not yet been deleted. Exited containers could then be reported as such when inspected. More importantly, we could allow users to leak the c8d tasks on demand (releasing the Docker container resources) by making that the behaviour when force-deleting a container in the Exited state. Maybe it would also be useful for a non-force delete API request to block until the container has transitioned out of the Exited state?
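
Purely as an illustration of that proposal, the extra state might look something like the sketch below. The names and transitions are hypothetical and simplified; this is not moby's actual container state machine.

```go
package main

import "fmt"

// containerState is a hypothetical, simplified version of the states
// discussed above.
type containerState int

const (
	Running containerState = iota
	Exited                  // task has exited but has not yet been deleted (the proposed new state)
	Stopped                 // task deleted (or deliberately leaked); container may be removed or restarted
)

// next returns the state reached from s on a given event.
func next(s containerState, event string) (containerState, error) {
	switch {
	case s == Running && event == "task-exit":
		return Exited, nil
	case s == Exited && event == "task-deleted":
		return Stopped, nil
	case s == Exited && event == "force-delete":
		// Per the proposal, force-deleting an Exited container would
		// deliberately leak the c8d task and release Docker's resources.
		return Stopped, nil
	default:
		return s, fmt.Errorf("no transition from %d on %q", s, event)
	}
}

func main() {
	s, _ := next(Running, "task-exit")
	s, _ = next(s, "task-deleted")
	fmt.Println(s == Stopped) // true
}
```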

corhere · Apr 26 '24 18:04

  • We also should not be processing that delete in the event loop, as this blocks other events from being processed.

@cpuguy83 I checked into that: only other events for the same container ID are blocked. Each container ID gets its own separate event queue inside libcontainerd, and each event callback is invoked from a new goroutine. The existing implementation, where the event handler callback blocks until handleContainerExit returns, is correct.
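
For anyone following along, here is a toy version of that pattern: a per-key serial queue where a slow handler for one container ID does not delay other IDs. This is only a sketch of the idea, not the libcontainerd queue implementation.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// eventQueue runs handlers serially per key (container ID) while keys
// proceed independently of each other. Toy version; queues are never
// torn down here.
type eventQueue struct {
	mu     sync.Mutex
	queues map[string]chan func()
}

func newEventQueue() *eventQueue {
	return &eventQueue{queues: make(map[string]chan func())}
}

// Append enqueues fn for id, starting a worker goroutine for that id on
// first use. A slow handler only delays later events for the same id.
func (q *eventQueue) Append(id string, fn func()) {
	q.mu.Lock()
	ch, ok := q.queues[id]
	if !ok {
		ch = make(chan func(), 16)
		q.queues[id] = ch
		go func() {
			for f := range ch {
				f()
			}
		}()
	}
	q.mu.Unlock()
	ch <- fn
}

func main() {
	q := newEventQueue()
	var wg sync.WaitGroup
	wg.Add(2)

	q.Append("container-a", func() {
		defer wg.Done()
		time.Sleep(time.Second) // a slow, handleContainerExit-style handler
		fmt.Println("container-a: exit handled")
	})
	q.Append("container-b", func() {
		defer wg.Done()
		fmt.Println("container-b: handled without waiting for container-a")
	})

	wg.Wait()
}
```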

corhere · Apr 26 '24 19:04

This behavior was much more pathological before that commit (4bafaa00aa81): it used to report a 255 exit code when the delete timed out; now it no longer does. That is:

We don't actually need to collect the exit code from the task delete, we get that from the containerd event.

This statement is only true after the above commit. Before it, the behavior was /really/ bad :)

Do you want the daemon to wait indefinitely until the task has been deleted so as not to leak c8d tasks, or do you want the delete API call to block until the task has been deleted so that your orchestrator knows when cleanup is complete?

Ideally both, but the most important thing is to not leak c8d tasks. After the commit I linked, the behavior is fine, as Brian noted.

I would ideally want to find a solution which does not require any configuration, by actually waiting indefinitely.

Agreed. I can change this PR to reflect that if everyone else agrees?

tych0 · May 06 '24 17:05

Ping, just want to follow up here and see what people think.

tych0 · May 20 '24 14:05

@tych0 Sorry for the delay. Yes, I think that's a good path forward.

cpuguy83 · May 21 '24 14:05

Ok, I dropped all the other code for letting users set timeouts and just made it context.Background().
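
In other words, the change boils down to something like the following. This is a hedged before/after sketch with made-up names, not the actual diff: instead of bounding the containerd task delete with a hard-coded timeout, let it take as long as it needs.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// deleteTask stands in for the containerd task delete; it pretends the
// host is loaded and cleanup takes two seconds.
func deleteTask(ctx context.Context) error {
	select {
	case <-time.After(2 * time.Second):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	// Before: a bounded context, which can expire under load and leak
	// the task, shim process, and containerd container.
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	fmt.Println("with timeout:", deleteTask(ctx))
	cancel()

	// After: wait as long as the delete takes, so cleanup always finishes.
	fmt.Println("background: ", deleteTask(context.Background()))
}
```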

tych0 · May 30 '24 14:05

Simple and straightforward, LGTM! @tych0 please update the PR description and commit message to reflect the updated PR content.

corhere · May 30 '24 19:05

Sure, the commit message is updated, but I can copy-paste that to the PR description as well.

tych0 · May 30 '24 20:05

The shortlog for commit 345eb82cf6c3ebd3f0f448441235c208ee5cd5fa still reads "container: add a timeout for deletion".

corhere · May 30 '24 20:05

oh, yes, it sure does!

tych0 · May 30 '24 20:05