
[podman] conmon should restart dead child processes

Open tobwen opened this issue 3 years ago • 10 comments

What's the issue?

When slirp4netns is killed, a pod or container keeps running without any warning, but without networking.

How to reproduce?

podman pod create --name systemd-pod
podman create --pod systemd-pod alpine top
podman create --pod systemd-pod alpine top
podman pod start systemd-pod
pkill -U tobwen 'slirp4netns'

What's expected?

  1. conmon (or podman) should take care of the child processes and restart them if they crash or die.
  2. A notification in the logs would also be nice.

What's the environment?

podman version 3.3.0-dev
conmon version 2.0.30-dev

tobwen avatar Jun 15 '21 14:06 tobwen

@mheon @giuseppe is it even possible to restart slirp4netns in this case? I would imagine there'd be some runtime state that would be lost. I would expect slirp4netns dying to kill the container tbh

haircommander avatar Jun 15 '21 15:06 haircommander

I think you'd lose active connections, but you'd lose those on the container going down too. I don't really think you can do the restart straight from Conmon, though. We really need access to the full container definition from the Podman DB to proceed.

mheon avatar Jun 15 '21 16:06 mheon

I think you'd lose active connections, but you'd lose those on the container going down too

Sure, but without a heartbeat or some other check, a user wouldn't be informed about this. Can't we get a log entry at least?

tobwen avatar Jun 15 '21 16:06 tobwen

Log entry is definitely viable. Container being killed is also viable. We could probably do a slirp restart, but it'd require a fair bit of hacking - we'd need to be able to pass in a command for Conmon to run on slirp exit that is separate from the existing exit command.

mheon avatar Jun 15 '21 17:06 mheon

With a normal systemd setup, a gracefully killed container would be restarted, and so would slirp. Sounds good :-)

tobwen avatar Jun 15 '21 17:06 tobwen

@mheon @giuseppe is it even possible to restart slirp4netns in this case? I would imagine there'd be some runtime state that would be lost. I would expect slirp4netns dying to kill the container tbh

I don't think conmon should know about slirp4netns.

IMO, slirp4netns should be seen as infrastructure for the container. Killing slirp4netns is equivalent to dropping the iptables rules for root containers or killing fuse-overlayfs when it is used for rootless.

giuseppe avatar Jun 17 '21 07:06 giuseppe

Oops, I forgot to add fuse-overlayfs in my post.

Killing it was an edge case, of course. I just wanted to simulate: what happens if slirp4netns or fuse-overlayfs crashes by itself? Will the container heal itself, will there be logs, etc.?

So it's even fine if the container gets stopped (or restarted). But an entry in the logs would be nice, so the admin could react.

tobwen avatar Jun 17 '21 08:06 tobwen

we could move slirp4netns to a separate cgroup (or at least make it configurable) so that systemd could report the failure. I'd not worry about fuse-overlayfs since we are moving to use the native overlay support for rootless as well.

giuseppe avatar Jun 17 '21 11:06 giuseppe
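
For context, here is a rough C sketch of the cgroup mechanics behind that idea. It is only an illustration with a made-up path, not how podman actually wires this up; the real implementation would more likely create a transient systemd scope, whose failure systemd could then report. On cgroup v2, moving an already-running helper such as slirp4netns into its own group comes down to writing its PID into that group's cgroup.procs file.

```c
/*
 * Hypothetical illustration only (not podman/conmon code): migrate an
 * already-running helper process (e.g. slirp4netns) into a dedicated
 * cgroup v2 directory by writing its PID to cgroup.procs. The path is
 * made up for the example; running this needs root or a delegated
 * cgroup subtree.
 */
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <pid-of-helper>\n", argv[0]);
        return 1;
    }

    const char *dir = "/sys/fs/cgroup/slirp4netns-demo"; /* example path */

    /* Create the cgroup directory; ignore "already exists". */
    if (mkdir(dir, 0755) < 0 && errno != EEXIST) {
        perror("mkdir");
        return 1;
    }

    char procs_path[256];
    snprintf(procs_path, sizeof(procs_path), "%s/cgroup.procs", dir);

    FILE *f = fopen(procs_path, "w");
    if (!f) {
        perror("fopen");
        return 1;
    }

    /* Writing a PID to cgroup.procs migrates that process. */
    fprintf(f, "%s\n", argv[1]);
    if (fclose(f) != 0) {
        perror("fclose");
        return 1;
    }

    printf("moved PID %s into %s\n", argv[1], dir);
    return 0;
}
```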

I would love to see conmon kill the container if slirp4netns and/or fuse-overlayfs exited, and exit with an error state. Then it would be up to podman or systemd to decide whether the pod/container should restart.

Could we potentially do this by passing pidfds to conmon and having conmon wait on those pids? If they exit, conmon throws an error.

rhatdan avatar Jun 18 '21 14:06 rhatdan
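
A minimal sketch of the pidfd idea, assuming Linux 5.3+ (my illustration, not actual conmon code): conmon would receive a pidfd for a helper such as slirp4netns and poll it alongside the file descriptors it already watches. The pidfd becomes readable once the process exits, at which point conmon could log the event, kill the container, and exit with an error state.

```c
/*
 * Hypothetical sketch, not conmon's real main loop: open a pidfd for a
 * helper process and poll it; POLLIN on a pidfd means the process has
 * exited. Requires Linux 5.3+ for pidfd_open(2).
 */
#define _GNU_SOURCE
#include <poll.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper; older glibc has no pidfd_open() wrapper. */
static int open_pidfd(pid_t pid)
{
    return (int) syscall(SYS_pidfd_open, pid, 0);
}

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <pid-of-helper>\n", argv[0]);
        return 1;
    }

    pid_t helper = (pid_t) atoi(argv[1]);
    int fd = open_pidfd(helper);
    if (fd < 0) {
        perror("pidfd_open");
        return 1;
    }

    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    /* Block until the helper exits; a real event loop would poll this
     * together with the container's other file descriptors. */
    if (poll(&pfd, 1, -1) < 0) {
        perror("poll");
        return 1;
    }

    if (pfd.revents & POLLIN) {
        /* Here conmon could kill the container and exit non-zero. */
        fprintf(stderr, "helper process %d exited\n", (int) helper);
    }

    close(fd);
    return 0;
}
```

On kernels without pidfd_open(), conmon would need some fallback, e.g. spawning the helper as its own child and reaping it with waitpid().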

I like the pidfd idea a lot.

mheon avatar Jun 18 '21 14:06 mheon