
notify_socket.go: use sd_notify_barrier mechanism

Open indyjo opened this pull request 2 years ago • 21 comments

Add synchronization to how runc reports container readiness to systemd.

It was previously possible for runc to terminate immediately after sending the READY signal. This was sometimes too quick for systemd to find out the sender's cgroup. As a consequence, systemd would ignore the READY signal.

Fixes: #3293
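
For context, the handshake itself is small: the notifying process creates a pipe, sends a BARRIER=1 datagram with the pipe's write end attached via SCM_RIGHTS, and then waits for POLLHUP on the read end, which fires once systemd has processed the message and closed its copy of the fd. Below is a minimal Go sketch of that client side (illustrative only, not the exact code in this PR; notifyBarrier is a hypothetical helper):

```go
package main

import (
	"fmt"
	"net"
	"time"

	"golang.org/x/sys/unix"
)

// notifyBarrier performs the sd_notify_barrier handshake over an
// already-connected NOTIFY_SOCKET connection. Sketch only.
func notifyBarrier(conn *net.UnixConn, timeout time.Duration) error {
	// Create a pipe; we keep the read end and hand the write end to systemd.
	var fds [2]int
	if err := unix.Pipe2(fds[:], unix.O_CLOEXEC); err != nil {
		return err
	}
	defer unix.Close(fds[0])

	// Send BARRIER=1 with the pipe's write end as SCM_RIGHTS ancillary data.
	oob := unix.UnixRights(fds[1])
	if _, _, err := conn.WriteMsgUnix([]byte("BARRIER=1"), oob, nil); err != nil {
		unix.Close(fds[1])
		return err
	}
	// After this close, systemd holds the only remaining write end.
	unix.Close(fds[1])

	// Wait for POLLHUP on the read end: it is reported even with no
	// requested events, once systemd closes its copy of the write end.
	pfd := []unix.PollFd{{Fd: int32(fds[0]), Events: 0}}
	n, err := unix.Poll(pfd, int(timeout.Milliseconds()))
	if err != nil {
		return err
	}
	if n == 0 {
		return fmt.Errorf("timed out waiting for sd_notify barrier")
	}
	return nil
}
```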

indyjo avatar Nov 22 '21 09:11 indyjo

Please squash commits

AkihiroSuda avatar Nov 23 '21 17:11 AkihiroSuda

Overall this looks good to me. Is there any way to have a test for this @indyjo ?

kolyshkin avatar Nov 24 '21 19:11 kolyshkin

What I can imagine is to write a test against a mocked NOTIFY_SOCKET server, i.e. a mocked systemd. Concerning how to do it I think I need some guidance. I could try adding a notify_socket_test.go alongside notify_socket.go. I also noticed there are "bats tests" in tests/integration, but I'm completely unfamiliar with that framework. What do you think would be a good approach?
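
For example, the mocked systemd side could look roughly like this (a sketch under my assumptions; startMockNotifySocket is a hypothetical name):

```go
package main

import (
	"bytes"
	"net"
	"testing"

	"golang.org/x/sys/unix"
)

// startMockNotifySocket plays the systemd side of the protocol for a
// test: it listens on a unixgram socket at path and, whenever a datagram
// contains BARRIER=1, closes every fd received via SCM_RIGHTS, which
// unblocks a sender polling for POLLHUP on its read end.
func startMockNotifySocket(t *testing.T, path string) *net.UnixConn {
	conn, err := net.ListenUnixgram("unixgram", &net.UnixAddr{Name: path, Net: "unixgram"})
	if err != nil {
		t.Fatal(err)
	}
	go func() {
		buf := make([]byte, 4096)
		oob := make([]byte, 4096)
		for {
			n, oobn, _, _, err := conn.ReadMsgUnix(buf, oob)
			if err != nil {
				return // socket closed by the test
			}
			if !bytes.Contains(buf[:n], []byte("BARRIER=1")) {
				continue
			}
			msgs, _ := unix.ParseSocketControlMessage(oob[:oobn])
			for _, m := range msgs {
				if fds, err := unix.ParseUnixRights(&m); err == nil {
					for _, fd := range fds {
						unix.Close(fd)
					}
				}
			}
		}
	}()
	return conn
}
```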

Also, my code currently assumes that whoever listens on the other end of NOTIFY_SOCKET actually understands and honors the BARRIER=1 protocol. This is certainly the case with recent systemd versions (the feature is about two years old). Other implementations might not; for example, sdnotify-proxy is too simple. Currently, this would lead to runc running into the 5-second timeout and failing to start the container.

Is this behavior a) acceptable, or should I b) ignore the timeout case, or should there be c) a command-line option to suppress the sd_notify_barrier feature?

@kolyshkin @AkihiroSuda

indyjo avatar Nov 25 '21 13:11 indyjo

I have added a test (notify_socket_test.go). The question of how to deal with sd_notify_barrier timeouts is still open. crun waits for 30 seconds and then ignores the error (see PR). systemd-notify waits for 5 seconds and then fails (source).

indyjo avatar Nov 26 '21 14:11 indyjo

@indyjo thanks for the update! Can you please

  • reverse the order of commits (first remove bytes.Buffer usage, then add your changes)
  • add Signed-off-by to both commits

kolyshkin avatar Dec 09 '21 09:12 kolyshkin

  • reverse the order of commits (first remove bytes.Buffer usage, then add your changes)

I'm asking for this because it makes the git log -p smaller (and, consequently, it will be easier to figure out what was changed, especially a few years from now).

kolyshkin avatar Dec 09 '21 09:12 kolyshkin

Thanks for clarifying, @kolyshkin! The two commits are now in the preferred order.

Should we discuss how to handle the case where the supervising process doesn't (correctly) handle the sd_notify_barrier mechanism and runc fails with a timeout? I'm leaning towards simply doing what crun does, although the 30s timeout seems extreme.

indyjo avatar Dec 10 '21 15:12 indyjo

I'm leaning towards simply doing what crun does

I tend to agree.

although the 30s timeout seems extreme.

I guess on an overloaded system 5s might not be enough, so the timeout should be way above the reasonable expected reply time.

But let's ask @giuseppe -- why you ended up with 30s timeout?

kolyshkin avatar Dec 10 '21 17:12 kolyshkin

But let's ask @giuseppe -- why you ended up with 30s timeout?

I've copied the timeout you set here: https://github.com/opencontainers/runc/blob/master/libcontainer/cgroups/systemd/common.go#L345 :smile:

I decided to ignore any error, though, so the behavior is equivalent to the version without sd_notify_barrier, which would not raise an error either. I am also not sure what a correct value for the timeout would be.

giuseppe avatar Dec 10 '21 18:12 giuseppe

Committed the proposed solution (as a separate commit for your convenience, @kolyshkin): wait 30 seconds, then warn but ignore on timeout. Just tell me and I'll merge the two commits into one.
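
For illustration, the policy amounts to something like this sketch (reusing the hypothetical notifyBarrier helper from the description; not the literal diff):

```go
package main

import (
	"net"
	"time"

	"github.com/sirupsen/logrus"
)

// awaitBarrierBestEffort applies the chosen policy in sketch form: try
// the barrier for 30 seconds, then warn and continue on failure, so that
// NOTIFY_SOCKET servers that never close the barrier fd (i.e. don't
// implement BARRIER=1) don't block container startup.
func awaitBarrierBestEffort(conn *net.UnixConn) {
	// notifyBarrier is the hypothetical helper sketched in the PR
	// description, not an existing runc function.
	if err := notifyBarrier(conn, 30*time.Second); err != nil {
		// Readiness was already reported via READY=1; a missing
		// barrier reply is therefore non-fatal.
		logrus.Warnf("failed to await sd_notify barrier: %v", err)
	}
}
```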

indyjo avatar Dec 13 '21 11:12 indyjo

LGTM @indyjo; please squash the last two commits and let's get it merged (after 1.1.0 I guess)

kolyshkin avatar Jan 11 '22 01:01 kolyshkin

Here you go, @kolyshkin, and happy new year!

indyjo avatar Jan 11 '22 07:01 indyjo

@AkihiroSuda PTAL

kolyshkin avatar Jan 25 '22 07:01 kolyshkin

@opencontainers/runc-maintainers PTAL

kolyshkin avatar Feb 16 '22 01:02 kolyshkin

@opencontainers/runc-maintainers @AkihiroSuda PTAL

kolyshkin avatar Mar 08 '22 01:03 kolyshkin

Does it really make sense to implement the barrier mechanism inside of runc?

  1. This will fail if systemd is too old (I guess this doesn't matter, because it only triggers a warning?)
  2. For non-detached containers: if systemd-notify, or any other binary that uses sd_notify_barrier, is used inside the container, it will fail, because the pipe and the barrier message are not relayed to systemd, leading to a timeout in sd_notify_barrier.
  3. For detached containers: sd_notify_barrier may or may not time out, depending on a race. If I'm not missing anything, runc terminates after it has received the ready message (and relayed it to systemd). After this patch it will also do the barrier synchronization. A process using sd_notify_barrier will most likely work with this PR if it manages to send the barrier message with its own pipe fd before runc terminates. I think the pipe handles are closed when runc terminates (or on read). But if runc terminates before it has read the BARRIER message and the pipe fd, sd_notify_barrier will time out.

My implementation idea would be to also relay SCM_RIGHTS data and let the client and systemd handle the synchronization transparently. I don't have a concrete idea how to implement this for detached containers, because runc would then have to keep running until the barrier pipe is closed by systemd, and runc can only know this if it intercepts the barrier message and injects its own pipe. But when can it terminate if no barrier message is sent by the client? Maybe after the timeout?
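
Roughly, the relay step I have in mind would look like this (sketch only; relayDatagram is a hypothetical helper, and error and fd-lifetime handling are elided):

```go
package main

import "net"

// relayDatagram forwards one datagram, including SCM_RIGHTS ancillary
// data, from the container-facing socket to the real NOTIFY_SOCKET.
// Receiving SCM_RIGHTS installs the fds into this process, and the
// control message in oob then refers to those local fds, so re-sending
// the same bytes passes them on to systemd.
func relayDatagram(src, dst *net.UnixConn) error {
	buf := make([]byte, 4096)
	oob := make([]byte, 4096)
	n, oobn, _, _, err := src.ReadMsgUnix(buf, oob)
	if err != nil {
		return err
	}
	// A real proxy would also close the received fds after forwarding,
	// to avoid leaking them for the lifetime of the process.
	_, _, err = dst.WriteMsgUnix(buf[:n], oob[:oobn], nil)
	return err
}
```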

MofX avatar Mar 14 '22 13:03 MofX

Hi @MofX, let me give my point of view. And sorry about it taking so long.

First of all, whether this patch ends up in an official release or not is not absolutely critical for me. The company I work for, KUKA, maintains its own patched version of runc. Our experience so far has been good, but of course we only test a very limited scenario.

  1. This will fail if systemd is too old (I guess this does not matter, because it only triggers a warning?)

Probably true. The protocol basically mandates the use of the barrier, though. There is no way for a client to query whether it's supported.

Concerning your point 2 and 3:

I don't know enough about runc to understand the implications of detached versus non-detached containers. It is true, however, that systemd-notify inside a container can still fail: this patch doesn't address the barrier mechanism between runc and the container, only the one between runc and its supervising process.

Actually, we never observed a timeout in systemd-notify. The way systemd-notify can fail is that runc has sometimes already closed the connection right after receiving the READY state, causing a "Connection refused" when systemd-notify tries to establish the barrier. For this reason we usually disable the barrier mechanism, via the --no-block option, when running systemd-notify in a container. That's a separate issue, though.

indyjo avatar Apr 08 '22 12:04 indyjo

@kolyshkin May I ask if there's a chance of including this in v1.1.2?

indyjo avatar May 09 '22 08:05 indyjo

@kolyshkin May I ask if there's a chance of including this in v1.1.2?

The target milestone is set to 1.2.0 (which I hope will be released some time in the summer). We can definitely backport it to 1.1.x once this is merged, so it will also be included in the latest 1.1.x release (whatever that turns out to be).

We need more reviews on this, @opencontainers/runc-maintainers PTAL 🙏🏻

kolyshkin avatar May 09 '22 18:05 kolyshkin

@indyjo can you please rebase it (this will also re-run the CI on golang 1.18)?

kolyshkin avatar May 09 '22 18:05 kolyshkin

@indyjo can you please rebase it (this will also re-run the CI on golang 1.18)?

@kolyshkin Technically, it's done. There's a failing check though: ci / centos-7.

indyjo avatar May 12 '22 07:05 indyjo

Friendly reminder that there's still a chance for this baby to be merged before its first birthday! 🎂 Is there anything I can help with? @kolyshkin @thaJeztah @cyphar @AkihiroSuda

indyjo avatar Nov 12 '22 11:11 indyjo

The last rebase was 6 months ago, could you rebase again to ensure CI is still green? Thanks

AkihiroSuda avatar Nov 12 '22 12:11 AkihiroSuda

Rebased and green. @AkihiroSuda @kolyshkin

indyjo avatar Nov 14 '22 21:11 indyjo

@kolyshkin Just a friendly reminder

indyjo avatar Dec 08 '22 09:12 indyjo