runc
notify_socket.go: use sd_notify_barrier mechanism
Add synchronization to how runc reports container readiness to systemd.
It was previously possible for runc to terminate immediately after sending the READY signal. This was sometimes too quick for systemd to determine the sender's cgroup, and as a consequence systemd would ignore the READY signal.
Fixes: #3293
Please squash commits
Overall this looks good to me. Is there any way to have a test for this @indyjo ?
What I can imagine is to write a test against a mocked NOTIFY_SOCKET server, i.e. a mocked systemd. Concerning how to do it, I think I need some guidance. I could try adding a `notify_socket_test.go` alongside `notify_socket.go`. I also noticed there are "bats tests" in `tests/integration`, but I'm completely unfamiliar with that framework. What do you think would be a good approach?
Also, my code currently assumes that whoever listens on the other end of NOTIFY_SOCKET actually understands and honors the `BARRIER=1` protocol. This is certainly the case with recent systemd versions (the feature seems to be about two years old). Other implementations might not: for example, sdnotify-proxy is too simple. Currently, this would lead to runc running into the 5 second timeout and failing to start a container.
Is this behavior a) acceptable, or should I b) ignore the timeout case, or should there be c) a command-line option to suppress the sd_notify_barrier feature?
@kolyshkin @AkihiroSuda
I have added a test (`notify_socket_test.go`). The question of how to deal with sd_notify_barrier timeouts is still open. crun waits for 30 seconds and then ignores the error (see PR). systemd-notify waits for 5 seconds and then fails (source).
@indyjo thanks for the update! Can you please
- reverse the order of commits (first remove bytes.Buffer usage, then add your changes)
- add Signed-off-by to both commits
> reverse the order of commits (first remove bytes.Buffer usage, then add your changes)
I'm asking for this because it makes the git log -p smaller (and, consequently, it will be easier to figure out what was changed, especially a few years from now).
Thanks for clarifying, @kolyshkin! The two commits are now in the preferred order.
Should we discuss how to handle the case where the supervising process doesn't (correctly) handle the sd_notify_barrier mechanism and runc fails with a timeout? I'm leaning towards simply doing what crun does, although the 30s timeout seems extreme.
> I'm leaning towards simply doing what crun does

I tend to agree.

> although the 30s timeout seems extreme.
I guess on an overloaded system 5s might not be enough, so the timeout should be way above the reasonable expected reply time.
But let's ask @giuseppe -- why you ended up with 30s timeout?
> But let's ask @giuseppe -- why you ended up with 30s timeout?
I've copied the timeout you set here: https://github.com/opencontainers/runc/blob/master/libcontainer/cgroups/systemd/common.go#L345 :smile:
Although I've decided to ignore any error, so it is equivalent to the version without sd_notify_barrier, where no error would be raised; also, I'm not sure either what a correct value for the timeout would be.
Committed the proposed solution (as a separate commit for your convenience, @kolyshkin). Wait 30s and warn but ignore on timeout. Just tell me and I'll merge the two commits into one.
LGTM @indyjo; please squash the last two commits and let's get it merged (after 1.1.0 I guess)
Here you go, @kolyshkin, and happy new year!
@AkihiroSuda PTAL
@opencontainers/runc-maintainers PTAL
@opencontainers/runc-maintainers @AkihiroSuda PTAL
Does it really make sense to implement the barrier mechanism inside of runc?
- This will fail if systemd is too old (I guess this does not matter, because it only triggers a warning?)
- For non-detached containers, if systemd-notify or any other binary that uses sd_notify_barrier is used inside of the container, it will fail, because the pipe and the barrier message are not relayed to systemd, leading to a timeout in sd_notify_barrier.
- For detached containers, sd_notify_barrier may or may not time out, depending on a race. If I don't miss anything, runc terminates after it has received the ready message (and relayed it to systemd). After the patch it will also do the barrier synchronization. A process using sd_notify_barrier will most likely work with this PR if it is able to send the barrier message with its own pipe fd before runc terminates. I think the pipe handles are closed when runc terminates (or on read). But if runc terminates before it has read the BARRIER message and the pipe fd, sd_notify_barrier will time out.
My implementation idea would be to also relay SCM_RIGHTS data and let the client and systemd handle the synchronization transparently. I don't have a real idea how to implement this for detached containers, because runc would then have to keep running until the barrier pipe has been closed by systemd. runc can only know this if it intercepts the barrier message and injects its own pipe. But when can it terminate if no barrier message is sent by the client? Maybe after the timeout?
Hi @MofX, let me give my point of view. And sorry about it taking so long.
First of all, whether this patch ends up in an official release or not is not absolutely critical for me. The company I work for, KUKA, maintains its own patched version of runc. We have had good experiences so far, but of course we only test a very limited scenario.
> This will fail if systemd is too old (I guess this does not matter, because it only triggers a warning?)
Probably true. The protocol basically mandates the use of the barrier, though. There is no way for a client to query whether it's supported.
Concerning your point 2 and 3:
I don't know enough about runc to understand the implications of detached and non-detached containers. It is true, however, that systemd-notify inside a container can still fail. That's because this patch doesn't address the barrier mechanism between runc and the container, but between runc and its supervising process.
Actually, we never observed any timeout happening in systemd-notify. The way systemd-notify may fail is that sometimes runc has already closed the connection directly after receiving the READY state, causing a "Connection refused" when systemd-notify tries to establish the barrier. We usually disable the barrier mechanism for this reason when running systemd-notify in a container, using the `--no-block` option. That's a separate issue, though.
@kolyshkin May I ask if there's a chance of including this in v1.1.2?
> @kolyshkin May I ask if there's a chance of including this in v1.1.2?
The target milestone is set for 1.2.0 (which I hope will be released some time in the summer). We can definitely do a backport to 1.1.x once this is merged, and so it will also be included into latest 1.1.x (whatever that would be).
We need more reviews on this, @opencontainers/runc-maintainers PTAL 🙏🏻
@indyjo can you please rebase it (this will also re-run the CI on golang 1.18)?
> @indyjo can you please rebase it (this will also re-run the CI on golang 1.18)?
@kolyshkin Technically, it's done. There's a failing check though: `ci / centos-7`.
Friendly reminder that there's still a chance for this baby to be merged before its first birthday! 🎂 Is there anything I can help with? @kolyshkin @thaJeztah @cyphar @AkihiroSuda
The last rebase was 6 months ago, could you rebase again to ensure CI is still green? Thanks
Rebased and green. @AkihiroSuda @kolyshkin
@kolyshkin Just a friendly reminder