rpm-ostree icon indicating copy to clipboard operation
rpm-ostree copied to clipboard

daemon: Add socket activation via /run/rpm-ostreed.socket

Open cgwalters opened this issue 2 years ago • 13 comments

This is a re-submit of https://github.com/coreos/rpm-ostree/pull/2932 daemon: Add socket activation via /run/rpm-ostreed.socket

For historical reasons, the daemon ends up doing a lot of initialization before even claiming the DBus name. For example, it calls ostree_sysroot_load(). We also end up scanning the RPM database, and actually parse all the GPG keys in /etc/pki/rpm-gpg etc.

This causes two related problems:

  • By doing all this work before claiming the bus name, we race against the (pretty low) DBus service timeout of 25s.
  • If something hard fails at initialization, the client can't easily see the error because it just appears in the systemd journal. The client will just see a service timeout.

The solution to this is to adopt systemd socket activation. There's a new rpm-ostreed.socket and the daemon can be activated that way.

The client (when run as root, the socket is only accessible to root right now) connects, which will activate the daemon and attempt initialization - without claiming the DBus name yet.

If something goes wrong here, the daemon will reply to the client that activated it with the error, and then also exit with failure.

On success, everything continues as normal, including claiming the DBus name.

Note that this also logically replaces the code that does explicit systemctl start rpm-ostreed invocations.

After this patch:

$ systemctl stop rpm-ostreed
$ umount /boot
$ rpm-ostree status
error: Couldn't start daemon: Error setting up sysroot: loading sysroot: Unexpected state: /run/ostree-booted found, but no /boot/loader directory

cgwalters avatar Jul 22 '22 18:07 cgwalters

This blocks on SELinux policy updates: https://bugzilla.redhat.com/show_bug.cgi?id=2110012

cgwalters avatar Jul 22 '22 18:07 cgwalters

OK, I now made this into a build-time option. (Though looking at it, I also need to make installation of the socket unit a build-time option)

This will allow us to ship this in F37 now (with a matching specfile conditional) and gain experience with it.

Leaving as draft though for now since this is a nontrivial change and could use some more design thought before merging.

cgwalters avatar Jul 27 '22 17:07 cgwalters

/test all

cgwalters avatar Jul 29 '22 13:07 cgwalters

Dang it, I did a whole port to https://crates.io/crates/uds before discovering https://github.com/tormol/uds/issues/6

For now...I switched back to using worker threads. That said...hmmm. What would probably make sense here is to factor out an async-sock-seqpacket crate (that internally uses worker threads, but actually exposes a tokio channel that takes serde-serializable types) ...we've now got very similar code in all 3 of:

  • https://github.com/coreos/bootupd/blob/ff9555afad3120eade3d7d4611eecd8391ae21fc/src/ipc.rs#L46
  • https://github.com/containers/containers-image-proxy-rs/blob/a4223a034895bf27423878f3af2ba22d50fc7b36/src/imageproxy.rs#L96
  • this project

I guess one can tell that I like SOCK_SEQPACKET?

cgwalters avatar Jul 29 '22 19:07 cgwalters

If we want to support a non-DBus API, one thing we'll likely need to do for now is bridge requests from the tokio thread to the glib/DBus daemon thread. We can do this by acquiring that worker thread's main context and then invoking https://docs.rs/glib/latest/glib/struct.MainContext.html#method.spawn

cgwalters avatar Jul 29 '22 19:07 cgwalters

OK C9S and RHEL9.1 should include the necessary selinux policy changes for this, so I think this can be considered for review and landing.

Looks like there's a CI failure in the Jenkins/vmcheck bits I need to dig into.

But right now more problematically the socket path isn't being tested by CI at all; need to add a f37 build or extend our c9s build to enable this feature and actually do tests too.

cgwalters avatar Aug 05 '22 12:08 cgwalters

@cgwalters: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-merge-robot avatar Aug 16 '22 23:08 openshift-merge-robot

I've been looking in the background at https://github.com/mitsuhiko/tokio-unix-ipc/ and have two issues open, though neither are blocking.

cgwalters avatar Sep 12 '22 13:09 cgwalters

We could also look at https://capnproto.org/ for the IPC protocol

cgwalters avatar Jul 31 '23 15:07 cgwalters

OK I made a few more changes here just to clean things up:

  • Squashed all the commits
  • Removed now-spurious whitespace changes leftover from the removed feature flag
  • Changed you to be the author, listed me as co-author

cgwalters avatar Aug 14 '23 21:08 cgwalters

The jenkins pipeline shows error: cleanup: Loading sysroot: Starting daemon via socket: Connecting: No such file or directory (os error 2). Does the rpm-ostreed.socket file in /etc/systemd/system need to be enabled?

RishabhSaini avatar Aug 16 '23 17:08 RishabhSaini

I think what's going on here is:

We include the .socket unit in the RPM, but it's not enabled by default because it's not in the preset list.

This has burned us before. In theory, what we'd need to do is add it to that list. However doing so will dramatically slow down our ability to ship this.

So what I'd propose is:

  • Change the client code to check if it's running as root, and if so systemctl start rpm-ostreed.socket explicitly
  • Continue to improve and land this PR
  • Once it lands, add a PR to fedora-release to enable the socket by default
  • Eventually, drop the client auto-starting

Hmm I think we may also need to add %systemd_post for our units.

cgwalters avatar Aug 16 '23 17:08 cgwalters

OK will pick this back up in the background; I made a few more tweaks. Let's see what CI says.

cgwalters avatar Aug 23 '23 15:08 cgwalters

Thanks for the help!

RishabhSaini avatar Aug 23 '23 15:08 RishabhSaini