
Cleanup from staged deployments

Open dbnicholson opened this issue 2 years ago • 9 comments

I finally got around to changing our updater to use staged deployments and one thing we lose is pruning of the rollback deployment. Since the ref isn't removed until the new deployment is finalized, the objects are still on disk until some later process prunes. Our updater runs a full cleanup after staging, so the old rollback deployment would get pruned when a new update comes in. However, that may not happen for a long time and it effectively means that you always have 3 deployments on disk.

We could solve this downstream in Endless, but fixing this is something that any user of ostree staged deployments could benefit from. My idea is to have a sysroot autocleanup mode that only runs the cleanup if a known file exists and then deletes it when the cleanup completes.

For example, /sysroot/.cleanup is written out when the deployment is finalized. Add an API (or a cmdprivate) that runs ostree_sysroot_cleanup only when /sysroot/.cleanup exists and deletes it when done. Call this from ostree admin cleanup --auto. Add a systemd unit like ostree-sysroot-auto-cleanup.service with ConditionPathExists=/sysroot/.cleanup and ExecStart=/usr/bin/ostree admin cleanup --auto.
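For illustration, that proposed unit might look something like this (the --auto flag, stamp path, and unit name are all taken from the proposal above, not anything that exists in ostree today):

[Unit]
Description=OSTree automatic cleanup after a finalized deployment
ConditionPathExists=/sysroot/.cleanup

[Service]
Type=oneshot
ExecStart=/usr/bin/ostree admin cleanup --auto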

WDYT?

dbnicholson avatar Jan 06 '22 16:01 dbnicholson

Hmm. I am not opposed to this, but so far the vision for ostree had been more that of a library. That said, this has come up a few times, and it would be really nice to have more shared daemon code. I've had this offhand thought that we could try to start that daemon code in Rust in ostree-rs-ext?

OTOH, we could also ship some services like this as a build-time (off by default?) option?

The problem I see here is that if we suddenly start shipping more on-by-default systemd services, we could be interfering with user code.

cgwalters avatar Jan 06 '22 22:01 cgwalters

Right, there are definitely some competing interests here.

  • Any updater daemon may want to handle this kind of cleanup itself, and then the systemd service is going to get in the way.

  • The logic to encode this state really belongs in the staging finalization. When we were discussing this for eos-updater, @wjt suggested the stamp file has to be in /etc so you don't get into a situation where eos-updater created the stamp file, you have an unclean shutdown, and then the cleanup mechanism triggers on the next boot even though the deployment was never actually written out. Having ostree write the stamp after the call to write out the deployment means it can only exist when a staged deployment has been finalized but not yet pruned.

  • I had trouble figuring out where to actually put this in eos-updater. Startup? A random idle time? Having an independent service that handles this during boot seemed nice even though it might lock the repo and block eos-updater.

I'm just about done with a PR to implement this, but I think with just the mechanism in place any downstream can decide how to handle it. I.e., if /sysroot/.cleanup (or whatever) is written out during finalization and ostree_sysroot_auto_cleanup exists, then it's trivial to call it. That could be from a daemon or a script in a systemd service or whatever.
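To make that concrete, here's a minimal caller sketch in C, assuming the proposed ostree_sysroot_auto_cleanup() mirrors the existing ostree_sysroot_cleanup() signature (the auto-cleanup function is the proposal here, not an existing libostree API):

#include <ostree.h>

static gboolean
maybe_cleanup (GCancellable *cancellable, GError **error)
{
  g_autoptr(OstreeSysroot) sysroot = ostree_sysroot_new_default ();

  if (!ostree_sysroot_load (sysroot, cancellable, error))
    return FALSE;

  /* Proposed API: only runs the cleanup (and prune) if the stamp file,
   * e.g. /sysroot/.cleanup, exists, and deletes the stamp when done. */
  return ostree_sysroot_auto_cleanup (sysroot, cancellable, error);
}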

dbnicholson avatar Jan 06 '22 22:01 dbnicholson

One thing that has come up in the past too is that in some cases in ostree core, we may want a more generic post-update service which could handle anything that we needed to defer to the next boot. The way I've been thinking of this is that it'd actually run in the initramfs, before or after the pivot root. This would allow us to perform "fixups", which could include cleanup.

Having this in the initramfs would avoid any race conditions with update services in the real root.

But, OTOH, I think we want to get out of the initramfs as fast as possible, and this has the potential to block for a while.

I suspect in your case users would prefer to get a usable desktop as fast as possible after an update, with a GC operation running in the background, rather than block bootup. Which is related to this concern:

Having an independent service that handles this during boot seemed nice even though it might lock the repo and block eos-updater.

I hadn't considered this issue much; for the most part the space leakage from the rollback deployment hasn't been a problem in our cases.

Short term, though, given the above issues it seems like it makes the most sense to have higher-level code (eos-updater in this case) own this problem?

cgwalters avatar Jan 06 '22 23:01 cgwalters

I had trouble figuring out where to actually put this in eos-updater. Startup? A random idle time?

I think that's just it though - there's a clear need for configurability and control here. And, likely, the ability to cancel it. That bit relates to e.g. https://github.com/coreos/rpm-ostree/issues/2969 - rpm-ostree internally has this concept of a single "transaction" operation that can operate on the repo at a time, but it's cancellable. So for rpm-ostree it'd be a much better fit to do this kind of thing internally, because then it's more cleanly cancellable. (That said, we could of course systemctl stop it too.)

It may also help for the desktop use case to make this an explicit "background" operation, i.e. ionice etc. (Though IME the ostree case can be filesystem-metadata (i.e. journal) heavy, which causes contention with other users even though we're niced. xref https://github.com/openshift/machine-config-operator/issues/1897 where I did a ton of investigation into trying to do "background" updates on the openshift control plane nodes, which run etcd, which really wants all the I/O it can get.)
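For the "background" angle, systemd already exposes the relevant knobs, so a dedicated cleanup unit (or a drop-in for one) could carry something like the following. This is purely illustrative, and as noted it doesn't help with journal/metadata contention:

[Service]
Nice=19
IOSchedulingClass=idle
CPUSchedulingPolicy=idle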

cgwalters avatar Jan 06 '22 23:01 cgwalters

Fair points. I'll post my PR as a proof of concept, but I'll carry on handling this downstream. One thing I realized is that I can entirely emulate this now by adding a drop-in for ostree-finalize-staged.service that just has:

[Service]
# Appended after the unit's own ExecStop (ostree admin finalize-staged)
ExecStop=-/bin/touch /sysroot/.cleanup

That would run only after ostree admin finalize-staged succeeded. And then we can just add our own systemd unit that runs ostree admin cleanup with ConditionPathExists=/sysroot/.cleanup.
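That companion unit could be as small as the following sketch (the unit description and the stamp-removal step are just illustrative; removing the stamp mirrors the "delete it when the cleanup completes" part of the original proposal):

[Unit]
Description=Prune the OSTree repo after a finalized deployment
ConditionPathExists=/sysroot/.cleanup

[Service]
Type=oneshot
ExecStart=/usr/bin/ostree admin cleanup
ExecStartPost=/bin/rm -f /sysroot/.cleanup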

dbnicholson avatar Jan 06 '22 23:01 dbnicholson

My POC is in #2511. Let me know what you think.

dbnicholson avatar Jan 07 '22 17:01 dbnicholson

I'm not intimately familiar with eos-updater flow, so please bear with me if my questions below are imprecise.

I finally got around to changing our updater to use staged deployments and one thing we lose is pruning of the rollback deployment. Since the ref isn't removed until the new deployment is finalized, the objects are still on disk until some later process prunes. Our updater runs a full cleanup after staging, so the old rollback deployment would get pruned when a new update comes in. However, that may not happen for a long time and it effectively means that you always have 3 deployments on disk.

I'm not fully understanding what happens here and what your concerns are.

The 3 deployments should only be present between the time an update is received/staged and the corresponding reboot/finalization, correct? Before an update is staged, I'd only expect 1 (or 2, if a rollback already exists) to be present, right? After the reboot/finalization, the 3 deployments should rotate back to 2, I think?

If so, that sounds similar to how rpm-ostree works, which allows you to either 1) roll back to the previous deployment, or 2) finalize the pending one. Or did I misunderstand that? What problems does this bring in your context? Which deployment would you want to see disappear, at which point and under which conditions? Would you maybe prefer dropping the rollback one as soon as the new update is staged? Is that closer to your current updater flow (prior to the new staged logic)?

lucab avatar Jan 10 '22 11:01 lucab

Currently the old deployment is deleted and the repo is pruned right after the new deployment is written out with simple_write_deployment. With staged deployments, that all happens at shutdown, with the design decision not to prune the repo since that could block shutdown for a long time.

I agree with that decision, but it means the repo is still holding the objects from that old deployment. Effectively you have 3 commits on disk even though only 2 are actual deployments. You can try this now if you already have 2 deployments. Try pruning the repo immediately after booting into a new deployment and you'll find there are objects pruned even if you haven't pulled anything.

Many of our users are on much lower-spec hardware, so they might not have piles of disk space to waste. Furthermore, many of our users may not actually upgrade that often, so that old, old commit may be quite different and have a significant number of objects that would be pruned. So for Endless I'd consider it a regression to leave dangling objects from an old OS commit around indefinitely.

dbnicholson avatar Jan 10 '22 13:01 dbnicholson

Hmm, here's another half-baked idea: at staging time, we also perform a mock prune where we gather the list of files to be deleted and store it somewhere under /run (as part of the -staged object?). Then at finalization time, we use the list to know what to delete so that we don't have to incur another reachability crawl.

We would need to handle invalidation of the list correctly (e.g. include a state SHA of all refs or something) but it should make the operation much less I/O intensive (though obviously there's still some base cost in deleting files).
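As a rough sketch of the mock-prune idea using existing libostree primitives (the list path, output format, and refs-only traversal are illustrative; a real version would also walk the deployment commits, including the staged one, and handle the invalidation described above):

#include <ostree.h>

/* Sketch: compute the set of loose objects unreachable from any ref tip
 * (roughly what ostree prune --refs-only --depth=0 would delete) and dump
 * them to a newline-separated list for later use at finalization time. */
static gboolean
write_mock_prune_list (OstreeRepo    *repo,
                       const char    *list_path,
                       GCancellable  *cancellable,
                       GError       **error)
{
  g_autoptr(GHashTable) all_objects = NULL;  /* object-name variant -> metadata */
  g_autoptr(GHashTable) refs = NULL;         /* refspec -> checksum */
  g_autoptr(GHashTable) reachable =
    g_hash_table_new_full (ostree_hash_object_name, g_variant_equal,
                           (GDestroyNotify)g_variant_unref, NULL);
  g_autoptr(GString) out = g_string_new (NULL);
  GHashTableIter it;
  gpointer key, value;

  if (!ostree_repo_list_objects (repo, OSTREE_REPO_LIST_OBJECTS_LOOSE,
                                 &all_objects, cancellable, error))
    return FALSE;
  if (!ostree_repo_list_refs (repo, NULL, &refs, cancellable, error))
    return FALSE;

  /* Union of everything reachable from the ref tips (depth 0, no parents). */
  g_hash_table_iter_init (&it, refs);
  while (g_hash_table_iter_next (&it, &key, &value))
    {
      g_autoptr(GHashTable) commit_reachable = NULL;
      GHashTableIter cit;
      gpointer ckey;

      if (!ostree_repo_traverse_commit (repo, (const char *)value, 0,
                                        &commit_reachable, cancellable, error))
        return FALSE;

      g_hash_table_iter_init (&cit, commit_reachable);
      while (g_hash_table_iter_next (&cit, &ckey, NULL))
        g_hash_table_add (reachable, g_variant_ref ((GVariant *)ckey));
    }

  /* Anything loose but unreachable is what a prune would delete today. */
  g_hash_table_iter_init (&it, all_objects);
  while (g_hash_table_iter_next (&it, &key, NULL))
    {
      const char *checksum;
      OstreeObjectType objtype;
      g_autofree char *objstr = NULL;

      if (g_hash_table_contains (reachable, key))
        continue;

      ostree_object_name_deserialize ((GVariant *)key, &checksum, &objtype);
      objstr = ostree_object_to_string (checksum, objtype);
      g_string_append_printf (out, "%s\n", objstr);
    }

  return g_file_set_contents (list_path, out->str, -1, error);
}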

jlebon avatar Jan 17 '22 15:01 jlebon