controller-runtime icon indicating copy to clipboard operation
controller-runtime copied to clipboard

Controller Checkpointing

Open austince opened this issue 6 months ago • 12 comments

I'm thinking about checkpointing controller state to blob storage for some of our high-load controllers, where starting from the snapshot (even with leader elected) can lead to long startup times.

There seems to be a similar issue in https://github.com/kubernetes-sigs/controller-runtime/issues/728, but wasn't properly spec'd.

I'm thinking about this on a per-controller basis, optimistically storing:

  • Latest source watch bookmark
  • Current workqueue

On startup, optimistically load the previous state, configure the source with the last bookmark, and reload the workqueue.

The restrictions:

  • On the user to indicate when to invalidate checkpoints and start from snapshot (e.g., when controller logic changes)
  • Don't record the requeue times per item in the workqueues
  • Don't snapshot the informer state

In https://github.com/kubernetes-sigs/controller-runtime/issues/728#issuecomment-571352448, the suggestion was to use a separate Runnable. This works for the Start case (easy to wrap and load state), but not for the shutdown case, as we need to ensure the runnable is no longer running before checkpointing the state. Hooking into the Manager also doesn't seem to be an option as it exits Start(ctx) as soon as the context is cancelled, and also doesn't provide access to the underlying runnables.

Controllers do wait to return from Start(ctx) until they are fully finished https://github.com/kubernetes-sigs/controller-runtime/blob/5f5daf39228530d7c5ed8f54e9a80c0e4528c9f6/pkg/internal/controller/controller.go#L216 The least intrusive way I think we could expose this is via an opt-in extension to the Runnable interface, e.g. FinishableRunnable

// FinishableRunnable is a Runnable that can indicate when it has finished running.
type FinishableRunnable interface {
	Runnable
	Done() <-chan struct{}
}

In the manager, it could block until all such runnables are complete (or until an error is reported via errChan).

https://github.com/kubernetes-sigs/controller-runtime/blob/5af1f3ebd472b62a4a708ad3aa2d252489b91b27/pkg/manager/internal.go#L468-L469

Does anyone have any other less-intrusive ideas?

I'd be open to contributing the full checkpointing solution, but would like to see if this can be done in user-space first. I'd briefly talked with @alvaroaleman about this, though he's out for a few more months IIRC.

austince avatar Jun 09 '25 19:06 austince

Seems also related to https://github.com/kubernetes-sigs/controller-runtime/issues/3221

austince avatar Jun 09 '25 19:06 austince

I think this is an implementation detail of the source and should not leak into anything else. Lats time I checked, the factoring of the informers didn't make it possible to inject a custom store but I am assuming you are more interested in custom sources anyways?

alvaroaleman avatar Jun 09 '25 19:06 alvaroaleman

I am assuming you are more interested in custom sources anyways?

Yeah correct -- in our custom source impl, I was thinking about encapsulating this to pull its own state, and not having the controller configure it explicitly, just kick off the loading of the checkpoint.

austince avatar Jun 09 '25 19:06 austince

Yeah correct -- in our custom source impl, I was thinking about encapsulating this to pull its own state, and not having the controller configure it explicitly, just kick off the loading of the checkpoint.

Why would the controller do anything specific here as opposed to the source doing this within its Start()?

alvaroaleman avatar Jun 09 '25 19:06 alvaroaleman

Why would the controller do anything specific here as opposed to the source doing this within its Start()?

That makes sense, yeah we could leave it entirely to the source.

EDIT: though thinking more, we need to allow the user to invalidate all state around a controller when logic changes. Can probably still encapsulate sources, but think controller needs play a role (as that is what users usually configure).

austince avatar Jun 09 '25 19:06 austince

I assume #3192 does not work for your case to speed up?

sbueringer avatar Jun 10 '25 06:06 sbueringer

#3192 will be good to try, though some of our controllers take about 30 minutes to warm up (build some inmem data structures for all watched resources) and put large load on the server (in our case, this also leads to high cost as we back our API server with CSP DBs). If we can successfully offload to blob storage, I think we can solve both latency and cost.

austince avatar Jun 10 '25 11:06 austince

EDIT: though thinking more, we need to allow the user to invalidate all state around a controller when logic changes. Can probably still encapsulate sources, but think controller needs play a role (as that is what users usually configure).

I do think this functionality is generally a pretty big footgun, reliably telling if a controller has changed or not is very difficult as its not only the code but also potentially config. This is also the only case where someone has asked about this, so I am currently not in favor of adding this here into upstream.

If I really wanted this feature and had a custom builder, I would probably consider adding it there.

alvaroaleman avatar Jun 20 '25 15:06 alvaroaleman

This is also the only case where someone has asked about this, so I am currently not in favor of adding this here into upstream.

That makes a lot of sense, +1. I'd also rather hammer out the pain internally first. I'd been thinking about this problem in the space of Apache Flink, where this is a huge pain point. While they have a transparent "logic ID" generator based on the function hash, it also causes problems:

If you do not specify the IDs manually they will be generated automatically. You can automatically restore from the savepoint as long as these IDs do not change. The generated IDs depend on the structure of your program and are sensitive to program changes. Therefore, it is highly recommended assigning these IDs manually.

I don't think the case of "generating a new hash that drops state when logic hasn't actually changed" matters as much in controller cases, as it wouldn't be used for correctness (we could always drop and restart from empty), so a "source-code+configuration based hash" in a custom builder sounds like a reasonable place to start.

@alvaroaleman, are you supportive of adding the required hooks in controller-runtime to allow this to be reliably done in user space?

austince avatar Jun 20 '25 16:06 austince

@alvaroaleman, are you supportive of adding the required hooks in controller-runtime to allow this to be reliably done in user space?

What hooks are you talking about? As I mentioned earlier, you can implement this in your sources.

alvaroaleman avatar Jun 20 '25 17:06 alvaroaleman

IIUC, we'd need:

  • A way to block process shutdown while certain runnables are still running
  • A way to access all sources for a controller

austince avatar Jun 20 '25 17:06 austince

Sorry I don't understand:

A way to block process shutdown while certain runnables are still running

Nothing in controller-runtime keeps you from keeping your binary running?

A way to access all sources for a controller

Why would you need that? Implement it in the source, be done, why would the controller need to know anything about this?

alvaroaleman avatar Jun 24 '25 20:06 alvaroaleman

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Sep 22 '25 20:09 k8s-triage-robot

/close

We will come back with benchmarks :)

austince avatar Sep 22 '25 21:09 austince

@austince: Closing this issue.

In response to this:

/close

We will come back with benchmarks :)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Sep 22 '25 21:09 k8s-ci-robot