Support global timeouts for reconcilers

Open ekuefler opened this issue 5 years ago • 14 comments

I ran into an issue today where one of my reconcilers was unexpectedly deadlocking on a particular resource. This appeared to halt that reconciler's processing of all other resources, essentially freezing the system until I restarted the manager binary (where it would eventually freeze again when it got to the bad resource).

It would be nice if I could configure a timeout that would apply to all reconcilers in the manager such that they would automatically abort and back off if they appear to be deadlocked.

ekuefler avatar Feb 13 '20 20:02 ekuefler

The idea of a global timeout is definitely interesting and we can probably make it opt-in. I've seen these kinds of situations when resources weren't sufficient, most commonly in deployments with fewer than 2-3 CPUs allocated.

vincepri avatar Feb 15 '20 15:02 vincepri

Sounds good. My particular situation wasn't related to starving the CPU. I have a resource representing an external server that I need to do TCP health checks on. I'd failed to call SetDeadline on my connection, so when I got a server that responded to the connect request but failed to write back any data, the reconciler would block indefinitely on that resource and lock out processing for all the others. A global timeout would be a useful safety net here.
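For reference, here's roughly the check I should have written (a minimal sketch; the address, buffer size, and timeout values are illustrative):

import (
	"net"
	"time"
)

// checkTCP dials addr and waits for the server to write something back.
func checkTCP(addr string) error {
	conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
	if err != nil {
		return err
	}
	defer conn.Close()

	// This is the call I was missing: without a deadline, a peer that accepts
	// the connection but never writes back leaves Read blocked forever.
	if err := conn.SetDeadline(time.Now().Add(5 * time.Second)); err != nil {
		return err
	}

	buf := make([]byte, 64)
	_, err = conn.Read(buf)
	return err
}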

ekuefler avatar Feb 16 '20 01:02 ekuefler

/priority important-soon
/kind design
/help

vincepri avatar Feb 21 '20 17:02 vincepri

@vincepri: This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to this:

/priority important-soon
/kind design
/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Feb 21 '20 17:02 k8s-ci-robot

they would automatically abort and back off if they appear to be deadlocked

I'm a little bit confused about what this issue is about. You want some timeout mechanism that would terminate the reconciler automatically (i.e., without the user handling this termination in their reconciliation logic)?

GrigoriyMikhalkin avatar Feb 24 '20 19:02 GrigoriyMikhalkin

That's right. Many web frameworks have a built-in timeout that will terminate a request if it takes too long, ensuring request threads aren't occupied indefinitely. I would imagine something similar here. In kubebuilder's case, it would force the reconciler to return an error.

Here's the wrapper I wrote and am using in my project to do this (it also handles panics per #797):

import (
	"fmt"
	"runtime/debug"
	"time"

	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// MakeSafe wraps a reconciler so that panics become errors and any single
// reconcile call is bounded by a timeout.
func MakeSafe(r reconcile.Reconciler) reconcile.Reconciler {
	return safeReconciler{impl: r}
}

type safeReconciler struct {
	impl reconcile.Reconciler
}

func (r safeReconciler) Reconcile(request reconcile.Request) (reconcile.Result, error) {
	type response struct {
		result reconcile.Result
		err    error
	}

	// Buffered so the inner goroutine can always deliver its response and
	// exit, even after we've stopped waiting for it.
	ch := make(chan response, 1)
	go func() {
		defer func() {
			if p := recover(); p != nil {
				ch <- response{
					err: fmt.Errorf("panic: %v [recovered]\n\n%s", p, debug.Stack()),
				}
			}
		}()

		res, err := r.impl.Reconcile(request)
		ch <- response{
			result: res,
			err:    err,
		}
	}()

	select {
	case resp := <-ch:
		return resp.result, resp.err
	case <-time.After(10 * time.Second):
		return reconcile.Result{}, fmt.Errorf("reconciler timed out")
	}
}

This is imperfect since it means we'll leak a goroutine if the wrapped reconciler deadlocks, but it will at least guarantee that kubebuilder's worker goroutines can't deadlock.

ekuefler avatar Feb 26 '20 08:02 ekuefler

IMO, it looks like a solution for concealing bugs in a program instead of fixing them.

GrigoriyMikhalkin avatar Feb 26 '20 14:02 GrigoriyMikhalkin

It is not possible to terminate a goroutine from the outside, so I don't know how this could be implemented (leaking goroutines is IMHO not an option we should consider; it also means that we'd break the guarantee of "only one routine will work on a given key at a given time").

Or did I misunderstand something about this?

alvaroaleman avatar Mar 23 '20 02:03 alvaroaleman

This is really tricky as there is no way to terminate a goroutine which executes indefinitely. I otherwise like the code that @ekuefler pasted above and hope we could do something in that spirit.

Adding contexts to Reconcile would allow us to signal to the goroutine that it should stop doing things (and then we'd wait a bit longer for it to "cleanup & return" before timing out), but it doesn't guarantee that the goroutine respects the context.
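Something in this direction (a sketch assuming the context-aware Reconcile signature; the wrapper type and timeout value are illustrative, and again it only helps if the reconciler actually checks ctx):

import (
	"context"
	"time"

	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// withTimeout wraps a context-aware reconciler so that every Reconcile call
// receives a context that expires after a fixed duration.
type withTimeout struct {
	impl    reconcile.Reconciler
	timeout time.Duration
}

func (r withTimeout) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
	// A well-behaved reconciler observes ctx.Done() and returns early; one
	// that ignores it still runs to completion (or deadlocks).
	ctx, cancel := context.WithTimeout(ctx, r.timeout)
	defer cancel()
	return r.impl.Reconcile(ctx, req)
}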

A poor "solution" would be to swap out the underlying client that the reconcile call uses to a client which just directly errors once the timeout has passed (also to make sure that deadlocked goroutine doesn't do any harm), but that definitely has drawbacks / side-effects.

luxas avatar Jul 14 '20 10:07 luxas

A poor "solution" would be to swap out the underlying client that the reconcile call uses to a client which just directly errors once the timeout has passed (also to make sure that deadlocked goroutine doesn't do any harm), but that definitely has drawbacks / side-effects.

This would help a little, but I'd imagine many reconcilers will make calls to external systems that would not go through such a context-aware client. Eventually, a user will end up in the same situation, with unbounded execution.

IMO, the controller author must have a context available to them at the root of execution.

I would go a step further than just having a context on reconcilers: mapping functions should include a context argument to bound their execution as well. One good example of unbounded execution in a handler is the following, extracted from Cluster API. As you can see, the context.Background() passed into client.List(...) is problematic.

import (
	"context"

	"github.com/pkg/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1alpha3"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/client/apiutil"
	"sigs.k8s.io/controller-runtime/pkg/handler"
)

// ClusterToObjectsMapper returns a mapper function that gets a cluster and lists all objects for the object passed in
// and returns a list of requests.
// NB: The objects are required to have `clusterv1.ClusterLabelName` applied.
func ClusterToObjectsMapper(c client.Client, ro runtime.Object, scheme *runtime.Scheme) (handler.Mapper, error) {
	if _, ok := ro.(metav1.ListInterface); !ok {
		return nil, errors.Errorf("expected a metav1.ListInterface, got %T instead", ro)
	}

	gvk, err := apiutil.GVKForObject(ro, scheme)
	if err != nil {
		return nil, err
	}

	return handler.ToRequestsFunc(func(o handler.MapObject) []ctrl.Request {
		cluster, ok := o.Object.(*clusterv1.Cluster)
		if !ok {
			return nil
		}

		list := &unstructured.UnstructuredList{}
		list.SetGroupVersionKind(gvk)
		if err := c.List(context.Background(), list, client.MatchingLabels{clusterv1.ClusterLabelName: cluster.Name}); err != nil {
			return nil
		}

		results := []ctrl.Request{}
		for _, obj := range list.Items {
			results = append(results, ctrl.Request{
				NamespacedName: client.ObjectKey{Namespace: obj.GetNamespace(), Name: obj.GetName()},
			})
		}
		return results
	}), nil
}

context.Context should probably be the first parameter on anything that calls into controller author code.
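To illustrate, the lookup above could be bounded if the handler supplied a context. The plumbing that would pass ctx into the mapping function is hypothetical today; the List call itself is the real API:

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1alpha3"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// clusterToRequests is the body of the mapper above with the caller's ctx
// threaded through, so the List call is bounded instead of unbounded.
func clusterToRequests(ctx context.Context, c client.Client, gvk schema.GroupVersionKind, cluster *clusterv1.Cluster) []ctrl.Request {
	list := &unstructured.UnstructuredList{}
	list.SetGroupVersionKind(gvk)
	// ctx, not context.Background(), bounds this call.
	if err := c.List(ctx, list, client.MatchingLabels{clusterv1.ClusterLabelName: cluster.Name}); err != nil {
		return nil
	}

	results := []ctrl.Request{}
	for _, obj := range list.Items {
		results = append(results, ctrl.Request{
			NamespacedName: client.ObjectKey{Namespace: obj.GetNamespace(), Name: obj.GetName()},
		})
	}
	return results
}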

If a breaking change is to be made, I'd argue it would be best to make one sweeping change rather than multiple small breaking changes.

devigned avatar Jul 14 '20 14:07 devigned

context.Context should probably be the first parameter on anything that calls into controller author code.

Yes

Also see https://github.com/kubernetes-sigs/controller-runtime/issues/801#issuecomment-658110635, in particular:

The manager could also, e.g., cancel the ctx given to the reconcile loop upon a Ctrl-C signal. We could just say "if Reconcile doesn't respect the context and deadlocks, you're violating the contract", with the tradeoff that if a reconcile loop doesn't follow the contract, there might be multiple Reconcile loops executing at the same time (breaking the otherwise-guaranteed invariant of never running two Reconcile loops concurrently for the same NamespacedName, as in the faulty-resource example).
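With that model, the wiring would look roughly like this (a sketch assuming the context-based Start and SetupSignalHandler this would require; error handling trimmed):

import (
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		panic(err)
	}
	// SetupSignalHandler returns a context that is cancelled on Ctrl-C; the
	// manager hands it down to every Reconcile call, and a contract-respecting
	// reconciler sees the cancellation and returns.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}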

luxas avatar Jul 14 '20 14:07 luxas

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

/lifecycle stale

fejta-bot avatar Oct 12 '20 15:10 fejta-bot

/lifecycle frozen

vincepri avatar Oct 12 '20 15:10 vincepri

The changes around context above have been implemented. To get cancellation for reconcilers we'll need to inject a context with a set timeout, although that doesn't really guarantee that operations will be cancelled, so there is still potentially a way to leak goroutines.

vincepri avatar Oct 12 '20 15:10 vincepri