Dependency event triggers don't handle lookup errors

Open mdbooth opened this issue 11 months ago • 1 comments

Dependency provides a WatchEventHandler implementation for the dependency which it manages. The purpose of this is to watch the Dependency and trigger events for all objects which depend on it. This requires a 'reverse lookup'.

For example, take Network and Subnet. Subnet has a networkRef, but Network does not have a corresponding subnetRef. The subnet controller needs to watch networks so it can trigger a reconcile when one becomes available, but as the triggering network has no back-reference it must do a lookup using the field index it created:

https://github.com/k-orc/openstack-resource-controller/blob/0a186e2e8823570beaff683d0bdf611cd3eb3d02/internal/util/dependency/dependency.go#L163-L167

The problem here is that the lookup can fail, but event handlers cannot return errors. We log any error which occurs, but ignore it because we have no way to handle it here. If any error did occur here we would potentially have lost a triggering event which might leave a reconciliation waiting for it hanging.

There are a couple of mitigating factors here. Firstly, in practise this lookup is unlikely (impossible?) to fail, because it will be an in-memory lookup from the informer cache. I'm uncomfortable with making this sort of tightly coupled assumption, but it might be argument against creating a theoretically more robust, but much more complex workaround.

Secondly, even if this was doing an actual API call which could fail, it's failure would likely correlate with an error causing a resync of the informer cache, which would itself trigger another opportunity to fire the event handler.

So while unsatisfying, it may not be worth fixing in practice. It might be worth bringing this up in the controller-runtime community to see if there's an existing idiom for handling this.

Feb 05 '25 11:02 mdbooth

https://github.com/kubernetes-sigs/controller-runtime/issues/1996

and some slack discussion: https://kubernetes.slack.com/archives/C02MRBMN00Z/p1738761791434629

tl;dr There is no current solution to this, but my assumption that a lookup from the informer cache can't fail (at least after initialisation) is probably correct, so we don't necessarily need to panic over this.

Feb 05 '25 15:02 mdbooth