kopf icon indicating copy to clipboard operation
kopf copied to clipboard

Consider status Stanza Updates As Triggers For State-Changing Handlers By Default

Open paxbit opened this issue 3 years ago • 0 comments

Problem

After implementing the hotfix offered in #686 of aligning the handler id=, I noticed some operator handlers were not firing for certain events. I was kind of lucky discovering this because it was only triggered by a premature container death (caused by a config error) of a pod managed by kopf. This particular handler was using a combination of on.resume(...) and on.update(...).

Debugging this led me astray for quite some time because the behavior of the missing handler invocation could be changed with an average probability by setting non-suspending log breakpoints in certain places, like e.g. in kopf.reactor.registries.ResourceChangingRegistry.iter_handlers. It could however reliably be suppressed by setting suspending breakpoints to begin stepping through the code. This always made the handler being callled. So to me it looked like a timing/threading/async issue or a combination of those. I thought those breakpoint delays are making it work and also that I'm dealing with a kind of heisenbug. Silly me :)

To cut an otherwise lengthy story short, here's what I found:

Beforehand: The handler not being called was a handler interested in the pod status.containerStatuses stanza, to detect premature container deaths.

  1. kopfs implementation is deliberately excluding status from the the decision making process when figuring out if a resource change event satisfies a handler invocation requirement. This is actually documented in code here: https://github.com/nolar/kopf/blob/1.29.2/kopf/storage/diffbase.py#L33 In the rendered documentation there is this note:

    Worth noting that Kopf stores the status of the handlers, such as their progress or errors or retries, in the object itself (.status stanza), which triggers its own low-level events, but these events are not detected as separate causes, as there is nothing changed essentially.

  2. Before applying the id= alignment hotfix from #686 the touch-dummy "noise" on a pod in question did eventually trigger the handler by circumstance through metadata updates as a side-effect. The breakpoints I mention above were allowing the handler to be called through introducing enough delay for a (still-applied-just-not-hammering-anymore) touch-dummy to intervene with a meta update.

  3. The actual MODIFIED event of the status.containerStatuses update was always filtered out. #686 was masking that fact up until now.

  4. After realizing the above I tried setting on.update(..., field='status'). It worked, status is now considered an essential change which leads to diffs.diff(old, new) not returning empty handed, which in turn leads to the created resource_changing_cause being of Reason.UPDATE instead of Reason.NOOP.

  5. In the rendered docs field= is undocumented for state changing handlers. It is documented for on.field(...) handlers though.

Proposals

Change the Default Behavior

I'd argue, at least in the case of pod resources, status is of the essence as well and should always be considered for handler triggering after MODIFIED events. My personal opinion is that a pods spec should be treated as immutable after creation. Following this line of thinking would make it an anti-pattern to patch a pod after creation. In this scenario there can be no MODIFIED event on a pod spec ever. However even if one allows her-/himself to patch spec.containers as-you-go style I'd say it is still a way more common use case for an operator to observe a pods life-cycle through looking at its status. So IMO at least for pods it should be included by default. Then, API-wise I find it somehow counter-intuitive status changes do not trigger. The reason is that it allows (and actually encourages to do so via argument unpacking) a handler decorated like so:


@kopf.on.update(...)
async def the_handler(status, **_):
    # do sth. useful, then...
    return

Looking at this I would expect the handler being notified to deal with status updates. However the handler will only ever see a status after some metadata or spec update, as a side-effect.

I do understand that on.field(...) and on.event(...) handlers do no share this kind of "status-blindness". And I assume the reason for the current situation was that, historically, status was the handler status storage location for a resource and patching kopfs own handler states into it was supposed to not trigger spurious handler calls b/c hey, why would user code be interested in those internals. So any change towards the proposal should probably consider implementing some kind of secondary filtering logic making sure that an on.update(...) handler of a pod is only triggered for status updated not concerned with kopf handler states. But maybe only if a progress storage implementation is configured that actually writes to status. So I guess for 'AnnotationProgressStorage' this does not apply.

Update the Documentation

After I understood the situation I thought the code comment is very clear but basically invisible until stumbled upon. If including status for changes by default can or should not be implemented for some reason at least the docs could be clearer about the situation. Yes the info box in the docs I quoted above does note it however without the knowledge I have now I read it as specifically only applying to kopfs own handler status storage while implicitly not applying to any other status change.

Checklist

  • [x] Many users can benefit from this feature, it is not a one-time case
  • [x] The proposal is related to the K8s operator framework, not to the K8s client libraries

paxbit avatar Mar 16 '21 13:03 paxbit