operator icon indicating copy to clipboard operation
operator copied to clipboard

Defer important status messages until commit?

Open stub42 opened this issue 4 years ago • 2 comments

A common problem with charms has been ensuring workload_status contains the most important information, and not overwritten. For example, ensuring that if charm code puts a unit into a 'blocked' status then later code does not overwrite it with a 'maintenance' message or worse 'active' status.

With the operator framework, it seems the common cause of this will be deferred events. For example, a handler for a relation-changed event may need to set the status to 'waiting' because the other end is not yet ready. Other events may also be emitted (in this hook or subsequent), and their handlers may overwrite a more important 'waiting' or 'blocked' status with maintenance messages, and the end user ends up unaware why their deployment has hung.

An approach used by a charms.reactive layer (layer-status) was to only set 'maintenance' status immediately. Maintenance status allowed the charm to inform the user about what operations it is doing. If the charm set blocked, waiting or maintenance then the message stored instead of emitted via status-set. At the end of the hook, the most important message was emitted. Blocked messages had highest priority, followed by waiting, and active only emitted if no blocked or waiting messages were set by the charm. I found this approach to work very well, pushing for its use to ensure problems with charm deployments were always visible, without a lot of complexity or tight coupling. It was also the only sane way to ensure that workload_status states set by shared code did not get obliterated by charm code. Shared code could set a blocked status, and know it or another blocked status got escalated to the end user. Without it, shared code had to instead fail and put the charm into an error state.

I'm currently looking at the following implementation to add this behavior to my operator framework charm, and thought I'd open the issue here in case the operator framework could provide similar behavior to all charms If provided by the operator framework, then shared code such as relation implementations could also rely on the behavior, rather than fighting with the charm for control of the workload_status:

class MyCharm(ops.charm.CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        self.framework.observe(self.framework.on.pre_commit, self)

    _active_msg = None
    _block_msg = None
    _wait_msg = None

    def set_active(self, msg):
        # Active status is deferred until commit, overriding maintenance only.
        self._active_msg = msg

    def set_blocked(self, msg):
        # Blocked status is deferred until commit, overriding all other status
        self._block_msg = msg

    def set_waiting(self, msg):
        # Waiting status is deferred until commit, overriding maintenance and active
        self._wait_msg = msg

    def set_maintenance(self, msg):
        # Maintenance status is emitted immediately, informing end user of charm activity
        self.unit.status = ops.model.MaintenanceStatus(msg)

    def update_status(self):
        if self._block_msg is not None:
            self.unit.status = ops.model.BlockedStatus(self._block_msg)
        elif self._wait_msg is not None:
            self.unit.status = ops.model.WaitingStatus(self._wait_msg)
        elif self._active_msg is not None:
            self.unit.status = ops.model.ActiveStatus(self._active_msg)

stub42 avatar Mar 11 '20 05:03 stub42

I recognize the problem and we had conversations about it last month, and we're still looking for a good solution. You provide some interesting ideas we hadn't discussed before, but I'm not sure about how much we can use that concept in an environment where code is owned by multiple parties that do not coordinate.

For example, It seems strange to me to make a blocked status override anything else, when you have code evaluating the status and finding out that it got unblocked after the initial change was made. This only seems logical if you're the one making that assumption in the first place.. for someone else it will probably seem arbitrary that code run in sequence cannot behave in sequence.

In our conversations we also discussed how much sense it would make for library code to be touching the charm status. It seems hard for an individual piece that is imported into a charm to understand whether a particular situation is cause for a status change or is being worked around by the charm logic itself in different ways.

One idea we pondered about would be raising an error in particular cases where we cannot obtain the requested data (say, imagine a mysql.get_database method being called without a database connected), and including in the error a suggested status change as an attribute next to a proper error message explaining the error. This way the code remains idiomatic, the library doesn't touch the status, and it is convenient for the charm author to take such an error and transform it into a status change.

We're still exploring, though.. happy to have a call on this to exchange ideas with more bandwidth, if you're interested.

niemeyer avatar Mar 13 '20 10:03 niemeyer

I think that raising exceptions is a good way to abort execution and escalate issues. The framework could set the status if unhandled, and the charm has opportunity to catch the exception if it instead prefers to handle it. I'm currently experimenting with this approach to have subroutines defer events, signaling that they don't have enough state to continue. In my case, the subroutine is often setting the status before raising the DeferEvent exception. It would probably be better to adopt your suggestion, raising an exception containing suggestions (blocked status, try again later).

stub42 avatar Mar 13 '20 10:03 stub42

This is going to be solved by "multi status" or "component status with priority" that we're going to work on this cycle (23.10).

benhoyt avatar Apr 28 '23 11:04 benhoyt

We're considering this solved by the new collect-status events released in ops 2.5.0

benhoyt avatar Sep 21 '23 03:09 benhoyt