operator icon indicating copy to clipboard operation
operator copied to clipboard

Support scoped event handler statuses

Open knkski opened this issue 3 years ago • 12 comments

Setting self.model.unit.status across multiple event handlers doesn't work in charms. The Operator Framework needs to present some better way of handling this, or else charm authors are forced to write one event handler that responds to every event.

A simple example of the issue:

  • Event handler A sets BlockedStatus
  • Event handler B sets BlockedStatus
  • Event handler A runs successfully, sets ActiveStatus
  • Whoops, now event handler B's BlockedStatus got lost

See discussion here for more mitigation attempts, but anything yet proposed ultimately just ends up shoving the one-event-handler-to-rule-them-all into something else, such as _is_ready or update_status helper functions.

See also this discussion, which is about relation-level statuses. Allowing relation-level statuses is a pragmatic solution, because it doesn't solve the general case, but does solve the case of one application install event handler, plus N relation event handlers. This pattern applies to most or all of the K8s charms written today.

As a concrete example of what the title means, this decorator allows for an event handler to only care about its own status, without having to worry about the state of the rest of the application:

def handle_status_properly(f):
    """Updates status after decorated event handler is run.

    WARNING: For demonstration purposes only. Does not persist statuses.
    """
    STATUSES = {}

    @wraps(f)
    def wrapper(self, event):
        try:
            status = f(self, event)
        except Exception as err:
            status = BlockedStatus(str(err))

        if status is None:
            if f.__name__ in STATUSES:
                del STATUSES[f.__name__]
        else:
            STATUSES[f.__name__] = status

        if STATUSES:
            status_type = type(list(STATUSES.values())[0])
            self.model.unit.status = status_type('; '.join([st.message for st in STATUSES.values()]))
        else:
            self.model.unit.status = ActiveStatus()
    return wrapper


class Operator(CharmBase):
    def __init__(self, *args):
        super().__init__(*args)

        self.framework.observe(self.on.install, self.install)
        self.framework.observe(self.on["istio-pilot"].relation_changed, self.install)
        self.framework.observe(self.on.config_changed, self.install)

    @handle_status_properly
    def install(self, event):
        if not self.unit.is_leader():
            return WaitingStatus("Waiting for leadership")

        if self.model.config['kind'] not in ('ingress', 'egress'):
            return BlockedStatus('Config item `kind` must be set')

        if not self.model.relations['istio-pilot']:
            return BlockedStatus("Waiting for istio-pilot relation")

        try:
            pilot = get_interface(self, "istio-pilot")
        except NoVersionsListed as err:
            return WaitingStatus(str(err))

        if not pilot.get_data():
            return BlockedStatus("Waiting for istio-pilot relation data")

        pilot = list(pilot.get_data().values())[0]

        env = Environment(loader=FileSystemLoader('src'))
        template = env.get_template('manifest.yaml')
        rendered = template.render(
            kind=self.model.config['kind'],
            namespace=self.model.name,
            pilot_host=pilot['service-name'],
            pilot_port=pilot['service-port'],
        )

        subprocess.run(["./kubectl", "apply", "-f-"], input=rendered.encode('utf-8'), check=True)

Note that any other event handler that runs and wants to set ActiveStatus has to ensure that all of that code ran successfully, which is equivalent to just running that code, and puts us back into one event handler to rule them all.

knkski avatar Nov 04 '21 14:11 knkski

As a real-world example, alertmanager-operator attempted to have different event handlers for each event. Though the charm was well-written, that approach ended up requiring shoving almost all of the logic into self._common_exit_hook, and almost all of the event handlers are bare calls to that helper function.

The single block of code in that charm that is not in a shared event handler and that isn't just logging looks like this:

self._stored.config_hash = (
    ""
    if not self.container.can_connect()
    else sha256(yaml.safe_dump(yaml.safe_load(self.container.pull(self._config_path))))
)

That raises the question of what happens if self.container.pull() fails, or self.container.can_connect() returns False? self.model.unit.status should be updated to reflect that, but then we run into the situation detailed above where different event handlers can erase each other's BlockedStatus. That block of code should be moved to self._common_exit_hook, at which point we've arrived at one event handler to rule them all.

knkski avatar Nov 04 '21 16:11 knkski

Thank you for the well and thoroughly written bug.

I agree that scoped statuses would be highly useful. I'll sketch out a design, and see when we can slot the work into our roadmap.

pengale avatar Nov 04 '21 20:11 pengale

I'm not completely convinced by this; it feels like something that can be solved quite simply by charmers. Having a separate method that performs the relative checks to ensure an application is ready before starting/restarting seems like an obvious way forward here.

There have been comments made in the past about "one event handler to rule them all". I think as a general/default pattern it should be used with care. Equally, if the application is such that having a single method that understands how to validate and start an application makes sense then I don't see it as a big problem?

I think the important thing is that charmers are mindful of the semantic differences between, for example, pebble-ready, start, config-changed, etc. and ensure that they are handled appropriately for the application in question.

There is a related discussion to this which explores some alternative solutions.

@jameinel @javacruft may also have opinions here, and I'm keen to get more input to ensure I'm not missing something, but the above proposal seems like it might introduce confusion as to how/when statuses are set in a lot of cases.

jnsgruk avatar Nov 05 '21 13:11 jnsgruk

After sleeping on this, and talking to some folks about this, I have more thoughts.

Mostly, I'm a little concerned about charms routinely setting, then unsetting a "blocked" status.

If a charm sets a "blocked" status, that means it has run into a situation that needs human intervention. If we have charms that are routinely setting blocked status, then unsetting it when something else happens, we are not using that status correctly, and should probably do something else.

That doesn't meant that there aren't gaps in OF (or Juju) support that are temping charm authors into overloading the blocked status. But the path forward here is to fill in those gaps, rather than making it easier to use blocked where something else would be more appropriate :-)

pengale avatar Nov 05 '21 14:11 pengale

I think a common pattern is very much that you might be able to think of some of the logic locally, but ultimately all sources of logic end up funneling into one or a small number of final functions. Whether that is because the status of the unit is derived from the status of all of its potential relations, or because the config of the final application is a sum total of all of the relations and the configuration of the unit. I think there are a few potential designs that can be useful here. Whether that is our discussion of a Dependency pattern (call this handler when both A and B are true). Or something that would provide something similar for statuses (each of these objects can set its idea of unit status, and then this other bit of logic gets the chance to combine all of that into the ultimate unit status.) Having something like a map of {'name': Status} instead of just a simple attribute would at least let each relation/logical unit have a place to put their information. But we do want something that does some sort of "when anything changes its local status, re-evaluate the global status". There are lots of ways we could model this, a rough example of something like StatusCollector object could be:

self.collector = StatusCollector(self._status_aggregator)
...
def _on_db_changed(self, event):
  if not ready:
    self.collector.set_status(name, Blocked, message)

def _status_aggregator(self):
  status = Active(...)
  for name, sub_status in self.collector.statuses:
    if sub_status is not None:
      status = Blocked
      message = message + sub_status.message

There are some issues of resolving (if one sub-status is Maintenance but another is Blocked, the result should be Blocked, but otherwise it is Maintenance, etc)

On Fri, Nov 5, 2021 at 9:24 AM Jon Seager @.***> wrote:

I'm not completely convinced by this; it feels like something that can be solved quite simply by charmers. Having a separate method that performs the relative checks to ensure an application is ready before starting/restarting seems like an obvious way forward here.

There have been comments made in the past about "one event handler to rule them all". I think as a general/default pattern it should be used with care. Equally, if the application is such that having a single method that understands how to validate and start an application makes sense then I don't see it as a big problem?

I think the important thing is that charmers are mindful of the semantic differences between, for example, pebble-ready, start, config-changed, etc. and ensure that they are handled appropriately for the application in question.

There is a related discussion https://github.com/canonical/istio-operators/pull/40 to this which explores some alternative solutions.

@jameinel https://github.com/jameinel @javacruft https://github.com/javacruft may also have opinions here, and I'm keen to get more input to ensure I'm not missing something, but the above proposal seems like it might introduce confusion as to how/when statuses are set in a lot of cases.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/canonical/operator/issues/665#issuecomment-961892109, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABRQ7J4QQJWIOIX2TB2KJTUKPLJPANCNFSM5HLRZ6AA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

jameinel avatar Nov 05 '21 14:11 jameinel

In Ken's case, I think they run into 'validating the relation data' which may require a human to upgrade the charm to a newer version, which would then unblock that relation, but doesn't want to invalidate/overwrite a different relation stating that it also needs attention. Certainly you shouldn't go into Blocked casually, but you may want to report Maintenance or Waiting if you have the relation but the other side hasn't finished the conversation yet. (You're missing the relation to a frontend, eg Blocked, but the database that you are related to hasn't informed you of its password yet (Waiting). When the DB is happy, you still want to be in blocked.)

On Fri, Nov 5, 2021 at 10:08 AM Pen Vander Giessen @.***> wrote:

After sleeping on this, and talking to some folks about this, I have more thoughts.

Mostly, I'm a little concerned about charms routinely setting, then unsetting a "blocked" status.

If a charm sets a "blocked" status, that means it has run into a situation that needs human intervention. If we have charms that are routinely setting blocked status, then unsetting it when something else happens, we are not using that status correctly, and should probably do something else.

That doesn't meant that there aren't gaps in OF (or Juju) support that are temping charm authors into overloading the blocked status. But the path forward here is to fill in those gaps, rather than making it easier to use blocked where something else would be more appropriate :-)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/canonical/operator/issues/665#issuecomment-961924936, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABRQ7JXM4B353Z7KWDSY7DUKPXNFANCNFSM5HLRZ6AA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

jameinel avatar Nov 05 '21 14:11 jameinel

fwiw, alertmanager has a _common_exit_hook because all alertmanager units need to know the address of at least one other unit, and prometheus needs to know them all (over relation data):

  1. the combination of bug/1929364 and bug/1933303 requires charm code to hold off of some actions (upload layer and start service) before something else happens (OF sees the ip address is assigned to the unit).
  2. using defer in multiple places creates a race condition
  3. a complete use-case pattern, using defer + reemit, is not possible: reemit() is blocking + events stack, which causes: RuntimeError: two objects claiming to be AlertmanagerCharm/on/start[16] have been created.
  4. one "mega-event" helps a lot with idempotency
  5. BONUS: either way, I still find I have to rely on update_status anyway to complete the startup sequence (on a low resource host, OF still doesn't see the unit IP address after all startup events fired - you can still see occasionally in alertmanager or prometheus integration tests)

I would love to see a good example of a complex charm that does not use a "main event gate".

sed-i avatar Nov 05 '21 14:11 sed-i

Due to the complexity involved in charming stuff up, and in an attempt to keep things simple (dependency pattern is not simple), afaik, "one event to rule them all" is a very good tradeoff and just works, every time. Again, I will be very happy to be proven wrong on this one.

sed-i avatar Nov 05 '21 14:11 sed-i

@petevg The use of BlockedStatus above just makes the problem more apparent. A WaitingStatus that never resolves is equivalent to a BlockedStatus, the charm just can't prove it.

Even if the WaitingStatus does resolve, the above scenario still illustrates incorrect behavior, it just turns it into a race condition. For example:

  • Event handler A sets WaitingStatus
  • Event handler B sets WaitingStatus
  • Event handler A runs successfully, sets ActiveStatus
  • User tries to interact with the charm, but it's actually broken
  • User files a bug or complains about broken charm
  • Event handler B finally resolves

knkski avatar Nov 08 '21 17:11 knkski

@petevg

If a charm sets a "blocked" status, that means it has run into a situation that needs human intervention. If we have charms that are routinely setting blocked status, then unsetting it when something else happens, we are not using that status correctly, and should probably do something else.

The "human intervention" (or at least the notification of such for the charm) will fundamentally have to come in the form of another Juju hook or action event, at which point the charm would need to change it to something else, which may be "whatever the previous status was", however that is determined. Same for waiting status, except that it's driven by a workload change or hook event.

johnsca avatar Nov 10 '21 17:11 johnsca

@sed-i I agree that "get up and running" part of the life-cycle will require a single handler that may take inputs from several or many different Juju hook events, but once the charm is in that state, there are definitely cases where subsequent hooks will be handled differently. It's much less common in K8s charms because of the nature of how most images use the "restart to reconfigure" pattern, but even then there are some which can respond to updated file-mapped config (which would typically be handled by K8s via ConfigMaps but could be directly written by the charm when using the sidecar approach, I think).

There's some aspect of charms having "states" or "modes of operation" (e.g., "pre-installed", "installed but not running", "running in singleton mode", "running in HA mode", etc) and some of the most complex charms might benefit from splitting the flow logic for those cases out into, say, separate classes. But while I think that might help make the status setting issue a bit more manageable, it wouldn't address all of the issues.

johnsca avatar Nov 10 '21 17:11 johnsca

For the records, we have prototyped a charm lib providing a CompoundStatus object that can be used to just this. That specific implementation turned out to be unnecessarily complicated, but the lessons learned from there resulted in a spec document (OP020) detailing how a solution should look like. Hopefully in the next cycle we'll get around to implementing it in ops.

PietroPasotti avatar Oct 13 '22 13:10 PietroPasotti

To summarize (some may contradict others):

  • The visibility of debug-log is insufficient / not ergonomic enough so it cannot be used in place of unit's status.
  • Charm code should not read status for control flow (stateless, idempotent). Perhaps the ops api should have a setter but no getter for the status?
  • Each relation (and each side of a relation) could benefit having its own status.
  • Currently there is no means to unset a status. When a check passes it doesn't mean that some deeply nested code can now set to ActiveStatus.
  • In "common exit hook" we don't need namespaced status becuase flow is flat and sequential in such a function (but "common exit hook" pattern obviously has other issues).
  • Pietro's compond status is great but no longer maintained? It would be great to have some basic functionality for this in ops.
  • Following some sort of a "dependency pattern" could possibly obviate the entire need.

cc: @benhoyt

sed-i avatar Jan 16 '23 16:01 sed-i

Just noting that we want to spend some time to design this properly and have put it on our roadmap for the upcoming cycle.

benhoyt avatar Mar 19 '23 21:03 benhoyt

Just moving some notes I've had on my personal to-do list here for the record.

For multiple blocked statuses, we could do one of several approaches:

  1. Change Juju to have the concept of a BlockedStatus "context" (a short string), and there can be multiple blocked statuses. "juju status" would show the first one (say alphabetically) with +3 or whatever if there were more, then you could drill in via "juju show-unit".
  2. Change ops to allow multiple BlockedStatus with a context (and probably priority), save this in the model on the controller (state-set, state-get), then have an atexit type thing which sets Juju status to the overall status if it needs to.
  3. Do the above, but as a charm lib, not part of of ops core.
  4. Each charm just "calculates" its overall status at the end of each hook and sets the final Juju status. Add helpers to ops to make this simpler/standardized.

Ian and Harry lean towards not doing it in Juju, and lean towards calculating -- when we try to store things in state they can easily become stale.

See also: https://github.com/canonical/istio-operators/pull/40

benhoyt avatar Mar 19 '23 21:03 benhoyt

Might be stating the obvious, but it would be nice if ops could set the Message field of relations, similar to how it is already able to set the Message of units.

Relation provider                   Requirer                     Interface              Type     Message
alertmanager:alerting               loki:alertmanager            alertmanager_dispatch  regular  
alertmanager:alerting               prometheus:alertmanager      alertmanager_dispatch  regular  
alertmanager:grafana-dashboard      grafana:grafana-dashboard    grafana_dashboard      regular  
alertmanager:grafana-source         grafana:grafana-source       grafana_datasource     regular  

sed-i avatar Mar 26 '23 04:03 sed-i

This has been solved with the multi-status feature -- the new collect_app_status and collect_unit_status events now address this exact problem. Present in ops 2.5.0.

benhoyt avatar Sep 29 '23 01:09 benhoyt