fleet Spike: Notifications

We want to notify external services about events in Fleet, so users can build workflows around Fleet as a deployer.

For now we are focusing on one type of outgoing notification, an HTTP request, e.g. targeted at a webhook endpoint.

Notifications are configured by creating a new custom resource, with fields like: URL, template for request body, credential references, and which events to react to. A new controller will watch for notification configs and then generate requests from that config when an event happens.

Research

What events should generate a notification?

- gitrepo added
- gitrepo deployed
- bundle events

- Do we need logic to combine events, e.g. only notify if A and B happened?
- How do other projects deal with notifications?
- Are there "generic" notifiers we can use, besides http requests?

Sep 18 '24 13:09 manno

@manno https://pkg.go.dev/k8s.io/client-go/tools/record ?

Sep 20 '24 07:09 bigkevmcd

Findings

Scope

An MVP will need to include:

- support for sending POST requests to an HTTP(S) endpoint, with a body in plain text or JSON
  - this would bring some flexibility, eg. being able to send comments to GitHub PRs
- templates specifying that request body
- credentials configuration to authenticate against an HTTP(S) endpoint, eg. referencing a secret:
  - basic auth
  - token-based auth
- simple triggers, eg:
  - GitRepo ready across all target clusters
  - Bundle deployed to cluster X
  - Failing bundle on cluster X
  - Cluster X is offline
  - (optional) drift for bundle Y on cluster X, in cases where drift correction is disabled
- target customisations, enabling users to override GitRepo-level configuration: triggers, destination URL, template, credentials (eg. sending notifications to different URLs, with different bodies, from dev/staging/prod clusters)

Out of scope

- composite triggers (eg. condition A and/or condition B)
- support for more protocols and specific destination platforms, eg. Slack, OpsGenie

Design

Notifications should be sent:

- from the management cluster, which has access to statuses, while agents only deal with deploying workloads. This also makes configuration easier.
- asynchronously, to prevent delays in reconcile loops → We need a separate controller for notifications, keeping in mind that it needs to support sharding as other Fleet controllers do.

Each notification should store the latest date/time when it was sent, to ease tracking (for users, and later on automated retries) and troubleshooting.

Configuration

Notifications must be configurable at 3 distinct levels:

- globally, with settings applying to all workloads, GitRepos, bundles, bundle deployments → useful for existing setups with many workloads, setting defaults for all.
  - This could be achieved through a new config map (separate from the existing fleet-controller config map to limit risks of involuntary edits to working Fleet config), which could be reconciled by the existing config reconciler or by a new one.
- at workload level, applying to a GitRepo and its targets. Configuration set here would override global settings.
  - This could be done by adding a new Notifications field to GitRepos. Predicates could be updated to enable that field to be updated without triggering reconcile loops, as notifications config has nothing to do with altering the state of the cluster.
- at bundle deployment level, through target customizations which would then override any configuration set globally and/or at workload level.
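
The override order described above (global, then GitRepo, then target customization) could be resolved with a simple layered merge. This is only a sketch under assumptions: the `Settings` type and its fields are hypothetical, and it assumes that an empty field means "inherit from the level above", which has not been decided.

```go
package main

import "fmt"

// Settings holds notification configuration at any level (hypothetical fields).
type Settings struct {
	URL        string
	Template   string
	SecretName string
	Triggers   []string
}

// override returns base with every non-empty field of over applied on top.
func override(base, over Settings) Settings {
	if over.URL != "" {
		base.URL = over.URL
	}
	if over.Template != "" {
		base.Template = over.Template
	}
	if over.SecretName != "" {
		base.SecretName = over.SecretName
	}
	if over.Triggers != nil {
		base.Triggers = over.Triggers
	}
	return base
}

// resolve computes the effective settings for one bundle deployment:
// global defaults, overridden by the GitRepo, overridden by the target.
func resolve(global, gitRepo, target Settings) Settings {
	return override(override(global, gitRepo), target)
}

func main() {
	global := Settings{URL: "https://example.invalid/default", Triggers: []string{"bundle-failed"}}
	gitRepo := Settings{Template: "{{.GitRepo}} failed"}
	target := Settings{URL: "https://example.invalid/prod"}

	fmt.Printf("%+v\n", resolve(global, gitRepo, target))
}
```

With this approach, a target customization that sets only a URL still inherits the template from the GitRepo and the triggers from the global config.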

Ideas on reconciler implementation

Could a reconciler's predicates read triggers from the different possible configuration sources for notifications (global config map, GitRepo, target customizations) to determine whether or not to run a reconcile loop?
- The reconcile loop itself would then read the rest of the configuration to build and send the notification requests
- Note: this is where storing triggers in metadata, eg. annotations (as done by ArgoCD), could bring a performance benefit through partial metadata lists.

Links - Possibly of interest

https://github.com/azure/gitops-connector

Oct 24 '24 13:10 weyfonk

I don't see any indication of the desired guarantees around notification delivery, e.g. "at least once" or "at most once" or "exactly once"?

For example, GitHub webhooks are not retried (but you can do this manually).

A lot of care needs to be taken to not send notifications out-of-order.

Imagine reporting the deployment of 2dd90c5 before 468c517 when they are ordered differently on main, because the original attempt to deliver 468c517 failed the first time.

Some decisions could be made around translating events into hook notifications to upstream services (e.g. Slack, GitHub etc) assuming that in the longer term, Rancher components will all be sending notifications.

Oct 24 '24 14:10 bigkevmcd