
[Bug] NotificationChannels not reconciling after Grafana restart

Open · eduardobaitello opened this issue · 2 comments

Describe the bug When the Grafana deployment pod is recreated (either by deletion or eviction), the grafananotificationchannels are not reconciled by the Operator.

Version v4.5.1

To Reproduce

For easy reproduction, I'm using Minikube + the Bitnami Grafana Operator chart (the behavior is the same in production environments).

The following values.yaml is used, which installs Grafana with Legacy Alerting and creates a PagerDuty notification channel for testing:

grafana:
  image:
    tag: 8.5.9-debian-11-r7
  config:
    # Ensure LEGACY alerting
    alerting:
      enabled: true
    unified_alerting:
      enabled: false

extraDeploy:
  - apiVersion: integreatly.org/v1alpha1
    kind: GrafanaNotificationChannel
    metadata:
      name: pager-duty-channel
      labels:
        app.kubernetes.io/instance: grafana-operator
    spec:
      name: pager-duty-channel.json
      json: >
        {
          "uid": "pager-duty-alert-notification",
          "name": "Pager Duty alert notification",
          "type":  "pagerduty",
          "isDefault": true,
          "sendReminder": true,
          "frequency": "15m",
          "disableResolveMessage": true,
          "settings": {
            "integrationKey": "put key here",
            "autoResolve": true,
            "uploadImage": true
          }
        }
  1. Start Minikube and install the Grafana Operator:
$ minikube start --kubernetes-version='1.24.3'

$ helm repo add bitnami https://charts.bitnami.com/bitnami

$ helm repo update

$ helm install grafana-operator bitnami/grafana-operator \
  --namespace grafana-operator --create-namespace \
  --version='2.6.10' \
  --values=values.yaml
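
Once the chart is installed, an optional sanity check confirms that the custom resource from extraDeploy exists (the namespace matches the --namespace flag used above):

$ kubectl get grafananotificationchannels -n grafana-operator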
  2. In the Operator logs, check that the Notification Channel is successfully submitted:
$ kubectl logs grafana-operator-xxxxxx-xxxxx
[...]
1.6596620760552166e+09	INFO	running periodic notificationchannel resync
1.6596620761186154e+09	INFO	notificationchannel grafana-operator/pager-duty-channel successfully submitted
1.6596620761186965e+09	DEBUG	events	Normal	{"object": {"kind":"GrafanaNotificationChannel","namespace":"grafana-operator","name":"pager-duty-channel","uid":"b14e2e0d-fc16-443e-9f9a-44d078e93731","apiVersion":"integreatly.org/v1alpha1","resourceVersion":"5251"}, "reason": "Success", "message": "notificationchannel grafana-operator/pager-duty-channel successfully submitted"}
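
The channel can also be verified on the Grafana side through the legacy alerting API. This is an optional check; the service name grafana-service and the admin credentials are assumptions, so adjust them to your deployment:

$ kubectl -n grafana-operator port-forward svc/grafana-service 3000:3000 &
$ curl -s -u admin:<admin-password> http://localhost:3000/api/alert-notifications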
  3. Force the Grafana Deployment pod to be recreated:
$ kubectl delete po grafana-deployment-xxxxxx-xxxxx
  4. Once the pod is recreated, recheck the Operator logs:
1.6596622458995874e+09	INFO	running periodic dashboard resync
1.659662246054992e+09	INFO	running periodic notificationchannel resync
1.6596622482120936e+09	DEBUG	action-runner	(    0)    SUCCESS update admin credentials secret
1.6596622482158809e+09	DEBUG	action-runner	(    1)    SUCCESS update grafana service
1.659662248218911e+09	DEBUG	action-runner	(    2)    SUCCESS update grafana service account
1.6596622482220106e+09	DEBUG	action-runner	(    3)    SUCCESS update grafana config
1.659662248222039e+09	DEBUG	action-runner	(    4)    SUCCESS plugins unchanged
1.6596622482309968e+09	DEBUG	action-runner	(    5)    SUCCESS update grafana deployment
1.6596622482310247e+09	DEBUG	action-runner	(    6)    SUCCESS check deployment readiness
1.6596622482443264e+09	DEBUG	grafana-controller	desired cluster state met
1.6596622558992486e+09	INFO	running periodic dashboard resync
1.6596622560558162e+09	INFO	running periodic notificationchannel resync

Any existing grafanadashboards and grafanadatasources are recreated, but the Notification Channel is not.
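
To confirm the channel is gone from Grafana itself and not just from the UI, the API call from step 2 can be repeated; since this minimal setup gives Grafana no persistent storage, the list should now come back empty:

$ curl -s -u admin:<admin-password> http://localhost:3000/api/alert-notifications
[]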

  5. (Optional) If the grafananotificationchannel object is recreated, the Operator detects the change and submits it to Grafana again:
$ kubectl get -o json grafananotificationchannels pager-duty-channel | kubectl replace --force -f -
grafananotificationchannel.integreatly.org "pager-duty-channel" deleted
grafananotificationchannel.integreatly.org/pager-duty-channel replaced

Screenshots:

Right after installation: [screenshot showing the notification channel present]

Missing notification channels after Grafana pod recreation: [screenshot]

Recreating the grafananotificationchannel object reverts to the first screenshot, which is the expected behavior.

Expected behavior The Operator should submit all grafananotificationchannels to the Grafana instance when a pod recreation occurs, without the need to recreate the objects.

Suspect component/Location where the bug might be occurring May be related to Legacy Alerting.

Runtime:

  • OS: Linux
  • Grafana Operator Version: v4.5.1
  • Environment: Kubernetes / Minikube (but also reproducible in self-managed production k8s)
  • Deployment type: Bitnami Helm Chart

eduardobaitello avatar Aug 05 '22 02:08 eduardobaitello

@eduardobaitello thanks for the comprehensive description! From what I can see in the code, there's no logic inside the notification channel controller to check whether the channel still exists in the Grafana instance, so the controller takes no action unless the hash of the channel spec changes. Most likely it's an easy fix, since we already have such logic for the dashboard controller. I'll take a closer look at it within a few days.

(UPD): I have a PoC fix, just need some time to polish it.

weisdd avatar Aug 06 '22 11:08 weisdd

@weisdd thanks for the feedback!

If there's anything I can help with, just let me know.

eduardobaitello avatar Aug 08 '22 23:08 eduardobaitello

@eduardobaitello I've just opened a PR; it's likely to be reviewed next week.

weisdd avatar Aug 12 '22 12:08 weisdd

I just tested the v4.6.0 release, and it's working now. Thanks!

eduardobaitello avatar Aug 22 '22 18:08 eduardobaitello