grafana-operator
[Bug] NotificationChannels not reconciling after Grafana restart
Describe the bug
When the Grafana deployment pod is recreated (either by deletion or eviction), the grafananotificationchannels
are not reconciled by the Operator.
Version v4.5.1
To Reproduce
For easy reproduction, I'm using Minikube + Bitnami Grafana Operator charts (but the behavior is the same in production environments).
The following values.yaml is used, which installs Grafana with Legacy Alerting and creates a Pager Duty notification channel for testing:
grafana:
  image:
    tag: 8.5.9-debian-11-r7
  config:
    # Ensure LEGACY alerting
    alerting:
      enabled: true
    unified_alerting:
      enabled: false
extraDeploy:
  - apiVersion: integreatly.org/v1alpha1
    kind: GrafanaNotificationChannel
    metadata:
      name: pager-duty-channel
      labels:
        app.kubernetes.io/instance: grafana-operator
    spec:
      name: pager-duty-channel.json
      json: >
        {
          "uid": "pager-duty-alert-notification",
          "name": "Pager Duty alert notification",
          "type": "pagerduty",
          "isDefault": true,
          "sendReminder": true,
          "frequency": "15m",
          "disableResolveMessage": true,
          "settings": {
            "integrationKey": "put key here",
            "autoResolve": true,
            "uploadImage": true
          }
        }
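For reference, the json field above is the legacy alert-notification payload that the operator submits to Grafana. A minimal Go sketch of decoding such a payload (the struct, field set, and parseChannel helper here are illustrative, not the operator's actual types):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// channelSettings mirrors the "settings" object of the payload above.
type channelSettings struct {
	IntegrationKey string `json:"integrationKey"`
	AutoResolve    bool   `json:"autoResolve"`
	UploadImage    bool   `json:"uploadImage"`
}

// notificationChannel mirrors a legacy alert-notification payload.
type notificationChannel struct {
	UID                   string          `json:"uid"`
	Name                  string          `json:"name"`
	Type                  string          `json:"type"`
	IsDefault             bool            `json:"isDefault"`
	SendReminder          bool            `json:"sendReminder"`
	Frequency             string          `json:"frequency"`
	DisableResolveMessage bool            `json:"disableResolveMessage"`
	Settings              channelSettings `json:"settings"`
}

// parseChannel decodes the raw JSON carried in spec.json.
func parseChannel(payload string) (notificationChannel, error) {
	var ch notificationChannel
	err := json.Unmarshal([]byte(payload), &ch)
	return ch, err
}

func main() {
	payload := `{"uid":"pager-duty-alert-notification","name":"Pager Duty alert notification","type":"pagerduty","isDefault":true,"sendReminder":true,"frequency":"15m","disableResolveMessage":true,"settings":{"integrationKey":"put key here","autoResolve":true,"uploadImage":true}}`
	ch, err := parseChannel(payload)
	if err != nil {
		panic(err)
	}
	fmt.Println(ch.UID, ch.Type) // pager-duty-alert-notification pagerduty
}
```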
- Start Minikube and install the Grafana Operator:
$ minikube start --kubernetes-version='1.24.3'
$ helm repo add bitnami https://charts.bitnami.com/bitnami
$ helm repo update
$ helm install grafana-operator bitnami/grafana-operator \
--namespace grafana-operator --create-namespace \
--version='2.6.10' \
--values=values.yaml
- In the Operator logs, check that the Notification Channel is successfully submitted:
$ kubectl logs grafana-operator-xxxxxx-xxxxx
[...]
1.6596620760552166e+09 INFO running periodic notificationchannel resync
1.6596620761186154e+09 INFO notificationchannel grafana-operator/pager-duty-channel successfully submitted
1.6596620761186965e+09 DEBUG events Normal {"object": {"kind":"GrafanaNotificationChannel","namespace":"grafana-operator","name":"pager-duty-channel","uid":"b14e2e0d-fc16-443e-9f9a-44d078e93731","apiVersion":"integreatly.org/v1alpha1","resourceVersion":"5251"}, "reason": "Success", "message": "notificationchannel grafana-operator/pager-duty-channel successfully submitted"}
- Force the Grafana Deployment pod to be recreated:
$ kubectl delete po grafana-deployment-xxxxxx-xxxxx
- Once the pod is recreated, recheck the Operator logs:
1.6596622458995874e+09 INFO running periodic dashboard resync
1.659662246054992e+09 INFO running periodic notificationchannel resync
1.6596622482120936e+09 DEBUG action-runner ( 0) SUCCESS update admin credentials secret
1.6596622482158809e+09 DEBUG action-runner ( 1) SUCCESS update grafana service
1.659662248218911e+09 DEBUG action-runner ( 2) SUCCESS update grafana service account
1.6596622482220106e+09 DEBUG action-runner ( 3) SUCCESS update grafana config
1.659662248222039e+09 DEBUG action-runner ( 4) SUCCESS plugins unchanged
1.6596622482309968e+09 DEBUG action-runner ( 5) SUCCESS update grafana deployment
1.6596622482310247e+09 DEBUG action-runner ( 6) SUCCESS check deployment readiness
1.6596622482443264e+09 DEBUG grafana-controller desired cluster state met
1.6596622558992486e+09 INFO running periodic dashboard resync
1.6596622560558162e+09 INFO running periodic notificationchannel resync
Any existing grafanadashboards
and grafanadatasources
are recreated, but the Notification Channel is not.
- (Optional) If the
grafananotificationchannel
object is recreated, the Operator identifies the change and submits it to Grafana again:
$ kubectl get -o json grafananotificationchannels pager-duty-channel | kubectl replace --force -f -
grafananotificationchannel.integreatly.org "pager-duty-channel" deleted
grafananotificationchannel.integreatly.org/pager-duty-channel replaced
Screenshots:
Right after installation:
Missing notification channels after Grafana pod recreation:
Recreating the grafananotificationchannel
object restores the state shown in the first screenshot, which is the expected behavior.
Expected behavior
It's expected that the Operator submits all grafananotificationchannels
to the Grafana instance when a pod recreation occurs, without the need to recreate the objects.
Suspect component/Location where the bug might be occurring
May be related to Legacy Alerting.
Runtime:
- OS: Linux
- Grafana Operator Version: v4.5.1
- Environment: Kubernetes / Minikube (but also reproducible in self-managed production k8s)
- Deployment type: Bitnami Helm Chart
@eduardobaitello thanks for the comprehensive description! From what I can see in the code, there's no logic inside the notification channel controller to check whether the channel still exists in the Grafana instance, so the controller takes no action unless the hash of the channel spec changes. Most likely, it's easy to fix, as we already have such logic for the dashboard controller. I'll take a closer look at it within a few days.
(UPD): I have a PoC fix, just need some time to polish it.
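The diagnosis above can be sketched in a few lines of Go (the function names and the stored-hash mechanism here are a simplified illustration, not the operator's actual code): the controller resubmits only when the hash of the spec changes, so a channel wiped out by a pod restart is never restored, whereas also checking for the channel's existence in Grafana catches that case.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashSpec mimics change detection based on the channel's JSON spec.
func hashSpec(specJSON string) string {
	sum := sha256.Sum256([]byte(specJSON))
	return hex.EncodeToString(sum[:])
}

// shouldSubmit is the buggy logic: it only compares the stored hash with
// the current spec and ignores whether the channel still exists in Grafana.
func shouldSubmit(storedHash, specJSON string) bool {
	return storedHash != hashSpec(specJSON)
}

// shouldSubmitFixed also resubmits when the channel is absent from the
// Grafana instance, analogous to what the dashboard controller does.
func shouldSubmitFixed(storedHash, specJSON string, existsInGrafana bool) bool {
	return !existsInGrafana || storedHash != hashSpec(specJSON)
}

func main() {
	spec := `{"uid":"pager-duty-alert-notification","type":"pagerduty"}`
	stored := hashSpec(spec) // hash recorded after the first submission

	// After a Grafana pod restart the channel is gone from Grafana,
	// but the spec hash is unchanged:
	fmt.Println("buggy resubmits:", shouldSubmit(stored, spec))             // buggy resubmits: false
	fmt.Println("fixed resubmits:", shouldSubmitFixed(stored, spec, false)) // fixed resubmits: true
}
```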
@weisdd thanks for the feedback!
If there's anything I can help with, just let me know.
@eduardobaitello I've just opened a PR, it's likely to be reviewed next week.
I just tested the v4.6.0 release, and it's working now. Thanks!