[Bug] Reduce event storm when resyncPeriod is low
Describe the bug
Kubernetes emits an update event every time the field status.lastResync
is updated in a Dashboard resource. GitOps implementations like ArgoCD are notified about these update events and have to check whether anything actually changed.
ArgoCD and api-server CPU usage climbs sharply when resyncPeriod is set to a low value like 60s and there are many dashboards (>50) with large JSON definitions. In the case of ArgoCD, the argocd-application-controller consumes CPU and writes lots of recurring, trivial logs.
The status field of a Kubernetes resource should not be constantly updated when there is no real change to the resource. I recommend that grafana-operator follow that design pattern.
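A common way to honor that pattern is a compare-before-write guard; a minimal sketch, where patch is a hypothetical status writer and not the operator's actual code:

def patch_status_if_changed(current: dict, desired: dict, patch) -> None:
    # Only call the API server when the status actually differs;
    # a no-op resync then emits no update event at all.
    if current != desired:
        patch(desired)

(For this to help with lastResync specifically, the timestamp would of course have to be left out of the written status, as proposed below.)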
Version: grafana-operator v5.0.1
To Reproduce: Deploy all the kube-prometheus-stack dashboards using grafana-operator and watch the events on the Dashboard custom resources.
Expected behavior: No Kubernetes update event if the resource has not changed.
Suspect component/Location where the bug might be occurring: The status.lastResync timestamp should probably be kept in memory instead of being written to the Dashboard custom resource, or maybe held in Redis or a similar database. From my point of view, the status could be kept in memory; the Dashboards would then be re-synced with Grafana whenever the grafana-operator restarts, which would be okay.
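As a rough illustration of the in-memory idea (a minimal sketch with hypothetical names, not the operator's actual code):

import time

# Hypothetical in-memory tracker: last resync times live in a dict instead of
# in status.lastResync, so no update event reaches the API server.
last_resync = {}

def due_for_resync(dashboard_uid, resync_period_s):
    # A dashboard is due when it has never been synced (the dict starts
    # empty, e.g. after an operator restart) or the period has elapsed.
    previous = last_resync.get(dashboard_uid)
    return previous is None or time.monotonic() - previous >= resync_period_s

def mark_resynced(dashboard_uid):
    last_resync[dashboard_uid] = time.monotonic()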
Screenshots: N/A
Runtime (please complete the following information):
- Grafana Operator Version v5.0.1
- Environment: Kubernetes v1.27.3
- Deployment type: grafana-operator
- Other: GitOps by ArgoCD
ArgoCD can be configured to ignore certain fields in the yaml: https://argo-cd.readthedocs.io/en/stable/operator-manual/reconcile/#system-level-configuration
Here is an example of how that can be used: https://github.com/argoproj/argo-cd/issues/13534
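For example, something along these lines in the argocd-cm ConfigMap should make ArgoCD skip the timestamp when diffing (an untested sketch based on the docs linked above; adjust group and kind to your setup):

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  resource.customizations.ignoreDifferences.grafana.integreatly.org_GrafanaDashboard: |
    jsonPointers:
      - /status/lastResync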
This is probably not something we can fix on the Operator side: if the resync period elapses, we have to update that timestamp; otherwise we wouldn't know when the next resync is due.
@pb82 won't this be solved when https://github.com/grafana-operator/grafana-operator/pull/1213 is merged? Since we don't perform any update, we shouldn't update the status.
But I do agree with you, ignoring this is probably worth doing anyway.
ArgoCD already ignores updates to the status fields of CRs; however, ArgoCD still has to load each CR and do the comparison. As the Dashboards are often large, this takes CPU to process.
My point was that the Grafana Operator can store the timestamps in memory; they do not have to be stored in the resource in etcd. However, let's see the effect of #1213 and whether it reduces the number of updates.
Storing the timestamps in memory could be a good optimization. We need to check if / how that affects restarts of the operator, as all timestamps would be lost.
This issue hasn't been updated for a while; marking as stale. Please respond within the next 7 days to remove this label.
I still suggest implementing this optimization, as the object status was never designed for this kind of usage.
This is still an issue; please re-open it.
@ErikLundJensen can you take a look at an alternative design for how this could be solved instead?
A workaround could be setting an EventRateLimit on Dashboard objects.
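For reference, EventRateLimit is an api-server admission plugin configured through an AdmissionConfiguration file; a sketch of what that could look like (untested, the limits are placeholders, and note that it throttles Event objects rather than the resource update events themselves):

apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
  - name: EventRateLimit
    configuration:
      apiVersion: eventratelimit.admission.k8s.io/v1alpha1
      kind: Configuration
      limits:
        - type: SourceAndObject
          qps: 5
          burst: 20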
The basic problem is that the Grafana Operator misuses the status field of the Dashboard objects by writing to it even though nothing has changed.
I have tested again with version 5.4.1, which includes #1213.
@NissesSenap Do you want me to come up with a design where last-resync is cached in memory and eventually written to the Dashboard object status every x minutes?
Yes, something like that. Something that makes it possible for our users to know, preferably by looking at the dashboard object, whether it has been synced to the Grafana instance, but that still doesn't cause issues with event rate limits.
A PR around that would be great.
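A rough sketch of that design (hypothetical names throughout; patch_status stands in for whatever actually writes the Dashboard status):

import time

class ResyncStatusCache:
    # Sketch: record resyncs in memory and only write them back to the
    # Dashboard status once per flush window, instead of once per resync.
    def __init__(self, flush_interval_s, patch_status):
        self.flush_interval_s = flush_interval_s
        self.patch_status = patch_status  # hypothetical status writer
        self.pending = {}
        self.last_flush = time.monotonic()

    def record(self, dashboard_uid):
        self.pending[dashboard_uid] = time.time()
        self.maybe_flush()

    def maybe_flush(self):
        if time.monotonic() - self.last_flush < self.flush_interval_s:
            return
        for uid, ts in self.pending.items():
            self.patch_status(uid, ts)  # one update per window per dashboard
        self.pending.clear()
        self.last_flush = time.monotonic()

With a flush interval of, say, five minutes and a 60s resyncPeriod, this would cut the status writes per dashboard by a factor of five while still surfacing sync state on the object.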
I just ran into the same issue. I don't think the operator even has to store the last-resync in memory. Instead, we could do something like this:
import time

def event_loop():
    while True:
        current_time = int(time.time())
        # Check if 30 seconds have passed
        if current_time % 30 == 0:
            print("30 seconds passed")
        # Check if 60 seconds have passed
        if current_time % 60 == 0:
            print("60 seconds passed")
        # Wait for 1 second
        time.sleep(1)

event_loop()
And to avoid the issue that all dashboards are synchronized at the same time, we could simply offset by one second for each dashboard uid.
import time

def sync_dashboard(uid):
    # Placeholder for the actual sync against the Grafana API
    print(f"syncing dashboard {uid}")

def main():
    # List of dashboard uids.
    # This list must be sorted by creation timestamp to avoid issues when deleting/adding dashboards.
    dashboard_uids = ['id1', 'id2', 'id3']
    while True:
        current_time = int(time.time())
        # Iterate over each dashboard and its index
        for index, uid in enumerate(dashboard_uids):
            offset = index + 1
            if (current_time + offset) % 30 == 0:
                sync_dashboard(uid)
        # Wait for 1 second before checking again
        time.sleep(1)

main()
(Theoretically, this would "break" once we reach 30 or 31 dashboards, but that's fine, since with more than 30 dashboards and a 30-second resync period we would have to sync multiple dashboards at the same time anyway.)
So on our TODO list, we are planning to move over to using https://github.com/grafana/grafana-operator/blob/e43aff9dca5e907f71a5139097bd9d0529f2bf78/controllers/controller_shared.go#L100, which should decrease the number of events drastically.
We just haven't had the time to do this for grafanaDashboards yet. If anyone is up for it, we would love a PR.
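For anyone picking this up: if the plan is along the lines of the in-memory scheduling discussed above, the core loop could look roughly like this (a Python sketch of the idea, not the operator's actual Go; reconcile is a stand-in):

import heapq
import time

def reconcile(dashboard_uid):
    # Stand-in for the real reconcile that talks to Grafana.
    print(f"resyncing {dashboard_uid}")

def run(dashboard_uids, resync_period_s):
    # Min-heap of (next-due time, uid); nothing is written back to status,
    # so the API server sees no recurring update events at all.
    queue = [(time.monotonic(), uid) for uid in dashboard_uids]
    heapq.heapify(queue)
    while queue:
        due, uid = heapq.heappop(queue)
        time.sleep(max(0.0, due - time.monotonic()))
        reconcile(uid)
        heapq.heappush(queue, (time.monotonic() + resync_period_s, uid))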