[Bug] Reduce event storm when resyncPeriod is low
Describe the bug
Kubernetes emits an update event every time the field status.lastResync
is updated in a Dashboard resource. GitOps implementations like ArgoCD are notified about these update events and have to check whether anything actually changed.
ArgoCD and api-server CPU usage climbs sharply when resyncPeriod is set to a low value like 60s and there are many dashboards (>50) with large JSON definitions. In the case of ArgoCD, the argocd-application-controller consumes CPU and writes lots of recurring, trivial logs.
The status field of a Kubernetes resource should not be constantly updated when there is no real change to the resource. I recommend that grafana-operator follow that design pattern.
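A common way to honor that pattern is a compare-before-write guard; a minimal sketch, where patch is a hypothetical status writer and not the operator's actual code:

def patch_status_if_changed(current: dict, desired: dict, patch) -> None:
    # Only call the API server when the status actually differs;
    # a no-op resync then emits no update event at all.
    if current != desired:
        patch(desired)

(For this to help with lastResync specifically, the timestamp would of course have to be left out of the written status, as proposed below.)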
Version: grafana-operator v5.0.1
To Reproduce: Deploy all the kube-prometheus-stack dashboards using grafana-operator and watch the events on the Dashboard custom resources.
Expected behavior: No Kubernetes update event if the resource has not changed.
Suspect component/Location where the bug might be occurring: The status.lastResync timestamp should probably be kept in memory instead of being written to the Dashboard custom resource, or maybe held in Redis or a similar database. From my point of view, the status could be kept in memory; the Dashboards would then be re-synced with Grafana whenever the grafana-operator restarts, which would be okay.
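As a rough illustration of the in-memory idea (a minimal sketch with hypothetical names, not the operator's actual code):

import time

# Hypothetical in-memory tracker: last resync times live in a dict instead of
# in status.lastResync, so no update event reaches the API server.
last_resync = {}

def due_for_resync(dashboard_uid, resync_period_s):
    # A dashboard is due when it has never been synced (the dict starts
    # empty, e.g. after an operator restart) or the period has elapsed.
    previous = last_resync.get(dashboard_uid)
    return previous is None or time.monotonic() - previous >= resync_period_s

def mark_resynced(dashboard_uid):
    last_resync[dashboard_uid] = time.monotonic()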
Screenshots: N/A
Runtime (please complete the following information):
- Grafana Operator Version v5.0.1
- Environment: Kubernetes v1.27.3
- Deployment type: grafana-operator
- Other: GitOps by ArgoCD
ArgoCD can be configured to ignore certain fields in the yaml: https://argo-cd.readthedocs.io/en/stable/operator-manual/reconcile/#system-level-configuration
Here is an example of how that can be used: https://github.com/argoproj/argo-cd/issues/13534
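For example, something along these lines in the argocd-cm ConfigMap should make ArgoCD skip the timestamp when diffing (an untested sketch based on the docs linked above; adjust group and kind to your setup):

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  resource.customizations.ignoreDifferences.grafana.integreatly.org_GrafanaDashboard: |
    jsonPointers:
      - /status/lastResync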
This is probably not something we can fix on the Operator side: if the resync period elapses, we have to update that timestamp; otherwise we wouldn't know when the next resync is due.
@pb82 won't this be solved when https://github.com/grafana-operator/grafana-operator/pull/1213 is merged? Since we don't perform any update, we shouldn't update the status.
But I do agree with you, ignoring this is probably worth doing anyway.
ArgoCD already ignores updates to the status fields of CRs; however, ArgoCD still has to load each CR and do the comparison. As the Dashboards are often large, this takes CPU to process.
My point was that the Grafana Operator can store the timestamps in memory; they do not have to be stored in the resource in etcd. However, let's see the effect of #1213 and whether it reduces the number of updates.
Storing the timestamps in memory could be a good optimization. We need to check if / how that affects restarts of the operator, as all timestamps would be lost.
This issue hasn't been updated for a while; marking as stale. Please respond within the next 7 days to remove this label.
I still suggest implementing this optimization, as the object status was never designed for this kind of usage.
This is still an issue; please re-open it.
@ErikLundJensen can you take a look at an alternative design for how this could be solved instead?
A workaround could be setting an EventRateLimit on Dashboard objects.
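For reference, EventRateLimit is an api-server admission plugin configured through an AdmissionConfiguration file; a sketch of what that could look like (untested, the limits are placeholders, and note that it throttles Event objects rather than the resource update events themselves):

apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
  - name: EventRateLimit
    configuration:
      apiVersion: eventratelimit.admission.k8s.io/v1alpha1
      kind: Configuration
      limits:
        - type: SourceAndObject
          qps: 5
          burst: 20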
The basic problem is that the Grafana Operator misuses the status field of the Dashboard objects by writing to it even though nothing has changed.
I have tested again with version 5.4.1, which includes #1213.
@NissesSenap Do you want me to come up with a design where last-resync is cached in memory and eventually written to the Dashboard object status every x minutes?
Yes, something like that. Something that makes it possible for our users to know, preferably by looking at the dashboard object, whether it has been synced to the Grafana instance, but that still doesn't cause issues with event rate limits.
A PR around that would be great.
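A rough sketch of that design (hypothetical names throughout; patch_status stands in for whatever actually writes the Dashboard status):

import time

class ResyncStatusCache:
    # Sketch: record resyncs in memory and only write them back to the
    # Dashboard status once per flush window, instead of once per resync.
    def __init__(self, flush_interval_s, patch_status):
        self.flush_interval_s = flush_interval_s
        self.patch_status = patch_status  # hypothetical status writer
        self.pending = {}
        self.last_flush = time.monotonic()

    def record(self, dashboard_uid):
        self.pending[dashboard_uid] = time.time()
        self.maybe_flush()

    def maybe_flush(self):
        if time.monotonic() - self.last_flush < self.flush_interval_s:
            return
        for uid, ts in self.pending.items():
            self.patch_status(uid, ts)  # one update per window per dashboard
        self.pending.clear()
        self.last_flush = time.monotonic()

With a flush interval of, say, five minutes and a 60s resyncPeriod, this would cut the status writes per dashboard by a factor of five while still surfacing sync state on the object.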
I just ran into the same issue. I don't think the operator even has to store the last-resync in memory. Instead, we could do something like this:
import time

def event_loop():
    while True:
        current_time = int(time.time())
        # Check if 30 seconds have passed
        if current_time % 30 == 0:
            print("30 seconds passed")
        # Check if 60 seconds have passed
        if current_time % 60 == 0:
            print("60 seconds passed")
        # Wait for 1 second
        time.sleep(1)

event_loop()
And to avoid the issue that all dashboards are synchronized at the same time, we could simply offset by one second for each dashboard uid.
import time

def sync_dashboard(uid):
    # Placeholder for the actual sync against the Grafana API
    print(f"syncing dashboard {uid}")

def main():
    # List of dashboard uids.
    # This list must be sorted by creation timestamp to avoid issues when deleting/adding dashboards.
    dashboard_uids = ['id1', 'id2', 'id3']
    while True:
        current_time = int(time.time())
        # Iterate over each dashboard and its index
        for index, uid in enumerate(dashboard_uids):
            offset = index + 1
            if (current_time + offset) % 30 == 0:
                sync_dashboard(uid)
        # Wait for 1 second before checking again
        time.sleep(1)

main()
(Theoretically, this would "break" once we reach 30 or 31 dashboards, but that's fine, since with more than 30 dashboards and a 30-second resync period we would have to sync multiple dashboards at the same time anyway.)
So on our TODO list, we are planning to move over to using https://github.com/grafana/grafana-operator/blob/e43aff9dca5e907f71a5139097bd9d0529f2bf78/controllers/controller_shared.go#L100, which should decrease the number of events drastically.
We just haven't had the time to do this for grafanaDashboards yet. If anyone is up for it, we would love a PR.
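For anyone picking this up: if the plan is along the lines of the in-memory scheduling discussed above, the core loop could look roughly like this (a Python sketch of the idea, not the operator's actual Go; reconcile is a stand-in):

import heapq
import time

def reconcile(dashboard_uid):
    # Stand-in for the real reconcile that talks to Grafana.
    print(f"resyncing {dashboard_uid}")

def run(dashboard_uids, resync_period_s):
    # Min-heap of (next-due time, uid); nothing is written back to status,
    # so the API server sees no recurring update events at all.
    queue = [(time.monotonic(), uid) for uid in dashboard_uids]
    heapq.heapify(queue)
    while queue:
        due, uid = heapq.heappop(queue)
        time.sleep(max(0.0, due - time.monotonic()))
        reconcile(uid)
        heapq.heappush(queue, (time.monotonic() + resync_period_s, uid))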