mimir.rules.kubernetes keeps deleting and re-creating rules on Grafana Cloud
What's wrong?
I am using the mimir.rules.kubernetes block with the latest version of the Grafana Agent (flow mode), docker.io/grafana/agent:v0.40.3.
It uploads my PrometheusRule to the remote Mimir instance on Grafana Cloud, but in the UI I can see my alerts being constantly deleted and then recreated, alternating between three states. Here are 3 screenshots:
Steps to reproduce
- Install the agent using Helm in a Kubernetes cluster
- Use this in the values.yaml (a sketch of the remote.kubernetes.secret block it references follows these steps):
extraConfig: |-
  // documentation: https://grafana.com/docs/agent/latest/flow/reference/components/mimir.rules.kubernetes/
  mimir.rules.kubernetes "default" {
    // the secret needs to be referenced by a remote.kubernetes.secret block (done by the config in externalServices)
    address = nonsensitive(remote.kubernetes.secret.logs_service.data["MIMIR_ADDRESS"])
    basic_auth {
      username = nonsensitive(remote.kubernetes.secret.logs_service.data["MIMIR_TENANT_ID"])
      password = remote.kubernetes.secret.logs_service.data["MIMIR_API_KEY"]
    }
  }
- Add a PrometheusRule, containing several alerts, in a namespace you created:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    role: alert-rules
  name: my-api-prometheus
  namespace: pr-xxxx-yyyy
spec:
  groups:
    - name: alerts-my-api
      rules:
        - alert: BlackboxProbeFailed
          annotations:
            description: Service my-api is down for more than 2 minutes.
            summary: my-api API is down!
          expr: probe_success{service="my-api"} == 0
          for: 2m
          labels:
            service: my-api
            severity: warning
        - alert: KubernetesPodCrashLooping
          annotations:
            description: Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping.
            summary: Kubernetes pod crash looping (instance {{ $labels.instance }})
          expr: |-
            increase(
              kube_pod_container_status_restarts_total{pod=~"my-api.*", namespace="pr-xxxx-yyyy"}[1m]
            ) > 3
          for: 2m
          labels:
            service: my-api
            severity: warning
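The comment in the step-2 config notes that the referenced secret comes from a remote.kubernetes.secret block created by the externalServices part of the chart values. For context, a minimal sketch of such a block; the namespace and Secret name below are placeholders, not values from this issue:
remote.kubernetes.secret "logs_service" {
  // Namespace and Secret name are assumptions; the real values are generated
  // from the externalServices section of the Helm chart values.
  namespace = "monitoring"
  name      = "logs-service"
}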
System information
Agent is running on Linux amd64 t3a.medium (AWS - EKS)
Software version
agent:v0.40.3
Configuration
No response
Logs
ts=2024-03-29T08:43:00.958144643Z level=info msg="finished node evaluation" controller_id="" node_id=prometheus.scrape.kubelet duration=319.949µs
ts=2024-03-29T08:43:00.95898935Z level=info msg="finished node evaluation" controller_id="" node_id=discovery.relabel.cadvisor duration=2.616273ms
ts=2024-03-29T08:43:00.959479126Z level=info msg="finished node evaluation" controller_id="" node_id=prometheus.scrape.cadvisor duration=428.004µs
ts=2024-03-29T08:43:06.164666761Z level=info msg="added rule group" component=mimir.rules.kubernetes.default namespace=agent/pr-2646-unify-docker-postgres-in-1/authentication-prometheus/f71bd154-6cd1-4749-b25e-0b1a3eb5ecbf group=alerts-authentication
ts=2024-03-29T08:43:35.99228665Z level=info msg="rejoining peers" peers=10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80,10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80,10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80,10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80
ts=2024-03-29T08:43:36.092533519Z level=info msg="added rule group" component=mimir.rules.kubernetes.default namespace=agent/pr-2646-unify-docker-postgres-in-1/authentication-prometheus/f71bd154-6cd1-4749-b25e-0b1a3eb5ecbf group=alerts-authentication
ts=2024-03-29T08:44:06.137868933Z level=info msg="added rule group" component=mimir.rules.kubernetes.default namespace=agent/pr-2646-unify-docker-postgres-in-1/authentication-prometheus/f71bd154-6cd1-4749-b25e-0b1a3eb5ecbf group=alerts-authentication
ts=2024-03-29T08:44:35.992238997Z level=info msg="rejoining peers" peers=10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80,10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80,10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80,10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80
ts=2024-03-29T08:44:36.074239084Z level=info msg="added rule group" component=mimir.rules.kubernetes.default namespace=agent/pr-2646-unify-docker-postgres-in-1/authentication-prometheus/f71bd154-6cd1-4749-b25e-0b1a3eb5ecbf group=alerts-authentication
ts=2024-03-29T08:44:40.957206227Z level=info msg="finished node evaluation" controller_id="" node_id=discovery.relabel.kubelet duration=585.428µs
ts=2024-03-29T08:44:40.957614201Z level=info msg="finished node evaluation" controller_id="" node_id=prometheus.scrape.kubelet duration=280.44µs
ts=2024-03-29T08:44:40.958029334Z level=info msg="finished node evaluation" controller_id="" node_id=discovery.relabel.cadvisor duration=1.418136ms
ts=2024-03-29T08:44:40.958610323Z level=info msg="finished node evaluation" controller_id="" node_id=prometheus.scrape.cadvisor duration=475.486µs
ts=2024-03-29T08:44:46.237538954Z level=info msg="processing event" component=mimir.rules.kubernetes.default type=resource-changed key=pr-2646-unify-docker-postgres-in-1/authentication-prometheus
ts=2024-03-29T08:44:46.340554683Z level=info msg="updated rule group" component=mimir.rules.kubernetes.default namespace=agent/pr-2646-unify-docker-postgres-in-1/authentication-prometheus/f71bd154-6cd1-4749-b25e-0b1a3eb5ecbf group=alerts-authentication
ts=2024-03-29T08:45:06.096739738Z level=info msg="added rule group" component=mimir.rules.kubernetes.default namespace=agent/pr-2646-unify-docker-postgres-in-1/authentication-prometheus/f71bd154-6cd1-4749-b25e-0b1a3eb5ecbf group=alerts-authentication
ts=2024-03-29T08:45:35.992486801Z level=info msg="rejoining peers" peers=10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80,10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80,10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80,10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80
ts=2024-03-29T08:45:36.112461898Z level=info msg="added rule group" component=mimir.rules.kubernetes.default namespace=agent/pr-2646-unify-docker-postgres-in-1/authentication-prometheus/f71bd154-6cd1-4749-b25e-0b1a3eb5ecbf group=alerts-authentication
ts=2024-03-29T08:46:05.957054716Z level=info msg="finished node evaluation" controller_id="" node_id=discovery.relabel.kubelet duration=635.101µs
ts=2024-03-29T08:46:05.957656296Z level=info msg="finished node evaluation" controller_id="" node_id=prometheus.scrape.kubelet duration=501.917µs
ts=2024-03-29T08:46:05.958030929Z level=info msg="finished node evaluation" controller_id="" node_id=discovery.relabel.cadvisor duration=1.640294ms
ts=2024-03-29T08:46:05.958501184Z level=info msg="finished node evaluation" controller_id="" node_id=prometheus.scrape.cadvisor duration=265.988µs
ts=2024-03-29T08:46:06.100316322Z level=info msg="added rule group" component=mimir.rules.kubernetes.default namespace=agent/pr-2646-unify-docker-postgres-in-1/authentication-prometheus/f71bd154-6cd1-4749-b25e-0b1a3eb5ecbf group=alerts-authentication
ts=2024-03-29T08:46:35.992353439Z level=info msg="rejoining peers" peers=10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80,10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80,10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80,10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80
ts=2024-03-29T08:46:36.079171221Z level=info msg="added rule group" component=mimir.rules.kubernetes.default namespace=agent/pr-2646-unify-docker-postgres-in-1/authentication-prometheus/f71bd154-6cd1-4749-b25e-0b1a3eb5ecbf group=alerts-authentication
ts=2024-03-29T08:47:06.093045379Z level=info msg="added rule group" component=mimir.rules.kubernetes.default namespace=agent/pr-2646-unify-docker-postgres-in-1/authentication-prometheus/f71bd154-6cd1-4749-b25e-0b1a3eb5ecbf group=alerts-authentication
ts=2024-03-29T08:47:35.992322813Z level=info msg="rejoining peers" peers=10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80,10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80,10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80,10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80
ts=2024-03-29T08:47:36.107329395Z level=info msg="added rule group" component=mimir.rules.kubernetes.default namespace=agent/pr-2646-unify-docker-postgres-in-1/authentication-prometheus/f71bd154-6cd1-4749-b25e-0b1a3eb5ecbf group=alerts-authentication
Hi there :wave:
On April 9, 2024, Grafana Labs announced Grafana Alloy, the spiritual successor to Grafana Agent and the final form of Grafana Agent flow mode. As a result, Grafana Agent has been deprecated and will only be receiving bug and security fixes until its end-of-life around November 1, 2025.
To make things easier for maintainers, we're in the process of migrating all issues tagged variant/flow to the Grafana Alloy repository to have a single home for tracking issues. This issue is likely something we'll want to address in both Grafana Alloy and Grafana Agent, so just because it's being moved doesn't mean we won't address the issue in Grafana Agent :)
Okay thanks for your message @rfratto :)
This issue has not had any activity in the past 30 days, so the needs-attention label has been added to it.
If the opened issue is a bug, check to see if a newer release fixed your issue. If it is no longer relevant, please feel free to close this issue.
The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your issue will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity.
Thank you for your contributions!
We are observing this same thing happening with our Alloy and self-hosted Mimir.
Sorry for the delay on an update here. Clustering Alloy instances is usually the source of the issue here, where multiple Alloy instances are fighting over which instance should be writing the rules. With the 1.1 release of Alloy, mimir.rules.kubernetes is clustering-aware and avoids this issue:
Alloy version 1.1 and higher supports clustered mode in this component. When you use this component as part of a cluster of Alloy instances, only a single instance from the cluster will update rules using the Mimir API.
This fix will be backported to Grafana Agent in the near future.
If you are not using clustering, double-check that there aren't multiple Alloy instances running and synchronizing the same PrometheusRule resources with Mimir.
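If several components or deployments do need to watch the same cluster, one way to keep them from syncing the same resources is to scope each mimir.rules.kubernetes component to a disjoint set of PrometheusRule resources via its rule_selector block. A minimal sketch; the address, tenant, and label values here are illustrative assumptions:
mimir.rules.kubernetes "team_a" {
  address   = "https://mimir.example.com"
  tenant_id = "team-a"

  // Only sync PrometheusRule resources carrying this (illustrative) label,
  // leaving other label values free for a second component or deployment.
  rule_selector {
    match_labels = { team = "a" }
  }
}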
I'm not using clustering, and there is only one Alloy instance, with 2 separate mimir.rules.kubernetes components configured.
Alloy v1.1.1 in use.
The Mimir ruler logs contain entries like these:
ts=2024-06-17T12:52:08.732120906Z caller=ruler.go:564 level=info msg="syncing rules" reason=periodic
ts=2024-06-17T12:56:58.749657873Z caller=ruler.go:564 level=info msg="syncing rules" reason=api-change
ts=2024-06-17T12:56:58.952421482Z caller=mapper.go:166 level=info msg="updating rule file" file=/data/anonymous/alloy%2Fmonitoring%2Fmimir-alerts%2Fb0a4da42-9f74-4ff7-876c-5ee63ba12173
ts=2024-06-17T12:56:58.954207552Z caller=mapper.go:166 level=info msg="updating rule file" file=/data/anonymous/alloy%2Fmonitoring%2Fmimir-rules%2Fade2d112-54fe-4a69-865d-9a67eef2f6ad
ts=2024-06-17T12:57:08.749366526Z caller=ruler.go:564 level=info msg="syncing rules" reason=api-change
ts=2024-06-17T13:01:47.142018894Z caller=spanlogger.go:109 method=API.ListRules user=anonymous level=info msg="no rule groups found" userID=anonymous
ts=2024-06-17T13:01:58.749370709Z caller=ruler.go:564 level=info msg="syncing rules" reason=api-change
ts=2024-06-17T13:01:58.920564371Z caller=mapper.go:166 level=info msg="updating rule file" file=/data/anonymous/alloy%2Fmonitoring%2Floki-alerts%2Fc2cd8835-9964-4834-84bd-e01211dfb7c8
ts=2024-06-17T13:01:58.920796737Z caller=mapper.go:166 level=info msg="updating rule file" file=/data/anonymous/alloy%2Fmonitoring%2Floki-rules%2F269bcf08-7829-4efa-a45f-2fdefc2f37ac
ts=2024-06-17T13:02:57.830975946Z caller=ruler.go:564 level=info msg="syncing rules" reason=periodic
ts=2024-06-17T13:06:58.749411084Z caller=ruler.go:564 level=info msg="syncing rules" reason=api-change
ts=2024-06-17T13:06:58.942997428Z caller=mapper.go:166 level=info msg="updating rule file" file=/data/anonymous/alloy%2Fmonitoring%2Fmimir-alerts%2Fb0a4da42-9f74-4ff7-876c-5ee63ba12173
ts=2024-06-17T13:06:58.944327752Z caller=mapper.go:166 level=info msg="updating rule file" file=/data/anonymous/alloy%2Fmonitoring%2Fmimir-rules%2Fade2d112-54fe-4a69-865d-9a67eef2f6ad
ts=2024-06-17T13:11:58.749848971Z caller=ruler.go:564 level=info msg="syncing rules" reason=api-change
ts=2024-06-17T13:11:58.923523066Z caller=mapper.go:166 level=info msg="updating rule file" file=/data/anonymous/alloy%2Fmonitoring%2Floki-alerts%2Fc2cd8835-9964-4834-84bd-e01211dfb7c8
ts=2024-06-17T13:11:58.923779667Z caller=mapper.go:166 level=info msg="updating rule file" file=/data/anonymous/alloy%2Fmonitoring%2Floki-rules%2F269bcf08-7829-4efa-a45f-2fdefc2f37ac
@rfratto I am experiencing the same.
I disabled clustering and set the StatefulSet replicas to 1; however, Alloy keeps recreating the rules:
ts=2024-07-01T14:27:14.081313222Z level=info msg="removed rule group" component_path=/ component_id=mimir.rules.kubernetes.grafana_mimir namespace=alloy/default/<redacted>/7cc51093-3400-4e49-bb15-910a5b0e2076 group=<redacted>
ts=2024-07-01T14:27:14.211113442Z level=info msg="added rule group" component_path=/ component_id=mimir.rules.kubernetes.grafana_mimir namespace=alloy/default/<redacted>/7cc51093-3400-4e49-bb15-910a5b0e2076 group=<redacted>
I am running 1.1.x:
alloy, version v1.1.0 (branch: HEAD, revision: cf46a1491)
build user: root@buildkitsandbox
build date: 2024-05-14T21:07:39Z
go version: go1.22.3
platform: linux/amd64
tags: netgo,builtinassets,promtail_journal_enabled
Alright, I think I found what is causing this never-ending loop of rule recreation:
I have Alloy installed in multiple clusters and enabled mimir.rules.kubernetes in all of them; however, they all communicate with a central Mimir ruler (single tenant).
I noticed that cluster A's Alloy is deleting cluster B's recording rules and vice versa, and each one tries to recreate only the rules that exist in its own local state.
https://github.com/grafana/alloy/blob/5d7b707eafe3096e1e477cda600fac8e976f4734/internal/component/loki/rules/kubernetes/events.go#L105
Is there a correct configuration for this setup when not using multiple tenants?
@56quarters ^ Do the Mimir folks have any opinions about how this should be handled from clients?
I believe the mimir_namespace_prefix option is intended to fix the case where you have multiple clusters, each with its own Alloy setup, talking to a single central Mimir.
In your case @iarlyy I think you'd want something like this:
Cluster A config:
mimir.rules.kubernetes "local" {
  address                = "mimir:8080"
  tenant_id              = "whatever"
  mimir_namespace_prefix = "alloy-a"
}
Cluster B config:
mimir.rules.kubernetes "local" {
  address                = "mimir:8080"
  tenant_id              = "whatever"
  mimir_namespace_prefix = "alloy-b"
}
This would ensure the Alloy instances (Alloys?) for each cluster are making changes to different sets of rules in Mimir.
@56quarters I figured that out yesterday, and it solved my issue :).
Thanks for looking into it.
As I already mentioned, I only have one Alloy instance running non-clustered, but adding a different mimir_namespace_prefix to each of the mimir.rules.kubernetes blocks gets rid of the constant delete/recreate cycle.
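A minimal sketch of that layout; the component labels, address, tenant, and prefix values are illustrative, not my actual configuration:
mimir.rules.kubernetes "alerts" {
  address                = "https://mimir.example.com"
  tenant_id              = "my-tenant"
  mimir_namespace_prefix = "alloy-alerts"
}

mimir.rules.kubernetes "recording" {
  address                = "https://mimir.example.com"
  tenant_id              = "my-tenant"
  mimir_namespace_prefix = "alloy-recording"
}
With distinct prefixes, each component only reconciles rule groups under its own Mimir namespace prefix, so the two components stop deleting each other's groups.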
Thanks @56quarters for suggesting this!