Mimir.rules.kubernetes keeps deleting and re-creating rules on grafana cloud

Open Lp-Francois opened this issue 10 months ago • 12 comments

What's wrong?

I am using the mimir.rules.kubernetes block with the latest version of Grafana Agent (flow mode), docker.io/grafana/agent:v0.40.3.

It uploads my PrometheusRule to the Grafana Cloud remote Mimir instance, but in the UI I can see my alerts constantly being deleted and then recreated, alternating between three states. Here are 3 screenshots:

(Screenshots taken 2024-03-29 at 08:53:05, 08:53:29, and 08:54:31.)

Steps to reproduce

  1. Install the agent using Helm in a Kubernetes cluster
  2. Use this in the values.yaml:
        extraConfig: |-
          // documentation: https://grafana.com/docs/agent/latest/flow/reference/components/mimir.rules.kubernetes/
          mimir.rules.kubernetes "default" {
            // the secret needs to be referenced by a remote.kubernetes.secret block (done by the config in externalServices; see the sketch after these steps)
            address = nonsensitive(remote.kubernetes.secret.logs_service.data["MIMIR_ADDRESS"])
            basic_auth {
              username = nonsensitive(remote.kubernetes.secret.logs_service.data["MIMIR_TENANT_ID"])
              password = remote.kubernetes.secret.logs_service.data["MIMIR_API_KEY"]
            }
          }
  3. Add a PrometheusRule in a namespace you created, containing several alerts, for example:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    role: alert-rules
  name: my-api-prometheus
  namespace: pr-xxxx-yyyy
spec:
  groups:
    - name: alerts-my-api
      rules:
        - alert: BlackboxProbeFailed
          annotations:
            description: Service my-api is down for more than 2 minutes.
            summary: my-api API is down!
          expr: probe_success{service="my-api"} == 0
          for: 2m
          labels:
            service: my-api
            severity: warning
        - alert: KubernetesPodCrashLooping
          annotations:
            description: Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping.
            summary: Kubernetes pod crash looping (instance {{ $labels.instance }})
          expr: |-
            increase(
              kube_pod_container_status_restarts_total{pod=~"my-api.*", namespace="pr-xxxx-yyyy"}[1m]
            ) > 3
          for: 2m
          labels:
            service: my-api
            severity: warning
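
For reference, the remote.kubernetes.secret block that step 2 relies on (created by the externalServices part of the chart) might look roughly like the sketch below; the namespace and secret name are placeholders, not taken from the actual setup:

remote.kubernetes.secret "logs_service" {
  // Placeholder values: point this at the Kubernetes Secret that holds
  // MIMIR_ADDRESS, MIMIR_TENANT_ID, and MIMIR_API_KEY.
  namespace = "monitoring"
  name      = "grafana-cloud-credentials"
}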

System information

Agent is running on Linux amd64 t3a.medium (AWS - EKS)

Software version

agent:v0.40.3

Configuration

No response

Logs

ts=2024-03-29T08:43:00.958144643Z level=info msg="finished node evaluation" controller_id="" node_id=prometheus.scrape.kubelet duration=319.949µs
ts=2024-03-29T08:43:00.95898935Z level=info msg="finished node evaluation" controller_id="" node_id=discovery.relabel.cadvisor duration=2.616273ms
ts=2024-03-29T08:43:00.959479126Z level=info msg="finished node evaluation" controller_id="" node_id=prometheus.scrape.cadvisor duration=428.004µs
ts=2024-03-29T08:43:06.164666761Z level=info msg="added rule group" component=mimir.rules.kubernetes.default namespace=agent/pr-2646-unify-docker-postgres-in-1/authentication-prometheus/f71bd154-6cd1-4749-b25e-0b1a3eb5ecbf group=alerts-authentication
ts=2024-03-29T08:43:35.99228665Z level=info msg="rejoining peers" peers=10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80,10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80,10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80,10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80
ts=2024-03-29T08:43:36.092533519Z level=info msg="added rule group" component=mimir.rules.kubernetes.default namespace=agent/pr-2646-unify-docker-postgres-in-1/authentication-prometheus/f71bd154-6cd1-4749-b25e-0b1a3eb5ecbf group=alerts-authentication
ts=2024-03-29T08:44:06.137868933Z level=info msg="added rule group" component=mimir.rules.kubernetes.default namespace=agent/pr-2646-unify-docker-postgres-in-1/authentication-prometheus/f71bd154-6cd1-4749-b25e-0b1a3eb5ecbf group=alerts-authentication
ts=2024-03-29T08:44:35.992238997Z level=info msg="rejoining peers" peers=10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80,10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80,10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80,10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80
ts=2024-03-29T08:44:36.074239084Z level=info msg="added rule group" component=mimir.rules.kubernetes.default namespace=agent/pr-2646-unify-docker-postgres-in-1/authentication-prometheus/f71bd154-6cd1-4749-b25e-0b1a3eb5ecbf group=alerts-authentication
ts=2024-03-29T08:44:40.957206227Z level=info msg="finished node evaluation" controller_id="" node_id=discovery.relabel.kubelet duration=585.428µs
ts=2024-03-29T08:44:40.957614201Z level=info msg="finished node evaluation" controller_id="" node_id=prometheus.scrape.kubelet duration=280.44µs
ts=2024-03-29T08:44:40.958029334Z level=info msg="finished node evaluation" controller_id="" node_id=discovery.relabel.cadvisor duration=1.418136ms
ts=2024-03-29T08:44:40.958610323Z level=info msg="finished node evaluation" controller_id="" node_id=prometheus.scrape.cadvisor duration=475.486µs
ts=2024-03-29T08:44:46.237538954Z level=info msg="processing event" component=mimir.rules.kubernetes.default type=resource-changed key=pr-2646-unify-docker-postgres-in-1/authentication-prometheus
ts=2024-03-29T08:44:46.340554683Z level=info msg="updated rule group" component=mimir.rules.kubernetes.default namespace=agent/pr-2646-unify-docker-postgres-in-1/authentication-prometheus/f71bd154-6cd1-4749-b25e-0b1a3eb5ecbf group=alerts-authentication
ts=2024-03-29T08:45:06.096739738Z level=info msg="added rule group" component=mimir.rules.kubernetes.default namespace=agent/pr-2646-unify-docker-postgres-in-1/authentication-prometheus/f71bd154-6cd1-4749-b25e-0b1a3eb5ecbf group=alerts-authentication
ts=2024-03-29T08:45:35.992486801Z level=info msg="rejoining peers" peers=10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80,10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80,10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80,10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80
ts=2024-03-29T08:45:36.112461898Z level=info msg="added rule group" component=mimir.rules.kubernetes.default namespace=agent/pr-2646-unify-docker-postgres-in-1/authentication-prometheus/f71bd154-6cd1-4749-b25e-0b1a3eb5ecbf group=alerts-authentication
ts=2024-03-29T08:46:05.957054716Z level=info msg="finished node evaluation" controller_id="" node_id=discovery.relabel.kubelet duration=635.101µs
ts=2024-03-29T08:46:05.957656296Z level=info msg="finished node evaluation" controller_id="" node_id=prometheus.scrape.kubelet duration=501.917µs
ts=2024-03-29T08:46:05.958030929Z level=info msg="finished node evaluation" controller_id="" node_id=discovery.relabel.cadvisor duration=1.640294ms
ts=2024-03-29T08:46:05.958501184Z level=info msg="finished node evaluation" controller_id="" node_id=prometheus.scrape.cadvisor duration=265.988µs
ts=2024-03-29T08:46:06.100316322Z level=info msg="added rule group" component=mimir.rules.kubernetes.default namespace=agent/pr-2646-unify-docker-postgres-in-1/authentication-prometheus/f71bd154-6cd1-4749-b25e-0b1a3eb5ecbf group=alerts-authentication
ts=2024-03-29T08:46:35.992353439Z level=info msg="rejoining peers" peers=10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80,10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80,10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80,10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80
ts=2024-03-29T08:46:36.079171221Z level=info msg="added rule group" component=mimir.rules.kubernetes.default namespace=agent/pr-2646-unify-docker-postgres-in-1/authentication-prometheus/f71bd154-6cd1-4749-b25e-0b1a3eb5ecbf group=alerts-authentication
ts=2024-03-29T08:47:06.093045379Z level=info msg="added rule group" component=mimir.rules.kubernetes.default namespace=agent/pr-2646-unify-docker-postgres-in-1/authentication-prometheus/f71bd154-6cd1-4749-b25e-0b1a3eb5ecbf group=alerts-authentication
ts=2024-03-29T08:47:35.992322813Z level=info msg="rejoining peers" peers=10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80,10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80,10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80,10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80
ts=2024-03-29T08:47:36.107329395Z level=info msg="added rule group" component=mimir.rules.kubernetes.default namespace=agent/pr-2646-unify-docker-postgres-in-1/authentication-prometheus/f71bd154-6cd1-4749-b25e-0b1a3eb5ecbf group=alerts-authentication

Lp-Francois avatar Mar 29 '24 08:03 Lp-Francois

Hi there :wave:

On April 9, 2024, Grafana Labs announced Grafana Alloy, the spiritual successor to Grafana Agent and the final form of Grafana Agent flow mode. As a result, Grafana Agent has been deprecated and will only receive bug and security fixes until its end-of-life around November 1, 2025.

To make things easier for maintainers, we're in the process of migrating all issues tagged variant/flow to the Grafana Alloy repository to have a single home for tracking issues. This issue is likely something we'll want to address in both Grafana Alloy and Grafana Agent, so just because it's being moved doesn't mean we won't address the issue in Grafana Agent :)

rfratto avatar Apr 11 '24 20:04 rfratto

Okay thanks for your message @rfratto :)

Lp-Francois avatar Apr 12 '24 13:04 Lp-Francois

This issue has not had any activity in the past 30 days, so the needs-attention label has been added to it. If the opened issue is a bug, check to see if a newer release fixed your issue. If it is no longer relevant, please feel free to close this issue. The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your issue will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity. Thank you for your contributions!

github-actions[bot] avatar May 13 '24 00:05 github-actions[bot]

We are observing this same thing happening with our Alloy and self-hosted Mimir.

juupas avatar Jun 10 '24 08:06 juupas

Sorry for the delay on an update. Clustered Alloy instances are usually the source of this issue: multiple Alloy instances fight over which one should be writing the rules. With the 1.1 release of Alloy, mimir.rules.kubernetes is clustering-aware and avoids this issue:

Alloy version 1.1 and higher supports clustered mode in this component. When you use this component as part of a cluster of Alloy instances, only a single instance from the cluster will update rules using the Mimir API.

This fix will be backported to Grafana Agent in the near future.

If you are not using clustering, double-check that there aren't multiple Alloy instances running and synchronizing the same PrometheusRule resources with Mimir.

rfratto avatar Jun 11 '24 16:06 rfratto

I'm not using clustering, and there is only one Alloy instance, which has 2 separate mimir.rules.kubernetes components configured.

Alloy v1.1.1 in use.

The Mimir ruler logs contain entries like these:

ts=2024-06-17T12:52:08.732120906Z caller=ruler.go:564 level=info msg="syncing rules" reason=periodic
ts=2024-06-17T12:56:58.749657873Z caller=ruler.go:564 level=info msg="syncing rules" reason=api-change
ts=2024-06-17T12:56:58.952421482Z caller=mapper.go:166 level=info msg="updating rule file" file=/data/anonymous/alloy%2Fmonitoring%2Fmimir-alerts%2Fb0a4da42-9f74-4ff7-876c-5ee63ba12173
ts=2024-06-17T12:56:58.954207552Z caller=mapper.go:166 level=info msg="updating rule file" file=/data/anonymous/alloy%2Fmonitoring%2Fmimir-rules%2Fade2d112-54fe-4a69-865d-9a67eef2f6ad
ts=2024-06-17T12:57:08.749366526Z caller=ruler.go:564 level=info msg="syncing rules" reason=api-change
ts=2024-06-17T13:01:47.142018894Z caller=spanlogger.go:109 method=API.ListRules user=anonymous level=info msg="no rule groups found" userID=anonymous
ts=2024-06-17T13:01:58.749370709Z caller=ruler.go:564 level=info msg="syncing rules" reason=api-change
ts=2024-06-17T13:01:58.920564371Z caller=mapper.go:166 level=info msg="updating rule file" file=/data/anonymous/alloy%2Fmonitoring%2Floki-alerts%2Fc2cd8835-9964-4834-84bd-e01211dfb7c8
ts=2024-06-17T13:01:58.920796737Z caller=mapper.go:166 level=info msg="updating rule file" file=/data/anonymous/alloy%2Fmonitoring%2Floki-rules%2F269bcf08-7829-4efa-a45f-2fdefc2f37ac
ts=2024-06-17T13:02:57.830975946Z caller=ruler.go:564 level=info msg="syncing rules" reason=periodic
ts=2024-06-17T13:06:58.749411084Z caller=ruler.go:564 level=info msg="syncing rules" reason=api-change
ts=2024-06-17T13:06:58.942997428Z caller=mapper.go:166 level=info msg="updating rule file" file=/data/anonymous/alloy%2Fmonitoring%2Fmimir-alerts%2Fb0a4da42-9f74-4ff7-876c-5ee63ba12173
ts=2024-06-17T13:06:58.944327752Z caller=mapper.go:166 level=info msg="updating rule file" file=/data/anonymous/alloy%2Fmonitoring%2Fmimir-rules%2Fade2d112-54fe-4a69-865d-9a67eef2f6ad
ts=2024-06-17T13:11:58.749848971Z caller=ruler.go:564 level=info msg="syncing rules" reason=api-change
ts=2024-06-17T13:11:58.923523066Z caller=mapper.go:166 level=info msg="updating rule file" file=/data/anonymous/alloy%2Fmonitoring%2Floki-alerts%2Fc2cd8835-9964-4834-84bd-e01211dfb7c8
ts=2024-06-17T13:11:58.923779667Z caller=mapper.go:166 level=info msg="updating rule file" file=/data/anonymous/alloy%2Fmonitoring%2Floki-rules%2F269bcf08-7829-4efa-a45f-2fdefc2f37ac

juupas avatar Jun 17 '24 13:06 juupas

@rfratto I am experiencing the same.

I disabled clustering and set the StatefulSet replicas to 1; however, Alloy keeps recreating the rules:

ts=2024-07-01T14:27:14.081313222Z level=info msg="removed rule group" component_path=/ component_id=mimir.rules.kubernetes.grafana_mimir namespace=alloy/default/<redacted>/7cc51093-3400-4e49-bb15-910a5b0e2076 group=<redacted>
ts=2024-07-01T14:27:14.211113442Z level=info msg="added rule group" component_path=/ component_id=mimir.rules.kubernetes.grafana_mimir namespace=alloy/default/<redacted>/7cc51093-3400-4e49-bb15-910a5b0e2076 group=<redacted>

I am running 1.1.x:

alloy, version v1.1.0 (branch: HEAD, revision: cf46a1491)
  build user:       root@buildkitsandbox
  build date:       2024-05-14T21:07:39Z
  go version:       go1.22.3
  platform:         linux/amd64
  tags:             netgo,builtinassets,promtail_journal_enabled

iarlyy avatar Jul 01 '24 14:07 iarlyy

Alright, I think I found what is causing this never-ending loop of rule recreation:

I have Alloy installed in multiple clusters and enabled mimir.rules.kubernetes in all of them; however, they all talk to a central Mimir ruler (single tenant).

I noticed that cluster A's Alloy deletes cluster B's recording rules and vice versa, and each instance then recreates only the rules that exist in its own local state.

https://github.com/grafana/alloy/blob/5d7b707eafe3096e1e477cda600fac8e976f4734/internal/component/loki/rules/kubernetes/events.go#L105

Is there a correct configuration for this setup when not using multiple tenants?

iarlyy avatar Jul 01 '24 15:07 iarlyy

@56quarters ^ Do the Mimir folks have any opinions about how this should be handled from clients?

rfratto avatar Jul 02 '24 18:07 rfratto

I believe the mimir_namespace_prefix option is intended to fix the case where you have multiple clusters, each with its own Alloy setup, talking to a single central Mimir.

In your case @iarlyy I think you'd want something like this:

Cluster A config:

mimir.rules.kubernetes "local" {
    address = "mimir:8080"
    tenant_id = "whatever"
    mimir_namespace_prefix = "alloy-a"
}

Cluster B config:

mimir.rules.kubernetes "local" {
    address = "mimir:8080"
    tenant_id = "whatever"
    mimir_namespace_prefix = "alloy-b"
}

This would ensure the Alloy instances (Alloys?) for each cluster are making changes to different sets of rules in Mimir. The prefix becomes the first segment of the rule namespaces the component creates (the agent/... and alloy/... segments visible in the logs above), so each instance only manages and prunes namespaces under its own prefix.

56quarters avatar Jul 02 '24 18:07 56quarters

@56quarters I figured that out yesterday, and it solved my issue :).

Thanks for looking into it.

iarlyy avatar Jul 03 '24 08:07 iarlyy

As I already mentioned, I only have one Alloy running non-clustered, but adding a different mimir_namespace_prefix to each of the mimir.rules.kubernetes blocks gets rid of the constant delete/recreate cycle; a sketch of that kind of setup is below.
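
For reference, a minimal sketch of one Alloy running two mimir.rules.kubernetes components under different prefixes; the address, tenant, prefix values, and the role-label split are placeholders and assumptions, not the actual configuration:

// Syncs PrometheusRules labeled role=alert-rules under the "alloy-alerts" prefix.
mimir.rules.kubernetes "alerts" {
  address                = "http://mimir-ruler:8080"
  tenant_id              = "anonymous"
  mimir_namespace_prefix = "alloy-alerts"

  rule_selector {
    match_labels = {
      role = "alert-rules"
    }
  }
}

// Syncs PrometheusRules labeled role=recording-rules under the "alloy-rules" prefix.
mimir.rules.kubernetes "rules" {
  address                = "http://mimir-ruler:8080"
  tenant_id              = "anonymous"
  mimir_namespace_prefix = "alloy-rules"

  rule_selector {
    match_labels = {
      role = "recording-rules"
    }
  }
}

With distinct prefixes, each component only touches the Mimir rule namespaces it owns and stops deleting the other's groups.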

Thanks @56quarters for suggesting this!

juupas avatar Aug 21 '24 07:08 juupas