
Alloy DaemonSet stopped working properly with `was collected before with the same name and label values`

alex-berger opened this issue 1 year ago · 7 comments

What's wrong?

A few days ago, some of my Alloy DaemonSet Pods suddenly stopped working properly, resulting in server errors (HTTP 500) when accessing their /metrics endpoint. The DaemonSet was still running, with no container crashes or restarts (the readinessProbe was still passing), but the Pods were clearly no longer working properly.

This looks like it was caused by errors of the kind `was collected before with the same name and label values`. I had not observed this before, and after restarting the Pods the problem disappeared, so I suspect there is some kind of race-condition bug in Alloy. Here is an example of the output from the /metrics endpoint:

An error has occurred while serving metrics:

15 error(s) occurred:
* collected metric "workload_log_lines_total" { label:{name:"app" value:"loki"} label:{name:"component" value:"gateway"} label:{name:"component_id" value:"loki.process.pod_logs"} label:{name:"component_path" value:"/"} label:{name:"container" value:"nginx"} label:{name:"controller" value:"loki-gateway-58bdc5cfc9"} label:{name:"controller_kind" value:"ReplicaSet"} label:{name:"flags" value:"F"} label:{name:"host_ip" value:"10.176.45.209"} label:{name:"instance" value:"loki-system-loki-distributed"} label:{name:"ip" value:"10.176.44.205"} label:{name:"job" value:"loki-system/loki-gateway-58bdc5cfc9-xt256"} label:{name:"namespace" value:"loki-system"} label:{name:"node_name" value:"ip-10-176-45-209.eu-central-1.compute.internal"} label:{name:"pod" value:"loki-gateway-58bdc5cfc9-xt256"} label:{name:"stream" value:"stderr"} counter:{value:1682 created_timestamp:{seconds:1719579313 nanos:848869559}}} was collected before with the same name and label values
* collected metric "workload_log_lines_total" { label:{name:"app" value:"cilium-agent"} label:{name:"component_id" value:"loki.process.pod_logs"} label:{name:"component_path" value:"/"} label:{name:"container" value:"cilium-agent"} label:{name:"controller" value:"cilium"} label:{name:"controller_kind" value:"DaemonSet"} label:{name:"flags" value:"F"} label:{name:"host_ip" value:"10.176.45.209"} label:{name:"ip" value:"10.176.45.209"} label:{name:"job" value:"kube-system/cilium-9gcgw"} label:{name:"namespace" value:"kube-system"} label:{name:"node_name" value:"ip-10-176-45-209.eu-central-1.compute.internal"} label:{name:"part_of" value:"cilium"} label:{name:"pod" value:"cilium-9gcgw"} label:{name:"stream" value:"stderr"} counter:{value:104 created_timestamp:{seconds:1719579314 nanos:850782828}}} was collected before with the same name and label values
* collected metric "workload_log_lines_total" { label:{name:"app" value:"datadog"} label:{name:"component" value:"agent"} label:{name:"component_id" value:"loki.process.pod_logs"} label:{name:"component_path" value:"/"} label:{name:"container" value:"agent"} label:{name:"controller" value:"datadog"} label:{name:"controller_kind" value:"DaemonSet"} label:{name:"flags" value:"F"} label:{name:"host_ip" value:"10.176.45.209"} label:{name:"instance" value:"datadog"} label:{name:"ip" value:"10.176.32.131"} label:{name:"job" value:"kube-system/datadog-7wj5j"} label:{name:"namespace" value:"kube-system"} label:{name:"node_name" value:"ip-10-176-45-209.eu-central-1.compute.internal"} label:{name:"pod" value:"datadog-7wj5j"} label:{name:"stream" value:"stdout"} counter:{value:582 created_timestamp:{seconds:1719579319 nanos:108932331}}} was collected before with the same name and label values
* collected metric "workload_log_lines_total" { label:{name:"app" value:"datadog"} label:{name:"component" value:"agent"} label:{name:"component_id" value:"loki.process.pod_logs"} label:{name:"component_path" value:"/"} label:{name:"container" value:"trace-agent"} label:{name:"controller" value:"datadog"} label:{name:"controller_kind" value:"DaemonSet"} label:{name:"flags" value:"F"} label:{name:"host_ip" value:"10.176.45.209"} label:{name:"instance" value:"datadog"} label:{name:"ip" value:"10.176.32.131"} label:{name:"job" value:"kube-system/datadog-7wj5j"} label:{name:"namespace" value:"kube-system"} label:{name:"node_name" value:"ip-10-176-45-209.eu-central-1.compute.internal"} label:{name:"pod" value:"datadog-7wj5j"} label:{name:"stream" value:"stdout"} counter:{value:55 created_timestamp:{seconds:1719579370 nanos:107981085}}} was collected before with the same name and label values
...

What immediately caught my attention is that the affected metrics all come from stage.metrics blocks (inside a loki.process component). See the configuration below for more details.
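
For context on where this error message comes from: it is produced by the Prometheus Go client (client_golang, which Alloy uses for its own /metrics endpoint) when two sources emit a series with the same name and label values into one registry during a single scrape. The standalone sketch below is not Alloy's actual code; the collector type, metric name, and label value are made up for illustration. It does, however, reproduce the same 500 response and matches the theory that a config reload could leave an "old" and a "new" copy of the stage.metrics collectors registered at the same time.

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// pipelineCollector stands in for a metrics stage that exports one counter series.
// Describe deliberately sends nothing, so the registry treats the collector as
// "unchecked" and does not complain when a second copy is registered.
type pipelineCollector struct{ desc *prometheus.Desc }

func (c *pipelineCollector) Describe(chan<- *prometheus.Desc) {}

func (c *pipelineCollector) Collect(ch chan<- prometheus.Metric) {
	ch <- prometheus.MustNewConstMetric(c.desc, prometheus.CounterValue, 1, "loki.process.pod_logs")
}

func main() {
	desc := prometheus.NewDesc("workload_log_lines_total", "total number of log lines",
		[]string{"component_id"}, nil)
	reg := prometheus.NewRegistry()

	// Simulate the suspected race: the "old" and the "new" instance of the
	// component are both registered at the same time after a reload.
	reg.MustRegister(&pipelineCollector{desc})
	reg.MustRegister(&pipelineCollector{desc})

	// Every scrape now fails. curl localhost:12345/metrics returns HTTP 500 with:
	//   An error has occurred while serving metrics:
	//   1 error(s) occurred:
	//   * collected metric "workload_log_lines_total" {...} was collected before
	//     with the same name and label values
	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	log.Fatal(http.ListenAndServe(":12345", nil))
}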

Steps to reproduce

This happens sporadically and I have not yet figured out how to reproduce it reliably. That is also why I suspect this is a race-condition bug.

System information

Linux 6.1.92, amd64 and arm64, Bottlerocket OS 1.20.3 (aws-k8s-1.29)

Software version

Grafana Alloy 1.2.0

Configuration

logging {
  level  = "warn"
  format = "json"
}

loki.write "default" {
  endpoint {
    url = "http://loki-gateway.loki-system.svc.cluster.local/loki/api/v1/push"
    remote_timeout = "5s"
    batch_wait = "10s"
  }
  wal {
    enabled = true
    max_segment_age = "15m"
  }
}

prometheus.remote_write "default" {
    endpoint {
      url = "http://cortex-nginx.cortex-system/api/v1/push"
      remote_timeout = "5s"
      // Disable exemplars as work around for https://github.com/grafana/alloy/issues/1117
      send_exemplars = false
      metadata_config {
        send = true
      }
      queue_config {
        capacity = 10000
        max_samples_per_send = 10000
        batch_send_deadline = "5s"
        min_backoff = "30ms"
        max_backoff = "5s"
        retry_on_http_429 = true
      }
    }
    wal {
      truncate_frequency = "15m"
      max_keepalive_time = "30m"
    }
}

prometheus.exporter.self "alloy" {}

prometheus.scrape "alloy" {
  targets    = prometheus.exporter.self.alloy.targets
  forward_to = [prometheus.relabel.alloy.receiver]
  honor_labels = true
  scrape_interval = "60s"
}

prometheus.relabel "alloy" {
  forward_to = [prometheus.remote_write.default.receiver]
  rule {
    source_labels = ["instance"]
    target_label = "node_name"
  }
  rule {
    replacement = env("POD_NAME")
    target_label = "pod"
  }
  rule {
    replacement = env("POD_NAMESPACE")
    target_label = "namespace"
  }
  rule {
    action = "labeldrop"
    regex = "instance"
  }
  rule {
    source_labels = ["instance"]
    replacement = "alloy-daemonset/" + env("POD_NAME")
    target_label = "job"
  }
}

discovery.kubelet "local_pods" {
  url = "https://" + env("HOSTNAME") + ":10250"
  bearer_token_file = "/var/run/secrets/kubernetes.io/serviceaccount/token"
  tls_config {
    ca_file = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
    insecure_skip_verify = false
  }
}

discovery.relabel "pod_logs" {
  targets = discovery.kubelet.local_pods.targets
  rule {
    source_labels = ["__meta_kubernetes_namespace"]
    target_label  = "namespace"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_name"]
    target_label  = "pod"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_container_name"]
    target_label  = "container"
  }
  rule {
    source_labels = ["__meta_kubernetes_namespace", "__meta_kubernetes_pod_name"]
    separator     = "/"
    target_label  = "job"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_uid", "__meta_kubernetes_pod_container_name"]
    separator     = "/"
    action        = "replace"
    replacement   = "/var/log/pods/*$1/*.log"
    target_label  = "__path__"
  }
  rule {
    action = "replace"
    source_labels = ["__meta_kubernetes_pod_container_id"]
    regex = "^(\\w+):\\/\\/.+$"
    replacement = "$1"
    target_label = "tmp_container_runtime"
  }
  rule {
        action = "replace"
        source_labels = ["__meta_kubernetes_pod_ip"]
        target_label = "ip"
  }
  rule {
        action = "replace"
        source_labels = ["__meta_kubernetes_pod_host_ip"]
        target_label = "host_ip"
  }
  rule {
        action = "replace"
        source_labels = ["__meta_kubernetes_pod_node_name"]
        target_label = "node_name"
  }
  rule {
        action = "replace"
        regex = "(.+)"
        source_labels = ["__meta_kubernetes_pod_controller_kind"]
        target_label = "controller_kind"
  }
  rule {
        action = "replace"
        regex = "(.+)"
        source_labels = ["__meta_kubernetes_pod_controller_name"]
        target_label = "controller"
  }
  rule {
        action = "replace"
        regex = "^(.+?)(-[a-z0-9]+){0,2}$"
        source_labels = ["__meta_kubernetes_pod_name"]
        target_label = "app"
  }
  rule {
        action = "replace"
        regex = "^(.+?)(-[a-z0-9]+){0,1}$"
        source_labels = ["__meta_kubernetes_pod_controller_name"]
        target_label = "app"
  }
  rule {
        action = "replace"
        regex = "(.+)"
        source_labels = ["__meta_kubernetes_pod_label_app"]
        target_label = "app"
  }
  rule {
        action = "replace"
        regex = "(.+)"
        source_labels = ["__meta_kubernetes_pod_label_app_kubernetes_io_name"]
        target_label = "app"
  }
  rule {
        action = "replace"
        regex = "(.+)"
        source_labels = ["__meta_kubernetes_pod_label_component"]
        target_label = "component"
  }
  rule {
        action = "replace"
        regex = "(.+)"
        source_labels = ["__meta_kubernetes_pod_label_app_kubernetes_io_component"]
        target_label = "component"
  }
  rule {
        action = "replace"
        regex = "(.+)"
        source_labels = ["__meta_kubernetes_pod_label_app_kubernetes_io_version"]
        target_label = "version"
  }
  rule {
        action = "replace"
        regex = "(.+)"
        source_labels = ["__meta_kubernetes_pod_label_app_kubernetes_io_part_of"]
        target_label = "part_of"
  }
  rule {
        action = "replace"
        regex = "(.+)"
        source_labels = ["__meta_kubernetes_pod_label_app_kubernetes_io_instance"]
        target_label = "instance"
  }
}

local.file_match "pod_logs" {
  path_targets = discovery.relabel.pod_logs.output
}

loki.source.file "pod_logs" {
  targets    = local.file_match.pod_logs.targets
  forward_to = [loki.process.pod_logs.receiver]
}

// basic processing to parse the container format. You can add additional processing stages
// to match your application logs.
loki.process "pod_logs" {
  stage.match {
    selector = "{tmp_container_runtime=\"containerd\"}"
    // the cri processing stage extracts the following k/v pairs: log, stream, time, flags
    stage.cri {}
    // Set the extract flags and stream values as labels
    stage.labels {
      values = {
        flags   = "",
        stream  = "",
      }
    }
  }

  // drop the temporary container runtime label as it is no longer needed
  stage.label_drop {
    values = ["tmp_container_runtime", "filename"]
  }

  // Remove the trailing newlines from JSON.
  // While trailing newlines are stripped by logcli, they would lead to intermittent empty lines for multiline logs.
  stage.replace {
    expression = "(\n)$"
    replace = ""
  }

  stage.multiline {
    firstline = "^\\x{200B}\\x{200D}\\x{200B}"
    max_wait_time = "3s"
  }

  stage.metrics {
    metric.counter {
        name        = "log_lines_total"
        description = "total number of log lines"
        prefix      = "workload_"
        match_all         = true
        action            = "inc"
        max_idle_duration = "1h"
    }
    metric.counter {
        name        = "raw_log_bytes_total"
        description = "total bytes of log lines before truncating long lines"
        prefix      = "workload_"
        match_all         = true
        count_entry_bytes = true
        action            = "add"
        max_idle_duration = "1h"
    }
  }

  /* //TODO enable this later
  // Truncate log lines longer than 256KiB
  stage.replace {
    expression = ".+"
    replace = "{{if gt (.Value | len) 262144 }}{{slice .Value 0 262144}}...(truncated after 262144 of {{(.Value | len)}} characters){{else}}{{.Value}}{{end}}"
  }
  */

  stage.metrics {
    metric.counter {
        name        = "truncated_log_bytes_total"
        description = "total bytes of log lines after tuncating long lines"
        prefix      = "workload_"
        match_all         = true
        count_entry_bytes = true
        action            = "add"
        max_idle_duration = "1h"
    }
  }

  forward_to = [loki.write.default.receiver]
}

Logs

<<empty log, no errors or warnings in the logs>>

alex-berger · Jul 01 '24 07:07

Not 100% sure yet, but it looks like I can provoke this by updating the configuration (ConfigMap); the resulting config reload seems to trigger the behavior described above. In my case I simply change logging.level back and forth between info and warn to make sure the config actually changes. It is still racy, as not all Pods of the DaemonSet become nonfunctional, but some do.

alex-berger · Jul 01 '24 14:07

As a workaround we added a livenessProbe to the Alloy DaemonSet, which makes sure the alloy container is restarted once its /metrics endpoint becomes unhealthy (starts responding with a 5XX status code).

    containers:
    - name: alloy
      livenessProbe:
        httpGet:
          path: /metrics
          port: 12345
          scheme: HTTP
        initialDelaySeconds: 30
        timeoutSeconds: 2
        periodSeconds: 30
        successThreshold: 1
        failureThreshold: 3
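
For what it is worth, the reason a probe on /metrics catches this state is that the Prometheus Go client's HTTP handler, with its default error handling (HTTPErrorOnError), turns any gather error into an HTTP 500 response. The sketch below assumes Alloy serves its metrics via promhttp with default handler options (the 500 responses above are consistent with that); it is not Alloy's actual server code.

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	reg := prometheus.NewRegistry()

	// Default behaviour (the zero value of HandlerOpts.ErrorHandling): any error
	// returned by Gather(), such as the duplicate-series error above, fails the
	// whole scrape with status 500, which is what the livenessProbe relies on.
	strict := promhttp.HandlerFor(reg, promhttp.HandlerOpts{
		ErrorHandling: promhttp.HTTPErrorOnError,
	})

	// Alternative mode for comparison: serve whatever could be gathered and keep
	// returning 200. With this setting a probe on /metrics would NOT detect the bug.
	lenient := promhttp.HandlerFor(reg, promhttp.HandlerOpts{
		ErrorHandling: promhttp.ContinueOnError,
	})

	http.Handle("/metrics", strict)
	http.Handle("/metrics-lenient", lenient)
	log.Fatal(http.ListenAndServe(":12345", nil))
}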

alex-berger · Jul 02 '24 15:07

This issue has not had any activity in the past 30 days, so the needs-attention label has been added to it. If the opened issue is a bug, check to see if a newer release fixed your issue. If it is no longer relevant, please feel free to close this issue. The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your issue will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity. Thank you for your contributions!

github-actions[bot] · Aug 02 '24 00:08

Have the same issue, but with a StatefulSet.

slimes28 · Sep 05 '24 13:09

We're seeing the same issue with the /metrics endpoint when running Alloy 1.3.1-1 via a systemd service directly on servers (as a replacement for promtail).

It seems to happen in approximately 20-25% of alloy.service restarts across our fleet of servers.

As above, it looks to be complaining about metrics generated by stage.metrics rather than the built-in Alloy metrics, e.g.:

$ curl localhost:12345/metrics
An error has occurred while serving metrics:

2 error(s) occurred:
* collected metric "journald_severity_lines_total" { label:{name:"component_id" value:"loki.process.journal"} label:{name:"component_path" value:"/"} label:{name:"identifier" value:"dns"} label:{name:"job" value:"loki.source.journal.journal"} label:{name:"priority" value:"info"} label:{name:"unit" value:"dns.service"} counter:{value:1 created_timestamp:{seconds:1727273371 nanos:383254028}}} was collected before with the same name and label values
* collected metric "journald_severity_lines_total" { label:{name:"component_id" value:"loki.process.journal"} label:{name:"component_path" value:"/"} label:{name:"identifier" value:"dns"} label:{name:"job" value:"loki.source.journal.journal"} label:{name:"priority" value:"info"} label:{name:"unit" value:"dns.service"} counter:{value:5981 created_timestamp:{seconds:1727273371 nanos:384823601}}} was collected before with the same name and label values

And the relevant config file section:

loki.process "journal" {
        forward_to = [loki.write.default.receiver]

        stage.match {
                selector = "{unit=~\".+\"}"

                stage.label_drop {
                        values = ["cmdline", "exe", "priorityval"]
                }

                stage.metrics {
                        metric.counter {
                                name              = "severity_lines_total"
                                description       = "total number of log lines"
                                prefix            = "journald_"
                                max_idle_duration = "24h"
                                match_all         = true
                                action            = "inc"
                        }
                }

                stage.labels {
                        values = {
                                cmdline     = null,
                                exe         = null,
                                priorityval = null,
                        }
                }
        }
}

Edit: Spotted this commit, where something similar was happening with loki.source.heroku: https://github.com/grafana/alloy/commit/266484c701e123326d508ececf8a9112a3351b6e

heapdavid · Sep 25 '24 14:09

Actually, on further testing, mine seems to be broken by the second stage.labels stanza. Changing it to the following fixes the issue on restart:

                stage.labels {
                        values = {
                                cmdline     = "cmdline",
                                exe         = "exe",
                                priorityval = "priorityval",
                        }
                }

Seems different enough from yours to be a separate problem; I will open a new issue.

heapdavid · Sep 26 '24 14:09

Btw, this issue was already present in Promtail and was reported in https://github.com/grafana/loki/issues/8174...

alidorludovic · Sep 27 '24 14:09

~~@alex-berger Please report this issue to the Grafana Alloy repository.~~

chaudum · Oct 19 '24 18:10

Sorry, I came here from an issue in Loki, and was under the impression that I was looking at an issue in the Grafana/Loki repository.

chaudum · Oct 19 '24 19:10