Alloy DaemonSet stopped working properly with `was collected before with the same name and label values`
What's wrong?
A few days ago, some of my Alloy DaemonSet Pods suddenly stopped working properly, returning server errors (500) when their /metrics endpoint was accessed. The DaemonSet was still running with no container crashes or restarts (the readinessProbe was fine), but the affected Pods were no longer working properly.
This looks to be caused by errors of the kind `was collected before with the same name and label values`. I have not observed this before, and after restarting the Pods the problem disappeared, so I suspect there is some kind of race-condition bug in Alloy. Here is an example of the output from the /metrics endpoint:
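For reference, this is roughly how I checked the endpoint from my workstation (a sketch only; the names in angle brackets are placeholders and 12345 is Alloy's default HTTP listen port):
# sketch only: names in <...> are placeholders, 12345 is Alloy's default HTTP listen port
kubectl -n <alloy-namespace> port-forward pod/<alloy-pod> 12345:12345
# in a second terminal: prints 500 for affected Pods, 200 for healthy ones
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:12345/metrics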
An error has occurred while serving metrics:
15 error(s) occurred:
* collected metric "workload_log_lines_total" { label:{name:"app" value:"loki"} label:{name:"component" value:"gateway"} label:{name:"component_id" value:"loki.process.pod_logs"} label:{name:"component_path" value:"/"} label:{name:"container" value:"nginx"} label:{name:"controller" value:"loki-gateway-58bdc5cfc9"} label:{name:"controller_kind" value:"ReplicaSet"} label:{name:"flags" value:"F"} label:{name:"host_ip" value:"10.176.45.209"} label:{name:"instance" value:"loki-system-loki-distributed"} label:{name:"ip" value:"10.176.44.205"} label:{name:"job" value:"loki-system/loki-gateway-58bdc5cfc9-xt256"} label:{name:"namespace" value:"loki-system"} label:{name:"node_name" value:"ip-10-176-45-209.eu-central-1.compute.internal"} label:{name:"pod" value:"loki-gateway-58bdc5cfc9-xt256"} label:{name:"stream" value:"stderr"} counter:{value:1682 created_timestamp:{seconds:1719579313 nanos:848869559}}} was collected before with the same name and label values
* collected metric "workload_log_lines_total" { label:{name:"app" value:"cilium-agent"} label:{name:"component_id" value:"loki.process.pod_logs"} label:{name:"component_path" value:"/"} label:{name:"container" value:"cilium-agent"} label:{name:"controller" value:"cilium"} label:{name:"controller_kind" value:"DaemonSet"} label:{name:"flags" value:"F"} label:{name:"host_ip" value:"10.176.45.209"} label:{name:"ip" value:"10.176.45.209"} label:{name:"job" value:"kube-system/cilium-9gcgw"} label:{name:"namespace" value:"kube-system"} label:{name:"node_name" value:"ip-10-176-45-209.eu-central-1.compute.internal"} label:{name:"part_of" value:"cilium"} label:{name:"pod" value:"cilium-9gcgw"} label:{name:"stream" value:"stderr"} counter:{value:104 created_timestamp:{seconds:1719579314 nanos:850782828}}} was collected before with the same name and label values
* collected metric "workload_log_lines_total" { label:{name:"app" value:"datadog"} label:{name:"component" value:"agent"} label:{name:"component_id" value:"loki.process.pod_logs"} label:{name:"component_path" value:"/"} label:{name:"container" value:"agent"} label:{name:"controller" value:"datadog"} label:{name:"controller_kind" value:"DaemonSet"} label:{name:"flags" value:"F"} label:{name:"host_ip" value:"10.176.45.209"} label:{name:"instance" value:"datadog"} label:{name:"ip" value:"10.176.32.131"} label:{name:"job" value:"kube-system/datadog-7wj5j"} label:{name:"namespace" value:"kube-system"} label:{name:"node_name" value:"ip-10-176-45-209.eu-central-1.compute.internal"} label:{name:"pod" value:"datadog-7wj5j"} label:{name:"stream" value:"stdout"} counter:{value:582 created_timestamp:{seconds:1719579319 nanos:108932331}}} was collected before with the same name and label values
* collected metric "workload_log_lines_total" { label:{name:"app" value:"datadog"} label:{name:"component" value:"agent"} label:{name:"component_id" value:"loki.process.pod_logs"} label:{name:"component_path" value:"/"} label:{name:"container" value:"trace-agent"} label:{name:"controller" value:"datadog"} label:{name:"controller_kind" value:"DaemonSet"} label:{name:"flags" value:"F"} label:{name:"host_ip" value:"10.176.45.209"} label:{name:"instance" value:"datadog"} label:{name:"ip" value:"10.176.32.131"} label:{name:"job" value:"kube-system/datadog-7wj5j"} label:{name:"namespace" value:"kube-system"} label:{name:"node_name" value:"ip-10-176-45-209.eu-central-1.compute.internal"} label:{name:"pod" value:"datadog-7wj5j"} label:{name:"stream" value:"stdout"} counter:{value:55 created_timestamp:{seconds:1719579370 nanos:107981085}}} was collected before with the same name and label values
...
What immediately caught my attention is that the affected metrics all come from stage.metrics blocks (inside a loki.process component). See the configuration below for more details.
Steps to reproduce
This happens sporadically and I have not yet figured out how to reliably reproduce it, which is also why I suspect some kind of race-condition bug.
System information
Linux 6.1.92, amd64 and arm64, Bottlerocket OS 1.20.3 (aws-k8s-1.29)
Software version
Grafana Alloy 1.2.0
Configuration
logging {
level = "warn"
format = "json"
}
loki.write "default" {
endpoint {
url = "http://loki-gateway.loki-system.svc.cluster.local/loki/api/v1/push"
remote_timeout = "5s"
batch_wait = "10s"
}
wal {
enabled = true
max_segment_age = "15m"
}
}
prometheus.remote_write "default" {
endpoint {
url = "http://cortex-nginx.cortex-system/api/v1/push"
remote_timeout = "5s"
// Disable exemplars as a workaround for https://github.com/grafana/alloy/issues/1117
send_exemplars = false
metadata_config {
send = true
}
queue_config {
capacity = 10000
max_samples_per_send = 10000
batch_send_deadline = "5s"
min_backoff = "30ms"
max_backoff = "5s"
retry_on_http_429 = true
}
}
wal {
truncate_frequency = "15m"
max_keepalive_time = "30m"
}
}
prometheus.exporter.self "alloy" {}
prometheus.scrape "alloy" {
targets = prometheus.exporter.self.alloy.targets
forward_to = [prometheus.relabel.alloy.receiver]
honor_labels = true
scrape_interval = "60s"
}
prometheus.relabel "alloy" {
forward_to = [prometheus.remote_write.default.receiver]
rule {
source_labels = ["instance"]
target_label = "node_name"
}
rule {
replacement = env("POD_NAME")
target_label = "pod"
}
rule {
replacement = env("POD_NAMESPACE")
target_label = "namespace"
}
rule {
action = "labeldrop"
regex = "instance"
}
rule {
source_labels = ["instance"]
replacement = "alloy-daemonset/" + env("POD_NAME")
target_label = "job"
}
}
discovery.kubelet "local_pods" {
url = "https://" + env("HOSTNAME") + ":10250"
bearer_token_file = "/var/run/secrets/kubernetes.io/serviceaccount/token"
tls_config {
ca_file = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
insecure_skip_verify = false
}
}
discovery.relabel "pod_logs" {
targets = discovery.kubelet.local_pods.targets
rule {
source_labels = ["__meta_kubernetes_namespace"]
target_label = "namespace"
}
rule {
source_labels = ["__meta_kubernetes_pod_name"]
target_label = "pod"
}
rule {
source_labels = ["__meta_kubernetes_pod_container_name"]
target_label = "container"
}
rule {
source_labels = ["__meta_kubernetes_namespace", "__meta_kubernetes_pod_name"]
separator = "/"
target_label = "job"
}
rule {
source_labels = ["__meta_kubernetes_pod_uid", "__meta_kubernetes_pod_container_name"]
separator = "/"
action = "replace"
replacement = "/var/log/pods/*$1/*.log"
target_label = "__path__"
}
rule {
action = "replace"
source_labels = ["__meta_kubernetes_pod_container_id"]
regex = "^(\\w+):\\/\\/.+$"
replacement = "$1"
target_label = "tmp_container_runtime"
}
rule {
action = "replace"
source_labels = ["__meta_kubernetes_pod_ip"]
target_label = "ip"
}
rule {
action = "replace"
source_labels = ["__meta_kubernetes_pod_host_ip"]
target_label = "host_ip"
}
rule {
action = "replace"
source_labels = ["__meta_kubernetes_pod_node_name"]
target_label = "node_name"
}
rule {
action = "replace"
regex = "(.+)"
source_labels = ["__meta_kubernetes_pod_controller_kind"]
target_label = "controller_kind"
}
rule {
action = "replace"
regex = "(.+)"
source_labels = ["__meta_kubernetes_pod_controller_name"]
target_label = "controller"
}
rule {
action = "replace"
regex = "^(.+?)(-[a-z0-9]+){0,2}$"
source_labels = ["__meta_kubernetes_pod_name"]
target_label = "app"
}
rule {
action = "replace"
regex = "^(.+?)(-[a-z0-9]+){0,1}$"
source_labels = ["__meta_kubernetes_pod_controller_name"]
target_label = "app"
}
rule {
action = "replace"
regex = "(.+)"
source_labels = ["__meta_kubernetes_pod_label_app"]
target_label = "app"
}
rule {
action = "replace"
regex = "(.+)"
source_labels = ["__meta_kubernetes_pod_label_app_kubernetes_io_name"]
target_label = "app"
}
rule {
action = "replace"
regex = "(.+)"
source_labels = ["__meta_kubernetes_pod_label_component"]
target_label = "component"
}
rule {
action = "replace"
regex = "(.+)"
source_labels = ["__meta_kubernetes_pod_label_app_kubernetes_io_component"]
target_label = "component"
}
rule {
action = "replace"
regex = "(.+)"
source_labels = ["__meta_kubernetes_pod_label_app_kubernetes_io_version"]
target_label = "version"
}
rule {
action = "replace"
regex = "(.+)"
source_labels = ["__meta_kubernetes_pod_label_app_kubernetes_io_part_of"]
target_label = "part_of"
}
rule {
action = "replace"
regex = "(.+)"
source_labels = ["__meta_kubernetes_pod_label_app_kubernetes_io_instance"]
target_label = "instance"
}
}
local.file_match "pod_logs" {
path_targets = discovery.relabel.pod_logs.output
}
loki.source.file "pod_logs" {
targets = local.file_match.pod_logs.targets
forward_to = [loki.process.pod_logs.receiver]
}
// basic processing to parse the container format. You can add additional processing stages
// to match your application logs.
loki.process "pod_logs" {
stage.match {
selector = "{tmp_container_runtime=\"containerd\"}"
// the cri processing stage extracts the following k/v pairs: log, stream, time, flags
stage.cri {}
// Set the extracted flags and stream values as labels
stage.labels {
values = {
flags = "",
stream = "",
}
}
}
// drop the temporary container runtime label as it is no longer needed
stage.label_drop {
values = ["tmp_container_runtime", "filename"]
}
// Remove the trailing newlines from JSON.
// While trailing newlines are stripped by logcli, they would lead to intermittent empty lines for multiline logs.
stage.replace {
expression = "(\n)$"
replace = ""
}
stage.multiline {
firstline = "^\\x{200B}\\x{200D}\\x{200B}"
max_wait_time = "3s"
}
stage.metrics {
metric.counter {
name = "log_lines_total"
description = "total number of log lines"
prefix = "workload_"
match_all = true
action = "inc"
max_idle_duration = "1h"
}
metric.counter {
name = "raw_log_bytes_total"
description = "total bytes of log lines before truncating long lines"
prefix = "workload_"
match_all = true
count_entry_bytes = true
action = "add"
max_idle_duration = "1h"
}
}
/* //TODO enable this later
// Truncate log lines longer than 256KiB
stage.replace {
expression = ".+"
replace = "{{if gt (.Value | len) 262144 }}{{slice .Value 0 262144}}...(truncated after 262144 of {{(.Value | len)}} characters){{else}}{{.Value}}{{end}}"
}
*/
stage.metrics {
metric.counter {
name = "truncated_log_bytes_total"
description = "total bytes of log lines after tuncating long lines"
prefix = "workload_"
match_all = true
count_entry_bytes = true
action = "add"
max_idle_duration = "1h"
}
}
forward_to = [loki.write.default.receiver]
}
Logs
<<empty log, no errors or warnings in the logs>>
Not 100% sure yet, but it looks like I can provoke this by updating the configuration (ConfigMap): the resulting config reload seems to trigger the behavior described above. In my case I simply change logging.level back and forth between info and warn to make sure the configuration actually changes. It is still racy, as not all Pods of the DaemonSet become nonfunctional, but some do.
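In case it helps others reproduce it, this is roughly what I do (a sketch; the namespace, ConfigMap name and label selector are placeholders for my deployment):
# 1. Toggle logging.level between "info" and "warn" in the Alloy ConfigMap so the
#    config reloader picks up a changed configuration and triggers a reload.
kubectl -n <alloy-namespace> edit configmap <alloy-configmap>
# 2. Once the reload happened, probe /metrics of every DaemonSet Pod through the
#    API server's pod proxy; affected Pods return an error instead of metrics.
for pod in $(kubectl -n <alloy-namespace> get pods -l <alloy-label-selector> \
    -o jsonpath='{.items[*].metadata.name}'); do
  echo "== ${pod}"
  kubectl get --raw "/api/v1/namespaces/<alloy-namespace>/pods/${pod}:12345/proxy/metrics" | head -n 3
done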
As a workaround, we added a livenessProbe to the Alloy DaemonSet, which makes sure the alloy container is restarted if the metrics endpoint becomes unhealthy (starts responding with a 5XX status code):
containers:
- name: alloy
livenessProbe:
httpGet:
path: /metrics
port: 12345
scheme: HTTP
initialDelaySeconds: 30
timeoutSeconds: 2
periodSeconds: 30
successThreshold: 1
failureThreshold: 3
Have the same issue, but with a StatefulSet.
We're seeing the same issue with the /metrics endpoint when running Alloy 1.3.1-1 via a systemd service directly on servers (as a replacement for promtail).
It seems to happen in approximately 20-25% of alloy.service restarts across our fleet of servers.
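A rough way to see the failure rate on a single host is to restart the unit in a loop and check the /metrics status code after each restart (a sketch only; unit name and port are the defaults mentioned in this comment):
for i in $(seq 1 20); do
  systemctl restart alloy.service
  sleep 10   # give Alloy time to start up and serve /metrics
  code=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:12345/metrics)
  echo "restart ${i}: /metrics returned HTTP ${code}"
done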
As above, it looks to be complaining about metrics generated by stage.metrics rather than the built-in Alloy metrics, e.g.:
$ curl localhost:12345/metrics
An error has occurred while serving metrics:
2 error(s) occurred:
* collected metric "journald_severity_lines_total" { label:{name:"component_id" value:"loki.process.journal"} label:{name:"component_path" value:"/"} label:{name:"identifier" value:"dns"} label:{name:"job" value:"loki.source.journal.journal"} label:{name:"priority" value:"info"} label:{name:"unit" value:"dns.service"} counter:{value:1 created_timestamp:{seconds:1727273371 nanos:383254028}}} was collected before with the same name and label values
* collected metric "journald_severity_lines_total" { label:{name:"component_id" value:"loki.process.journal"} label:{name:"component_path" value:"/"} label:{name:"identifier" value:"dns"} label:{name:"job" value:"loki.source.journal.journal"} label:{name:"priority" value:"info"} label:{name:"unit" value:"dns.service"} counter:{value:5981 created_timestamp:{seconds:1727273371 nanos:384823601}}} was collected before with the same name and label values
And the relevant config file section:
loki.process "journal" {
forward_to = [loki.write.default.receiver]
stage.match {
selector = "{unit=~\".+\"}"
stage.label_drop {
values = ["cmdline", "exe", "priorityval"]
}
stage.metrics {
metric.counter {
name = "severity_lines_total"
description = "total number of log lines"
prefix = "journald_"
max_idle_duration = "24h"
match_all = true
action = "inc"
}
}
stage.labels {
values = {
cmdline = null,
exe = null,
priorityval = null,
}
}
}
}
Edit: I spotted this commit, where something similar was happening with loki.source.heroku: https://github.com/grafana/alloy/commit/266484c701e123326d508ececf8a9112a3351b6e
Actually, on further testing, mine seems to be broken by the second stage.labels stanza; changing it to the following fixes the issue on restart:
stage.labels {
values = {
cmdline = "cmdline",
exe = "exe",
priorityval = "priorityval",
}
}
This seems different enough from yours to be a separate problem; I will open a new issue.
Btw, this issue was already present in Promtail and was reported in this issue: https://github.com/grafana/loki/issues/8174...
~~@alex-berger Please report this issue to the Grafana Alloy repository.~~
Sorry, I came here from an issue in Loki, and was under the impression that I was looking at an issue in the Grafana/Loki repository.