failed to unregister all metrics from previous promtail. THIS IS A BUG
I want to use the `metrics` pipeline stage to count the number of error lines in the log. Below is my configuration:
```yaml
apiVersion: monitoring.grafana.com/v1alpha1
kind: PodLogs
metadata:
  labels:
    app.kubernetes.io/instance: loki
    app.kubernetes.io/name: loki
  name: mysql
  namespace: loki
spec:
  namespaceSelector:
    matchNames:
    - default
  selector:
    matchLabels:
      app: mysql
  relabelings:
  - action: replace
    sourceLabels:
    - __meta_kubernetes_pod_node_name
    targetLabel: __host__
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - action: replace
    replacement: mysql
    targetLabel: job
  - action: replace
    sourceLabels:
    - __meta_kubernetes_pod_container_name
    targetLabel: container
  - action: replace
    replacement: mysql
    targetLabel: cluster
  pipelineStages:
  - docker: {}
  - match:
      selector: '{container="mysql"} |~ ".*[ERROR].*"'
      stages: |
        - regex:
            expression: ".*(?P<error>[ERROR]).*"
        - metrics:
            error_log_lines_total:
              type: Counter
              description: The total number of error log lines
              source: error
              action: inc
```
After the loki-logs pod restarts, I get this error:

```
$ kubectl -n loki logs -f loki-logs-2snl9 grafana-agent
ts=2023-01-12T02:03:50.543250808Z caller=server.go:195 level=info msg="server listening on addresses" http=[::]:8080 grpc=127.0.0.1:12346 http_tls_enabled=false grpc_tls_enabled=false
ts=2023-01-12T02:03:50.579742291Z caller=node.go:85 level=info agent=prometheus component=cluster msg="applying config"
ts=2023-01-12T02:03:50.579889027Z caller=remote.go:180 level=info agent=prometheus component=cluster msg="not watching the KV, none set"
ts=2023-01-12T02:03:50.616783411Z caller=kubernetes.go:313 level=info component=logs logs_config=loki/loki component=discovery discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2023-01-12T02:03:50.973238948Z caller=kubernetes.go:313 level=info component=logs logs_config=loki/loki component=discovery discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2023-01-12T02:03:50Z level=info caller=traces/traces.go:143 msg="Traces Logger Initialized" component=traces
ts=2023-01-12T02:03:51.151720301Z caller=integrations.go:138 level=warn msg="integrations-next is enabled. integrations-next is subject to change"
ts=2023-01-12T02:03:51.220365818Z caller=reporter.go:107 level=info msg="running usage stats reporter"
ts=2023-01-12T02:03:55.336275999Z caller=entrypoint.go:246 level=info msg="reload of config file requested"
ts=2023-01-12T02:03:55.520499461Z caller=entrypoint.go:161 level=error msg="failed to update loki" err="failed to unregister all metrics from previous promtail. THIS IS A BUG"
ts=2023-01-12T02:03:55.520617191Z caller=entrypoint.go:257 level=error msg="failed to reload config file" err="changes did not apply successfully"
ts=2023-01-12T02:04:00.336698444Z caller=entrypoint.go:246 level=info msg="reload of config file requested"
```

When I delete the `metrics` pipeline stage, everything works normally. I deployed with Helm:

```shell
helm install --namespace=loki --values values.yaml loki loki-3.8.0.tgz
```
Please help me with this, thanks
@kuberxy could you please try v0.31.0 (which was released today) and let us know if the issue is gone for you?
This issue is not gone:

```
$ kubectl -n loki logs loki-logs-sbfp8 grafana-agent
ts=2023-02-02T07:51:04.895353776Z caller=server.go:191 level=info msg="server listening on addresses" http=[::]:8080 grpc=127.0.0.1:12346 http_tls_enabled=false grpc_tls_enabled=false
ts=2023-02-02T07:51:04.896050123Z caller=node.go:85 level=info agent=prometheus component=cluster msg="applying config"
ts=2023-02-02T07:51:04.896177889Z caller=remote.go:180 level=info agent=prometheus component=cluster msg="not watching the KV, none set"
ts=2023-02-02T07:51:04.906062725Z caller=promtail.go:123 level=info component=logs logs_config=loki/loki msg="Reloading configuration file" md5sum=6e7026368fd78b062c8c20e5ce9e9b8a
ts=2023-02-02T07:51:04.907204839Z caller=kubernetes.go:326 level=info component=logs logs_config=loki/loki component=discovery discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2023-02-02T07:51:04.908589284Z caller=zapadapter.go:78 level=info component=traces msg="Traces Logger Initialized"
ts=2023-02-02T07:51:04.908651471Z caller=integrations.go:156 level=warn msg="integrations-next is enabled. integrations-next is subject to change"
ts=2023-02-02T07:51:04.911183301Z caller=reporter.go:103 level=info msg="running usage stats reporter"
ts=2023-02-02T07:51:08.808643551Z caller=entrypoint.go:337 level=info msg="reload of config file requested"
ts=2023-02-02T07:51:08.846228489Z caller=entrypoint.go:167 level=error msg="failed to update loki" err="failed to unregister all metrics from previous promtail. THIS IS A BUG"
ts=2023-02-02T07:51:08.846331922Z caller=entrypoint.go:348 level=error msg="failed to reload config file" err="changes did not apply successfully"
ts=2023-02-02T07:51:13.805653552Z caller=entrypoint.go:337 level=info msg="reload of config file requested"
$ kubectl -n loki describe pod loki-grafana-agent-operator-5df4cf9c46-ttf6r
...
Containers:
grafana-agent-operator:
Container ID: docker://517a2247c825b95d8e67fb0e79d8f5ceac8b44817c2aba9da4951e9cd1c703c5
Image: docker.io/grafana/agent-operator:v0.31.0
Image ID: docker-pullable://grafana/agent-operator@sha256:01aace5fa7943dfaf1edc55e6562cd12e6888fa569ebb5a45962d3fef81407ac
Port: <none>
Host Port: <none>
Args:
--kubelet-service=default/kubelet
State: Running
Started: Thu, 02 Feb 2023 15:30:00 +0800
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-wgzs8 (ro)
Conditions:
...
```
I had a look at this today. Apparently Prometheus has a feature called "unchecked collectors". It is mentioned in their docs:
> Alternatively, you could return no Desc at all, which will mark the Collector “unchecked”. No checks are performed at registration time, but metric consistency will still be ensured at scrape time, i.e. any inconsistencies will lead to scrape errors. Thus, with unchecked Collectors, the responsibility to not collect metrics that lead to inconsistencies in the total scrape result lies with the implementer of the Collector.
The underlying loki library uses unchecked collectors, because its `Describe` implementation doesn't send any `Desc`. One problem with unchecked collectors is that even though `Register()` doesn't return any errors for them, `Unregister()` does fail for them.
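For context, here is a minimal standalone sketch (not Agent or loki code) of that behaviour: a collector whose `Describe` sends nothing registers without any checks, but the registry records nothing it can match later, so `Unregister` reports failure.

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// uncheckedCollector sends no Desc from Describe, which marks it "unchecked".
type uncheckedCollector struct{}

func (uncheckedCollector) Describe(chan<- *prometheus.Desc) {}

// Collect emits a constant metric purely for illustration.
func (uncheckedCollector) Collect(ch chan<- prometheus.Metric) {
	ch <- prometheus.MustNewConstMetric(
		prometheus.NewDesc("demo_unchecked_total", "illustrative metric", nil, nil),
		prometheus.CounterValue, 1,
	)
}

func main() {
	reg := prometheus.NewRegistry()
	c := uncheckedCollector{}

	// No checks are performed for unchecked collectors, so this succeeds.
	fmt.Println("register error:", reg.Register(c))

	// The registry has no descriptors recorded for c, so it cannot find it
	// again and Unregister reports false.
	fmt.Println("unregistered:", reg.Unregister(c))
}
```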
I tried to enhance `agent/pkg/util/unregisterer.go` with a function like this, so that it doesn't unregister unchecked collectors:

```go
func isUncheckedCollector(c prometheus.Collector) bool {
	descChan := make(chan *prometheus.Desc, 10)
	go func() {
		c.Describe(descChan)
		close(descChan)
	}()
	i := 0
	for range descChan {
		i++
	}
	return i == 0
}
```
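To show where such a check could slot in, here is a hypothetical teardown loop (not the actual `unregisterer.go` code; the function name and shape are assumptions) that skips unchecked collectors, since the registry can never match them:

```go
// Hypothetical illustration of using isUncheckedCollector during teardown.
func unregisterTracked(reg prometheus.Registerer, tracked []prometheus.Collector) bool {
	ok := true
	for _, c := range tracked {
		if isUncheckedCollector(c) {
			// Unregister always reports failure for unchecked collectors,
			// so skip them instead of treating that as an error.
			continue
		}
		if !reg.Unregister(c) {
			ok = false
		}
	}
	return ok
}
```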
Skipping unchecked collectors this way got rid of the error, but unfortunately the metric stopped working properly. It was still visible on `/metrics`, but it wouldn't increment properly even if new log lines came in. I suspect this is because the Prometheus registerer is still holding on to the old unchecked collector, and it is using it in `Gather` instead of the new one.
I suspect the reason unchecked collectors are used here is that, due to a `max_idle_duration` config entry, the metric might disappear during the runtime of the Agent. A `Describe` implementation should always send the same descriptions, and so wouldn't work well in this case.
I would like to try to get rid of the unchecked collectors in the loki code, and instead switch to normal collectors. There are several reasons:
- It will make it much easier to wipe out all old metrics during a config reload.
- It will be safer, because we will be able to ensure that an unchecked metric doesn't clash with a normal one.
- There are some edge cases with `max_idle_duration`. What happens if the metric is almost at the end of its `max_idle_duration`, but then the config is reloaded? It would be logical to reset our `max_idle_duration`. This might be harder to do when we use unchecked collectors.
- It's really not clear to me how to make Prometheus forget about the old unchecked Collector. The only solution I can think of is to somehow make it a no-op in our code when the config is refreshed, but this sounds really messy.
I tried using normal collectors such as `prometheus.Counter`, but it wasn't possible. Apparently, Prometheus registries don't allow registering the same metric multiple times if it has different const labels[^1]. I will keep using unchecked collectors, but I'll add a function like this to the `metricVec`:
```go
func (c *metricVec) DeleteAll() {
	c.mtx.Lock()
	defer c.mtx.Unlock()
	c.metrics = map[model.Fingerprint]prometheus.Metric{}
}
```
This will cause `Collect` to collect nothing, because all the metrics will be deleted.
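To make that concrete, here is a simplified, self-contained stand-in for the metric vector (not the real loki type); it only demonstrates that after `DeleteAll`, a subsequent `Collect` emits nothing:

```go
package main

import (
	"fmt"
	"sync"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/common/model"
)

// metricVec is a simplified stand-in for loki's type, used only to show the
// effect of DeleteAll on Collect.
type metricVec struct {
	mtx     sync.Mutex
	metrics map[model.Fingerprint]prometheus.Metric
}

// Describe sends nothing, mirroring the "unchecked" behaviour discussed above.
func (c *metricVec) Describe(ch chan<- *prometheus.Desc) {}

// Collect emits whatever is currently stored.
func (c *metricVec) Collect(ch chan<- prometheus.Metric) {
	c.mtx.Lock()
	defer c.mtx.Unlock()
	for _, m := range c.metrics {
		ch <- m
	}
}

// DeleteAll drops every stored metric, so subsequent Collect calls emit nothing.
func (c *metricVec) DeleteAll() {
	c.mtx.Lock()
	defer c.mtx.Unlock()
	c.metrics = map[model.Fingerprint]prometheus.Metric{}
}

func main() {
	v := &metricVec{metrics: map[model.Fingerprint]prometheus.Metric{
		model.Fingerprint(1): prometheus.MustNewConstMetric(
			prometheus.NewDesc("error_log_lines_total", "demo metric", nil, nil),
			prometheus.CounterValue, 42,
		),
	}}

	collect := func() int {
		ch := make(chan prometheus.Metric, 10)
		v.Collect(ch)
		close(ch)
		n := 0
		for range ch {
			n++
		}
		return n
	}

	fmt.Println("metrics before DeleteAll:", collect()) // 1
	v.DeleteAll()
	fmt.Println("metrics after DeleteAll:", collect()) // 0
}
```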
I just need to find a good enough place to call `DeleteAll()`. Using the `Stage` interface's `Run(chan Entry) chan Entry` function is not enough, because `Run` could be called multiple times during the lifetime of the `Pipeline`. In fact, it has to be, because the output is only made available in the return statement of `Run`. A better design would have been to have `Run` accept input and output channels, run it once asynchronously for the process lifetime, and have it clean up its state when the input channel is closed.
We need to persist the metrics between calls to `Run`. My plan for now is to:

- Add a new `Cleanup` function to the `Stage` interface:

  ```go
  type Stage interface {
      Name() string
      Run(chan Entry) chan Entry
      Cleanup()
  }
  ```

- Add a new `Cleanup` function to the `Pipeline` struct in `loki/clients/pkg/logentry/stages/pipeline.go`.
- Call the Pipeline `Cleanup` function whenever we create a new `loki.NewEntryHandler` in the Agent repo[^2]. This will then be called when we call `promtail.Shutdown()` (see the sketch after this list).
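A hedged sketch of what that plan could look like; the `stages` field name, the `Entry` stand-in, and the wiring are assumptions for illustration rather than the actual loki/agent code:

```go
// Entry is a stand-in for the real pipeline entry type.
type Entry struct{}

// Stage is the interface from the plan above, with the added Cleanup method.
type Stage interface {
	Name() string
	Run(chan Entry) chan Entry
	Cleanup()
}

// Pipeline holds its stages; the field name is assumed here.
type Pipeline struct {
	stages []Stage
}

// Cleanup fans out to every stage so each one can drop its state, e.g. the
// metrics stage calling DeleteAll() on its metric vectors.
func (p *Pipeline) Cleanup() {
	for _, s := range p.stages {
		s.Cleanup()
	}
}
```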
Also, I found another bug in the way these metrics are collected. Apparently, `Collect` needs to be called in order for the `max_idle_duration` cleanup to kick in. All metrics are collected, even the ones that were meant to expire a long time ago. So if your scrape interval is long compared to `max_idle_duration`, you will still end up collecting (once) all those metrics that were supposed to be deleted. I will fix this as well, but probably in a separate PR.
[^1]: I could have sworn that I've been able to register identical metrics with differing const labels in the past. But now I realise that back then the registry was returning errors, and I hadn't caught the error at first.

[^2]: I'm not sure if I will also update Promtail with this fix. While I was browsing the Promtail code, I think I came across a feature for reloading the config, so it might be an issue for Promtail too. For now I might just fix the Agent.