prometheus-k8s-operator [MetricsEndpointAggregator] Group name should be rendered using unit name, not app name

Bug Description

In #409 we changed the following signature:

-def _group_name(self, appname) -> str:
+def group_name(self, unit_name: str) -> str:

Note how the new signature expects to take a unit name instead of app name. That makes sense, because group names must be unique within a rules file, otherwise promtool complains.

So iiuc, we need to correct all the usages of that function to pass in a unit name instead of an app name:

https://github.com/canonical/prometheus-k8s-operator/blob/4ca83b670f72f964e60c0ce69bd0f66b47016aaa/lib/charms/prometheus_k8s/v0/prometheus_scrape.py#L1858

https://github.com/canonical/prometheus-k8s-operator/blob/4ca83b670f72f964e60c0ce69bd0f66b47016aaa/lib/charms/prometheus_k8s/v0/prometheus_scrape.py#L2120

https://github.com/canonical/prometheus-k8s-operator/blob/4ca83b670f72f964e60c0ce69bd0f66b47016aaa/lib/charms/prometheus_k8s/v0/prometheus_scrape.py#L2144

To Reproduce

Deploy prom + offer:

# k8s model
bundle: kubernetes
applications:
  prom:
    charm: prometheus-k8s
    channel: edge
    scale: 1
    trust: true
--- # overlay.yaml
applications:
  prom:
    offers:
      prom:
        endpoints:
        - metrics-endpoint
        - receive-remote-write
        acl:
          admin: admin

Deploy cos-proxy + peripheral charms:

# lxd model
series: jammy
saas:
  prom:
    url: k8s:admin/cos.prom
applications:
  cp:
    charm: cos-proxy
    channel: latest/stable
    revision: 54
    series: focal
    num_units: 1
  tg:
    charm: telegraf
    channel: latest/stable
    revision: 75
  ub:
    charm: ubuntu
    channel: latest/stable
    revision: 24
    series: focal
    num_units: 1
relations:
- [cp:prometheus-rules, tg:prometheus-rules]
- [cp:prometheus-target, tg:prometheus-client]
- [cp:juju-info, tg:juju-info]
- [cp:downstream-prometheus-scrape, prom:metrics-endpoint]
- [tg:juju-info, ub:juju-info]

Now,

Re-relate tg - ub
Re-relate cp - prom
Look at juju show-unit prom/0. You'll see that there are two duplicated group names with the same content. Confirm duplication with juju debug-log -i unit-prom-0 --replay | grep "Validating".
(I haven't figured out yet why the first deployment is fine, but then any re-relate breaks it.)

Environment

Juju 3.1.6
Charms from edge, per bundles above

Relevant log output

unit-prom-0: 19:48:44.486 DEBUG unit.prom/0.juju-log metrics-endpoint:8: Validating the rules failed: b'error validating /tmp/tmpwl6ke64e/validate_rule.yaml: [226:3: groupname: "juju_testmodel_2a0f06a_tg_alert_rules" is repeated in the same file\ngithub.com/prometheus/prometheus/model/rulefmt.(*RuleGroups).Validate\n\t/home/runner/work/cos-tool/cos-tool/vendor/github.com/prometheus/prometheus/model/rulefmt/rulefmt.go:90\ngithub.com/prometheus/prometheus/model/rulefmt.Parse\n\t/home/runner/work/cos-tool/cos-tool/vendor/github.com/prometheus/prometheus/model/rulefmt/rulefmt.go:302\ngithub.com/canonical/cos-tool/pkg/tool.(*PromQL).ValidateRules\n\t/home/runner/work/cos-tool/cos-tool/pkg/tool/promql_transform.go:14\ngithub.com/canonical/cos-tool/cmd/root.glob..func2\n\t/home/runner/work/cos-tool/cos-tool/cmd/root/root.go:77\ngithub.com/urfave/cli/v2.(*Command).Run\n\t/home/runner/work/cos-tool/cos-tool/vendor/github.com/urfave/cli/v2/command.go:169\ngithub.com/urfave/cli/v2.(*App).RunContext\n\t/home/runner/work/cos-tool/cos-tool/vendor/github.com/urfave/cli/v2/app.go:341\ngithub.com/urfave/cli/v2.(*App).Run\n\t/home/runner/work/cos-tool/cos-tool/vendor/github.com/urfave/cli/v2/app.go:247\ngithub.com/canonical/cos-tool/cmd/root.Execute\n\t/home/runner/work/cos-tool/cos-tool/cmd/root/root.go:124\nmain.main\n\t/home/runner/work/cos-tool/cos-tool/main.go:9\nruntime.main\n\t/opt/hostedtoolcache/go/1.18.10/x64/src/runtime/proc.go:250\nruntime.goexit\n\t/opt/hostedtoolcache/go/1.18.10/x64/src/runtime/asm_amd64.s:1571]\n'

Additional context

Rules files from cos-proxy are named in the prometheus workload container after the upstream unit, for example telegraf - cos-proxy - prometheus would be named after telegraf:

$ juju ssh -m k8s:admin/cos --container prometheus prom/0 ls /etc/prometheus/rules
juju_testmodel_2a0f06aa_tg_metrics-endpoint_8.rules

It's not enough for an alert group to be named differently: each alert definition must be unique on its own within the rule file, regardless of how it is nested under group names. Uniqueness is derived from alert name + alert labels. So the same alertname won't be flagged as duplicated, if the juju_unit label in each is different.

Convert alert rules in relation data to yaml rule files:

juju show-unit prometheus/0 | yq '."prometheus/0".relation-info | .[] | select(.endpoint == "metrics-endpoint" and .related-endpoint == "downstream-prometheus-scrape") | .application-data.alert_rules' | jq | yq -p json -o yaml > alert.rules

The above would correspond to a single file on disk in the prometheus workload container.

Jan 31 '24 02:01 sed-i

We don't think anyone is still relating cos-proxy to telegraf or filebeat. So while this is a bug, we might not need to fix it. We should ask known users if this bug is relevant to them and close it if it does not affect anyone.

Feb 15 '24 14:02 dstathis

Since MetricsEndpointAggregator is only used in cos-proxy we should probably do this. we should also checks if the same behavior is happening with nrpe, if it doesn't we should close this ticket with no action.

Jul 18 '24 13:07 IbraAoad

prometheus-k8s-operator prometheus-k8s-operator copied to clipboard

[MetricsEndpointAggregator] Group name should be rendered using unit name, not app name

Bug Description

To Reproduce

Environment

Relevant log output

Additional context

prometheus-k8s-operator
prometheus-k8s-operator copied to clipboard