prometheus-k8s-operator icon indicating copy to clipboard operation
prometheus-k8s-operator copied to clipboard

[MetricsEndpointAggregator] Group name should be rendered using unit name, not app name

Open sed-i opened this issue 5 months ago • 1 comments

Bug Description

In #409 we changed the following signature:

-def _group_name(self, appname) -> str:
+def group_name(self, unit_name: str) -> str:

Note how the new signature expects to take a unit name instead of app name. That makes sense, because group names must be unique within a rules file, otherwise promtool complains.

So iiuc, we need to correct all the usages of that function to pass in a unit name instead of an app name:

https://github.com/canonical/prometheus-k8s-operator/blob/4ca83b670f72f964e60c0ce69bd0f66b47016aaa/lib/charms/prometheus_k8s/v0/prometheus_scrape.py#L1858

https://github.com/canonical/prometheus-k8s-operator/blob/4ca83b670f72f964e60c0ce69bd0f66b47016aaa/lib/charms/prometheus_k8s/v0/prometheus_scrape.py#L2120

https://github.com/canonical/prometheus-k8s-operator/blob/4ca83b670f72f964e60c0ce69bd0f66b47016aaa/lib/charms/prometheus_k8s/v0/prometheus_scrape.py#L2144

To Reproduce

Deploy prom + offer:

# k8s model
bundle: kubernetes
applications:
  prom:
    charm: prometheus-k8s
    channel: edge
    scale: 1
    trust: true
--- # overlay.yaml
applications:
  prom:
    offers:
      prom:
        endpoints:
        - metrics-endpoint
        - receive-remote-write
        acl:
          admin: admin

Deploy cos-proxy + peripheral charms:

# lxd model
series: jammy
saas:
  prom:
    url: k8s:admin/cos.prom
applications:
  cp:
    charm: cos-proxy
    channel: latest/stable
    revision: 54
    series: focal
    num_units: 1
  tg:
    charm: telegraf
    channel: latest/stable
    revision: 75
  ub:
    charm: ubuntu
    channel: latest/stable
    revision: 24
    series: focal
    num_units: 1
relations:
- [cp:prometheus-rules, tg:prometheus-rules]
- [cp:prometheus-target, tg:prometheus-client]
- [cp:juju-info, tg:juju-info]
- [cp:downstream-prometheus-scrape, prom:metrics-endpoint]
- [tg:juju-info, ub:juju-info]

Now,

  1. Re-relate tg - ub
  2. Re-relate cp - prom
  3. Look at juju show-unit prom/0. You'll see that there are two duplicated group names with the same content. Confirm duplication with juju debug-log -i unit-prom-0 --replay | grep "Validating".
  4. (I haven't figured out yet why the first deployment is fine, but then any re-relate breaks it.)

Environment

  • Juju 3.1.6
  • Charms from edge, per bundles above

Relevant log output

unit-prom-0: 19:48:44.486 DEBUG unit.prom/0.juju-log metrics-endpoint:8: Validating the rules failed: b'error validating /tmp/tmpwl6ke64e/validate_rule.yaml: [226:3: groupname: "juju_testmodel_2a0f06a_tg_alert_rules" is repeated in the same file\ngithub.com/prometheus/prometheus/model/rulefmt.(*RuleGroups).Validate\n\t/home/runner/work/cos-tool/cos-tool/vendor/github.com/prometheus/prometheus/model/rulefmt/rulefmt.go:90\ngithub.com/prometheus/prometheus/model/rulefmt.Parse\n\t/home/runner/work/cos-tool/cos-tool/vendor/github.com/prometheus/prometheus/model/rulefmt/rulefmt.go:302\ngithub.com/canonical/cos-tool/pkg/tool.(*PromQL).ValidateRules\n\t/home/runner/work/cos-tool/cos-tool/pkg/tool/promql_transform.go:14\ngithub.com/canonical/cos-tool/cmd/root.glob..func2\n\t/home/runner/work/cos-tool/cos-tool/cmd/root/root.go:77\ngithub.com/urfave/cli/v2.(*Command).Run\n\t/home/runner/work/cos-tool/cos-tool/vendor/github.com/urfave/cli/v2/command.go:169\ngithub.com/urfave/cli/v2.(*App).RunContext\n\t/home/runner/work/cos-tool/cos-tool/vendor/github.com/urfave/cli/v2/app.go:341\ngithub.com/urfave/cli/v2.(*App).Run\n\t/home/runner/work/cos-tool/cos-tool/vendor/github.com/urfave/cli/v2/app.go:247\ngithub.com/canonical/cos-tool/cmd/root.Execute\n\t/home/runner/work/cos-tool/cos-tool/cmd/root/root.go:124\nmain.main\n\t/home/runner/work/cos-tool/cos-tool/main.go:9\nruntime.main\n\t/opt/hostedtoolcache/go/1.18.10/x64/src/runtime/proc.go:250\nruntime.goexit\n\t/opt/hostedtoolcache/go/1.18.10/x64/src/runtime/asm_amd64.s:1571]\n'

Additional context

Rules files from cos-proxy are named in the prometheus workload container after the upstream unit, for example telegraf - cos-proxy - prometheus would be named after telegraf:

$ juju ssh -m k8s:admin/cos --container prometheus prom/0 ls /etc/prometheus/rules
juju_testmodel_2a0f06aa_tg_metrics-endpoint_8.rules

It's not enough for an alert group to be named differently: each alert definition must be unique on its own within the rule file, regardless of how it is nested under group names. Uniqueness is derived from alert name + alert labels. So the same alertname won't be flagged as duplicated, if the juju_unit label in each is different.


Convert alert rules in relation data to yaml rule files:

juju show-unit prometheus/0 | yq '."prometheus/0".relation-info | .[] | select(.endpoint == "metrics-endpoint" and .related-endpoint == "downstream-prometheus-scrape") | .application-data.alert_rules' | jq | yq -p json -o yaml > alert.rules

The above would correspond to a single file on disk in the prometheus workload container.

sed-i avatar Jan 31 '24 02:01 sed-i