prometheus-k8s-operator
prometheus-k8s-operator copied to clipboard
[MetricsEndpointAggregator] Group name should be rendered using unit name, not app name
Bug Description
In #409 we changed the following signature:
-def _group_name(self, appname) -> str:
+def group_name(self, unit_name: str) -> str:
Note how the new signature expects to take a unit name instead of app name. That makes sense, because group names must be unique within a rules file, otherwise promtool complains.
So iiuc, we need to correct all the usages of that function to pass in a unit name instead of an app name:
https://github.com/canonical/prometheus-k8s-operator/blob/4ca83b670f72f964e60c0ce69bd0f66b47016aaa/lib/charms/prometheus_k8s/v0/prometheus_scrape.py#L1858
https://github.com/canonical/prometheus-k8s-operator/blob/4ca83b670f72f964e60c0ce69bd0f66b47016aaa/lib/charms/prometheus_k8s/v0/prometheus_scrape.py#L2120
https://github.com/canonical/prometheus-k8s-operator/blob/4ca83b670f72f964e60c0ce69bd0f66b47016aaa/lib/charms/prometheus_k8s/v0/prometheus_scrape.py#L2144
To Reproduce
Deploy prom + offer:
# k8s model
bundle: kubernetes
applications:
prom:
charm: prometheus-k8s
channel: edge
scale: 1
trust: true
--- # overlay.yaml
applications:
prom:
offers:
prom:
endpoints:
- metrics-endpoint
- receive-remote-write
acl:
admin: admin
Deploy cos-proxy + peripheral charms:
# lxd model
series: jammy
saas:
prom:
url: k8s:admin/cos.prom
applications:
cp:
charm: cos-proxy
channel: latest/stable
revision: 54
series: focal
num_units: 1
tg:
charm: telegraf
channel: latest/stable
revision: 75
ub:
charm: ubuntu
channel: latest/stable
revision: 24
series: focal
num_units: 1
relations:
- [cp:prometheus-rules, tg:prometheus-rules]
- [cp:prometheus-target, tg:prometheus-client]
- [cp:juju-info, tg:juju-info]
- [cp:downstream-prometheus-scrape, prom:metrics-endpoint]
- [tg:juju-info, ub:juju-info]
Now,
- Re-relate
tg - ub
- Re-relate
cp - prom
- Look at
juju show-unit prom/0
. You'll see that there are two duplicated group names with the same content. Confirm duplication withjuju debug-log -i unit-prom-0 --replay | grep "Validating"
. - (I haven't figured out yet why the first deployment is fine, but then any re-relate breaks it.)
Environment
- Juju 3.1.6
- Charms from edge, per bundles above
Relevant log output
unit-prom-0: 19:48:44.486 DEBUG unit.prom/0.juju-log metrics-endpoint:8: Validating the rules failed: b'error validating /tmp/tmpwl6ke64e/validate_rule.yaml: [226:3: groupname: "juju_testmodel_2a0f06a_tg_alert_rules" is repeated in the same file\ngithub.com/prometheus/prometheus/model/rulefmt.(*RuleGroups).Validate\n\t/home/runner/work/cos-tool/cos-tool/vendor/github.com/prometheus/prometheus/model/rulefmt/rulefmt.go:90\ngithub.com/prometheus/prometheus/model/rulefmt.Parse\n\t/home/runner/work/cos-tool/cos-tool/vendor/github.com/prometheus/prometheus/model/rulefmt/rulefmt.go:302\ngithub.com/canonical/cos-tool/pkg/tool.(*PromQL).ValidateRules\n\t/home/runner/work/cos-tool/cos-tool/pkg/tool/promql_transform.go:14\ngithub.com/canonical/cos-tool/cmd/root.glob..func2\n\t/home/runner/work/cos-tool/cos-tool/cmd/root/root.go:77\ngithub.com/urfave/cli/v2.(*Command).Run\n\t/home/runner/work/cos-tool/cos-tool/vendor/github.com/urfave/cli/v2/command.go:169\ngithub.com/urfave/cli/v2.(*App).RunContext\n\t/home/runner/work/cos-tool/cos-tool/vendor/github.com/urfave/cli/v2/app.go:341\ngithub.com/urfave/cli/v2.(*App).Run\n\t/home/runner/work/cos-tool/cos-tool/vendor/github.com/urfave/cli/v2/app.go:247\ngithub.com/canonical/cos-tool/cmd/root.Execute\n\t/home/runner/work/cos-tool/cos-tool/cmd/root/root.go:124\nmain.main\n\t/home/runner/work/cos-tool/cos-tool/main.go:9\nruntime.main\n\t/opt/hostedtoolcache/go/1.18.10/x64/src/runtime/proc.go:250\nruntime.goexit\n\t/opt/hostedtoolcache/go/1.18.10/x64/src/runtime/asm_amd64.s:1571]\n'
Additional context
Rules files from cos-proxy are named in the prometheus workload container after the upstream unit, for example telegraf - cos-proxy - prometheus
would be named after telegraf:
$ juju ssh -m k8s:admin/cos --container prometheus prom/0 ls /etc/prometheus/rules
juju_testmodel_2a0f06aa_tg_metrics-endpoint_8.rules
It's not enough for an alert group to be named differently: each alert definition must be unique on its own within the rule file, regardless of how it is nested under group names. Uniqueness is derived from alert name + alert labels. So the same alertname won't be flagged as duplicated, if the juju_unit
label in each is different.
Convert alert rules in relation data to yaml rule files:
juju show-unit prometheus/0 | yq '."prometheus/0".relation-info | .[] | select(.endpoint == "metrics-endpoint" and .related-endpoint == "downstream-prometheus-scrape") | .application-data.alert_rules' | jq | yq -p json -o yaml > alert.rules
The above would correspond to a single file on disk in the prometheus workload container.