prometheus-k8s-operator
prometheus-k8s-operator copied to clipboard
[MetricsEndpointAggregator] Group name should be rendered using unit name, not app name
Bug Description
In #409 we changed the following signature:
-def _group_name(self, appname) -> str:
+def group_name(self, unit_name: str) -> str:
Note how the new signature expects to take a unit name instead of app name. That makes sense, because group names must be unique within a rules file, otherwise promtool complains.
So iiuc, we need to correct all the usages of that function to pass in a unit name instead of an app name:
https://github.com/canonical/prometheus-k8s-operator/blob/4ca83b670f72f964e60c0ce69bd0f66b47016aaa/lib/charms/prometheus_k8s/v0/prometheus_scrape.py#L1858
https://github.com/canonical/prometheus-k8s-operator/blob/4ca83b670f72f964e60c0ce69bd0f66b47016aaa/lib/charms/prometheus_k8s/v0/prometheus_scrape.py#L2120
https://github.com/canonical/prometheus-k8s-operator/blob/4ca83b670f72f964e60c0ce69bd0f66b47016aaa/lib/charms/prometheus_k8s/v0/prometheus_scrape.py#L2144
To Reproduce
Deploy prom + offer:
# k8s model
bundle: kubernetes
applications:
prom:
charm: prometheus-k8s
channel: edge
scale: 1
trust: true
--- # overlay.yaml
applications:
prom:
offers:
prom:
endpoints:
- metrics-endpoint
- receive-remote-write
acl:
admin: admin
Deploy cos-proxy + peripheral charms:
# lxd model
series: jammy
saas:
prom:
url: k8s:admin/cos.prom
applications:
cp:
charm: cos-proxy
channel: latest/stable
revision: 54
series: focal
num_units: 1
tg:
charm: telegraf
channel: latest/stable
revision: 75
ub:
charm: ubuntu
channel: latest/stable
revision: 24
series: focal
num_units: 1
relations:
- [cp:prometheus-rules, tg:prometheus-rules]
- [cp:prometheus-target, tg:prometheus-client]
- [cp:juju-info, tg:juju-info]
- [cp:downstream-prometheus-scrape, prom:metrics-endpoint]
- [tg:juju-info, ub:juju-info]
Now,
- Re-relate
tg - ub
- Re-relate
cp - prom
- Look at
juju show-unit prom/0
. You'll see that there are two duplicated group names with the same content. Confirm duplication withjuju debug-log -i unit-prom-0 --replay | grep "Validating"
. - (I haven't figured out yet why the first deployment is fine, but then any re-relate breaks it.)
Environment
- Juju 3.1.6
- Charms from edge, per bundles above
Relevant log output
unit-prom-0: 19:48:44.486 DEBUG unit.prom/0.juju-log metrics-endpoint:8: Validating the rules failed: b'error validating /tmp/tmpwl6ke64e/validate_rule.yaml: [226:3: groupname: "juju_testmodel_2a0f06a_tg_alert_rules" is repeated in the same file\ngithub.com/prometheus/prometheus/model/rulefmt.(*RuleGroups).Validate\n\t/home/runner/work/cos-tool/cos-tool/vendor/github.com/prometheus/prometheus/model/rulefmt/rulefmt.go:90\ngithub.com/prometheus/prometheus/model/rulefmt.Parse\n\t/home/runner/work/cos-tool/cos-tool/vendor/github.com/prometheus/prometheus/model/rulefmt/rulefmt.go:302\ngithub.com/canonical/cos-tool/pkg/tool.(*PromQL).ValidateRules\n\t/home/runner/work/cos-tool/cos-tool/pkg/tool/promql_transform.go:14\ngithub.com/canonical/cos-tool/cmd/root.glob..func2\n\t/home/runner/work/cos-tool/cos-tool/cmd/root/root.go:77\ngithub.com/urfave/cli/v2.(*Command).Run\n\t/home/runner/work/cos-tool/cos-tool/vendor/github.com/urfave/cli/v2/command.go:169\ngithub.com/urfave/cli/v2.(*App).RunContext\n\t/home/runner/work/cos-tool/cos-tool/vendor/github.com/urfave/cli/v2/app.go:341\ngithub.com/urfave/cli/v2.(*App).Run\n\t/home/runner/work/cos-tool/cos-tool/vendor/github.com/urfave/cli/v2/app.go:247\ngithub.com/canonical/cos-tool/cmd/root.Execute\n\t/home/runner/work/cos-tool/cos-tool/cmd/root/root.go:124\nmain.main\n\t/home/runner/work/cos-tool/cos-tool/main.go:9\nruntime.main\n\t/opt/hostedtoolcache/go/1.18.10/x64/src/runtime/proc.go:250\nruntime.goexit\n\t/opt/hostedtoolcache/go/1.18.10/x64/src/runtime/asm_amd64.s:1571]\n'
Additional context
Rules files from cos-proxy are named in the prometheus workload container after the upstream unit, for example telegraf - cos-proxy - prometheus
would be named after telegraf:
$ juju ssh -m k8s:admin/cos --container prometheus prom/0 ls /etc/prometheus/rules
juju_testmodel_2a0f06aa_tg_metrics-endpoint_8.rules
It's not enough for an alert group to be named differently: each alert definition must be unique on its own within the rule file, regardless of how it is nested under group names. Uniqueness is derived from alert name + alert labels. So the same alertname won't be flagged as duplicated, if the juju_unit
label in each is different.
Convert alert rules in relation data to yaml rule files:
juju show-unit prometheus/0 | yq '."prometheus/0".relation-info | .[] | select(.endpoint == "metrics-endpoint" and .related-endpoint == "downstream-prometheus-scrape") | .application-data.alert_rules' | jq | yq -p json -o yaml > alert.rules
The above would correspond to a single file on disk in the prometheus workload container.
We don't think anyone is still relating cos-proxy to telegraf or filebeat. So while this is a bug, we might not need to fix it. We should ask known users if this bug is relevant to them and close it if it does not affect anyone.
Since MetricsEndpointAggregator
is only used in cos-proxy
we should probably do this. we should also checks if the same behavior is happening with nrpe
, if it doesn't we should close this ticket with no action.