strimzi-kafka-operator
[Bug]: strimzi.resources metric is missing in new unidirectional topic operator
Bug Description
The metric is documented here: https://github.com/strimzi/proposals/blob/main/051-unidirectional-topic-operator.md#metrics
But it does not seem to have been carried over from the old operator. After upgrading, we are no longer able to see a strimzi_resource_state metric for each topic, as we had before.
Steps to reproduce
No response
Expected behavior
No response
Strimzi version
0.39.0
Kubernetes version
Kubernetes 1.27.7
Installation method
Yaml files
Infrastructure
Azure AKS
Configuration files and logs
No response
Additional context
No response
I seem to have it:
$ kubectl exec -ti my-cluster-entity-operator-6879c6b9d9-ccldp -c topic-operator -- curl -s localhost:8080/metrics | grep strimzi.resources
# HELP strimzi_resources Number of custom resources the operator sees
# TYPE strimzi_resources gauge
strimzi_resources{kind="KafkaTopic",namespace="myproject",selector="strimzi.io/cluster=my-cluster",} 1.0
Do you have any topics provisioned? We are seeing that this metric is no longer populated per topic as it was before.
We are using the metric to check for status != 1, together with the reason label, to monitor reconcile errors for topics.
I will try to provide more data next week.
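For context, our old alert was along these lines (a sketch; the exact label set of the old strimzi_resource_state gauge may differ from what I remember):

strimzi_resource_state{kind="KafkaTopic"} != 1

It fires per topic, and the reason label on the series tells us why the reconciliation failed without digging through logs.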
Ahh, ok. No, I do not have the per-topic metrics there. Just the counter. Not sure we want to keep these detailed metrics as they are hard to manage. But I guess that can be discussed when the issue is triaged.
I see your concern. At a minimum we need to be able to tell whether there are any reconciliation issues, and that is really hard to monitor with this feature removed. If we cannot use the metric together with its "reason" label, we would need to extract the reason from logs, which would be a pain.
Perhaps it could be a configurable option?
Triaged on 21.3.2024: @fvaleri is going to take a look at this one.
Hi @cthtrifork, thanks for raising this.
It was decided not to provide this metric with the UTO because it does not scale well (it adds one metric for each managed topic). Additionally, we don't have anything similar for the other operators.
If we cannot use the metric together with its "reason" label, we would need to extract the reason from logs, which would be a pain.
Why would you want to extract the reason from logs? My suggestion is to leverage the KT status. You can use the strimzi.reconciliations.failed metric to be alerted, and then run a kubectl command to detect the failed KTs. Alternatively, you can run the kubectl command periodically and send an alert when it finds something.
A command similar to this one:
$ kubectl get kt -o custom-columns=TOPIC:.metadata.name,REASON:.status.conditions[0].reason,MESSAGE:.status.conditions[0].message,READY:.status.conditions[0].status | grep False
t1 NotSupported Replication factor change not supported, but required for partitions [] False
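If you go the periodic route, a minimal sketch of the wrapper could look like this (the webhook URL is a placeholder, not something Strimzi provides):

#!/usr/bin/env sh
# List KafkaTopics whose first status condition is not True; grep exits non-zero when nothing matches.
failed=$(kubectl get kt -o custom-columns=TOPIC:.metadata.name,REASON:.status.conditions[0].reason,READY:.status.conditions[0].status --no-headers | grep False || true)
if [ -n "$failed" ]; then
  echo "Failed KafkaTopics:"
  echo "$failed"
  # Forward to your alerting system here, e.g.:
  # curl -X POST -d "$failed" https://alerts.example.com/webhook
fi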
Perhaps it could be a configurable option?
Personally, I don't like the idea because metrics are supposed to track the system behavior and performance, not the state of every single managed resource. The UTO has optional metrics to track internal operations that you can use for performance tests or troubleshooting, but they are aggregated.
That said, let's see what others think.
@scholzj @ppatierno @tombentley
Hey, for me the scaling is a valid point and should be considered, as @fvaleri wrote. But I also see the point of knowing from the alert itself which resource is having a problem. So, to (maybe) find a compromise, I would suggest using kube-state-metrics, as e.g. Flux does for every custom resource: https://fluxcd.io/flux/monitoring/metrics/. Using it could expose the metrics without the overhead inside the operator. It would also be a Kubernetes-native way of doing it, without hacks via kubectl.
I did not know that kube-state-metrics can be configured to monitor custom resources. But if it can do so, it sounds like it is just a question of someone putting it together and sharing/contributing the configuration?
So I agree with @fvaleri not to provide these metrics because of the scaling, but I also think that a contribution to Strimzi (maybe in the examples folder?) providing a configuration for kube-state-metrics could be a really interesting thing. The doc seems pretty straightforward: https://github.com/kubernetes/kube-state-metrics/blob/main/docs/customresourcestate-metrics.md
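For illustration, a minimal sketch of what such a configuration could look like for KafkaTopic (untested; the group/version and condition paths are assumed from the KafkaTopic CRD, and the field layout follows the kube-state-metrics custom-resource-state docs):

kind: CustomResourceStateMetrics
spec:
  resources:
    - groupVersionKind:
        group: kafka.strimzi.io
        version: v1beta2
        kind: KafkaTopic
      metricNamePrefix: strimzi_kafkatopic
      labelsFromPath:
        name: [metadata, name]
        namespace: [metadata, namespace]
      metrics:
        - name: status_condition
          help: "KafkaTopic status conditions"
          each:
            type: Gauge
            gauge:
              # One series per entry in .status.conditions; the condition's
              # "True"/"False" status string becomes the gauge value (1/0).
              path: [status, conditions]
              labelsFromPath:
                type: [type]
                reason: [reason]
              valueFrom: [status]

That would give a per-topic gauge with type and reason labels, exposed by kube-state-metrics instead of the operator.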
We can have a dedicated improvement issue or PR if you think the kube-state-metrics example is necessary, but I would close this bug report. Wdyt?
Yes, I will close this bug report. Thanks for assisting. We will look at kube-state-metrics!