strimzi-kafka-operator
[Bug]: strimzi.resources metric is missing in new unidirectional topic operator
Bug Description
The metric is documented here: https://github.com/strimzi/proposals/blob/main/051-unidirectional-topic-operator.md#metrics
But it does not seem to have been carried over from the old operator. After upgrading, we are no longer able to see a strimzi_resource_state metric for each topic, as we had before.
Steps to reproduce
No response
Expected behavior
No response
Strimzi version
0.39.0
Kubernetes version
Kubernetes 1.27.7
Installation method
Yaml files
Infrastructure
Azure AKS
Configuration files and logs
No response
Additional context
No response
I seem to have it:
$ kubectl exec -ti my-cluster-entity-operator-6879c6b9d9-ccldp -c topic-operator -- curl -s localhost:8080/metrics | grep strimzi.resources
# HELP strimzi_resources Number of custom resources the operator sees
# TYPE strimzi_resources gauge
strimzi_resources{kind="KafkaTopic",namespace="myproject",selector="strimzi.io/cluster=my-cluster",} 1.0
Do you have any topics provisioned? We are seeing that this metric is no longer populated per topic as it was before.
We are using the metric to check for status != 1, together with the reason label, to monitor reconcile errors for topics.
I will try to provide more data next week.
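For context, our old alert was along these lines (a sketch; the exact label set of the old strimzi_resource_state gauge may differ from what I remember):

strimzi_resource_state{kind="KafkaTopic"} != 1

It fires per topic, and the reason label on the series tells us why the reconciliation failed without digging through logs.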
Ahh, ok. No, I do not have the per-topic metrics there. Just the counter. Not sure we want to keep these detailed metrics as they are hard to manage. But I guess that can be discussed when the issue is triaged.
I see your concern. At a minimum we need to be able to tell whether there are any reconciliation issues, and that is really hard to monitor with this feature removed. If we cannot use the metric together with its "reason" label, we would need to extract the reason from logs, which would be a pain.
Perhaps it could be a configurable option?
Triaged on 21.3.2024: @fvaleri is going to take a look at this one.
Hi @cthtrifork, thanks for raising this.
It was decided not to provide this metric with the UTO because it does not scale well (it adds one metric for each managed topic). Additionally, we don't have anything similar for the other operators.
If we cannot use the metric together with its "reason" label, we would need to extract the reason from logs, which would be a pain.
Why would you want to extract the reason from logs? My suggestion is to leverage the KT status. You can use the strimzi.reconciliations.failed metric to be alerted, and then run a kubectl command to detect the failed KTs. Alternatively, you can run the kubectl command periodically and send an alert when it finds something.
A command similar to this one:
$ kubectl get kt -o custom-columns=TOPIC:.metadata.name,REASON:.status.conditions[0].reason,MESSAGE:.status.conditions[0].message,READY:.status.conditions[0].status | grep False
t1 NotSupported Replication factor change not supported, but required for partitions [] False
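If you go the periodic route, a minimal sketch of the wrapper could look like this (the webhook URL is a placeholder, not something Strimzi provides):

#!/usr/bin/env sh
# List KafkaTopics whose first status condition is not True; grep exits non-zero when nothing matches.
failed=$(kubectl get kt -o custom-columns=TOPIC:.metadata.name,REASON:.status.conditions[0].reason,READY:.status.conditions[0].status --no-headers | grep False || true)
if [ -n "$failed" ]; then
  echo "Failed KafkaTopics:"
  echo "$failed"
  # Forward to your alerting system here, e.g.:
  # curl -X POST -d "$failed" https://alerts.example.com/webhook
fi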
Perhaps it could be a configurable option?
Personally, I don't like the idea because metrics are supposed to track the system behavior and performance, not the state of every single managed resource. The UTO has optional metrics to track internal operations that you can use for performance tests or troubleshooting, but they are aggregated.
That said, let's see what others think.
@scholzj @ppatierno @tombentley
Hey, for me the scaling is a valid point and should be considered, as @fvaleri wrote. But I also see the point of knowing from the alert itself which resource is having a problem. So, to (maybe) find a compromise, I would suggest using kube-state-metrics, as e.g. Flux does for every custom resource: https://fluxcd.io/flux/monitoring/metrics/. Using it could expose the metrics without the overhead inside the operator. It would also be a Kubernetes-native way of doing it, without hacks via kubectl.
I did not know that kube-state-metrics can be configured to monitor custom resources. But if it can do so, it sounds like it is just a question of someone putting it together and sharing/contributing the configuration?
So I agree with @fvaleri not to provide these metrics because of the scaling, but I also think that a contribution to Strimzi (maybe in the examples folder?) providing a configuration for kube-state-metrics could be a really interesting thing. The doc seems pretty straightforward: https://github.com/kubernetes/kube-state-metrics/blob/main/docs/customresourcestate-metrics.md
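For illustration, a minimal sketch of what such a configuration could look like for KafkaTopic (untested; the group/version and condition paths are assumed from the KafkaTopic CRD, and the field layout follows the kube-state-metrics custom-resource-state docs):

kind: CustomResourceStateMetrics
spec:
  resources:
    - groupVersionKind:
        group: kafka.strimzi.io
        version: v1beta2
        kind: KafkaTopic
      metricNamePrefix: strimzi_kafkatopic
      labelsFromPath:
        name: [metadata, name]
        namespace: [metadata, namespace]
      metrics:
        - name: status_condition
          help: "KafkaTopic status conditions"
          each:
            type: Gauge
            gauge:
              # One series per entry in .status.conditions; the condition's
              # "True"/"False" status string becomes the gauge value (1/0).
              path: [status, conditions]
              labelsFromPath:
                type: [type]
                reason: [reason]
              valueFrom: [status]

That would give a per-topic gauge with type and reason labels, exposed by kube-state-metrics instead of the operator.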
We can have a dedicated improvement issue or PR if you think the kube-state-metrics example is necessary, but I would close this bug report. Wdyt?
Yes, I will close this bug report. Thanks for assisting. We will look at kube-state-metrics!