
Provide operational insights on # of triggers per trigger type

Open tomkerkhove opened this issue 2 years ago • 14 comments

Proposal

Based on the current docs, we provide the following metrics today:

  • keda_metrics_adapter_scaler_error_totals - The total number of errors encountered for all scalers.
  • keda_metrics_adapter_scaled_object_error_totals - The number of errors that have occurred for each scaled object.
  • keda_metrics_adapter_scaler_errors - The number of errors that have occurred for each scaler.
  • keda_metrics_adapter_scaler_metrics_value - The current value for each scaler’s metric that would be used by the HPA in computing the target average.

Based on these, I don't see a good way to extend them with suitable labels. Another reason is that ScaledObjects/ScaledJobs with multiple triggers of the same type would not be reflected correctly.

That's why I propose introducing a keda_metrics_adapter_trigger_totals metric with a type label that holds the type of trigger being used, for example cron.
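
For illustration only, a scrape exposing such a metric could look like the following (the trigger types and counts here are made up):

```
# HELP keda_metrics_adapter_trigger_totals Total number of triggers per trigger type
# TYPE keda_metrics_adapter_trigger_totals gauge
keda_metrics_adapter_trigger_totals{type="cron"} 3
keda_metrics_adapter_trigger_totals{type="prometheus"} 7
```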

Use-Case

Gain insights on how many triggers are using a given trigger type.

This helps you better understand your autoscaling landscape and its dependencies.

Anything else?

No response

tomkerkhove avatar Sep 14 '22 07:09 tomkerkhove

Would like to hear some thoughts on this from @JorTurFer @zroubalik

v-shenoy avatar Sep 14 '22 08:09 v-shenoy

I think we can add these metrics about adoption. It's true that from a monitoring point of view they might not give a lot of information, but from an adoption point of view they do. I mean, as a cluster operator I might want to know the aggregated counts of trigger types so I can apply some improvements, e.g. if I only have 1 Azure scaler, maybe it doesn't make sense to use managed identities. This also gives useful information about what the cluster is used for.

In general, I don't have any problem with improving the metrics we offer, because they stay private inside your cluster and we don't collect them, so each admin can decide whether to collect them or not.

JorTurFer avatar Sep 14 '22 16:09 JorTurFer

I can work on implementing this.

v-shenoy avatar Sep 14 '22 19:09 v-shenoy

Reposting a query from Slack -

The value of this metric might change every time a ScaledObject / ScaledJob is reconciled, as their triggers might be changed by the user. Considering that the reconciliation happens in the operator, I am not sure how best to expose the Prometheus metrics through the metrics adapter. I need some guidance on this.

@zroubalik @tomkerkhove @JorTurFer

v-shenoy avatar Sep 26 '22 07:09 v-shenoy

I'd expose them on the operator instead of the metrics adapter, as this isn't related to the metrics adapter at all. So I'd introduce a new endpoint there.

tomkerkhove avatar Sep 26 '22 08:09 tomkerkhove

Well, if we're okay with exposing a new endpoint on the operator, then it shouldn't be a big task (I think?). Should the metric be named keda_operator_trigger_totals then?

v-shenoy avatar Sep 26 '22 08:09 v-shenoy

Please don't expose another endpoint in the operator; we already expose a single endpoint with the runtime metrics (operator-sdk does it). We recently talked about unifying them in the metrics server, so let's do it directly in this case.

JorTurFer avatar Sep 26 '22 08:09 JorTurFer

Basically, instead of starting another server, you should register the metrics with the already existing server via the Prometheus global registry: https://book.kubebuilder.io/reference/metrics.html#publishing-additional-metrics

JorTurFer avatar Sep 26 '22 08:09 JorTurFer
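
For context, the kubebuilder page linked above boils down to registering custom collectors with controller-runtime's global registry, which backs the metrics endpoint the operator already serves. A minimal sketch, assuming a hypothetical prommetrics package and the keda_operator_trigger_totals name floated earlier in the thread (this is not KEDA's actual code):

```go
package prommetrics

import (
	"github.com/prometheus/client_golang/prometheus"
	ctrlmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

// triggerTotals tracks the number of triggers per trigger type.
var triggerTotals = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "keda_operator_trigger_totals", // final name still under discussion
		Help: "Total number of triggers per trigger type",
	},
	[]string{"type"},
)

func init() {
	// ctrlmetrics.Registry backs the /metrics endpoint that operator-sdk /
	// controller-runtime already exposes, so no extra server is started.
	ctrlmetrics.Registry.MustRegister(triggerTotals)
}
```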

Basically, instead of starting another server, you should register the metrics with the already existing server via the Prometheus global registry: https://book.kubebuilder.io/reference/metrics.html#publishing-additional-metrics

The link mentions this -

You may then record metrics to those collectors from any part of your reconcile loop. These metrics can be evaluated from anywhere in the operator code.

But the reconcile loop runs in a separate pod, so this won't work in our case? Or am I misunderstanding something in the docs?

v-shenoy avatar Sep 26 '22 08:09 v-shenoy

You asked about adding this metric to the operator, and there you do have reconciliation loops xD In the metrics server we already have 2 Prometheus metric servers (which we should unify).

JorTurFer avatar Sep 26 '22 09:09 JorTurFer

Oh, so you're suggesting that we should reuse the Prometheus endpoint used for the runtime metrics in the operator pod, instead of exposing another one.

I just got confused and thought that you wanted to not expose the metrics in the operator and expose them in the adapter itself.

v-shenoy avatar Sep 26 '22 09:09 v-shenoy

Please don't expose another endpoint in the operator; we already expose a single endpoint with the runtime metrics (operator-sdk does it). We recently talked about unifying them in the metrics server, so let's do it directly in this case.

Agreed, sorry, I was referring more to exposing them on the operator than to introducing a new endpoint.

tomkerkhove avatar Sep 26 '22 09:09 tomkerkhove

Lost in translation xD I meant: let's reuse the current Prometheus endpoint in the operator and avoid duplication like in the metrics server.

JorTurFer avatar Sep 26 '22 09:09 JorTurFer
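
Putting the thread's conclusion together, here is a hedged sketch of how the operator's reconcile loops could keep the gauge from the earlier snippet in sync; the helper names are purely illustrative, and the bookkeeping for changed or deleted triggers is omitted:

```go
package prommetrics

// IncrementTrigger and DecrementTrigger would be called from the ScaledObject
// and ScaledJob reconcilers whenever a trigger of the given type is added or
// removed, so the totals reflect what is currently deployed in the cluster.
func IncrementTrigger(triggerType string) {
	triggerTotals.WithLabelValues(triggerType).Inc()
}

func DecrementTrigger(triggerType string) {
	triggerTotals.WithLabelValues(triggerType).Dec()
}
```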

Each one of us misunderstood the others 😄

v-shenoy avatar Sep 26 '22 09:09 v-shenoy