franz-go icon indicating copy to clipboard operation
franz-go copied to clipboard

Align connects metrics to prevent no data scenarios

Open endorama opened this issue 2 years ago • 3 comments

Hello, while using this library and the kotel plugin for monitoring some consumers of Kafka clusters we noticed a pattern in the connection metrics that is creating some minor issues that we'd like to address.

We have an alert that monitors the value of messaging.kafka.connect_errors.count to inform us in case the number is out of the ordinary. We also want to alert in case this metric stop being sent by the consumers, as that means we have a blind spot in our monitoring that we need to address.

At the moment this metric is emitted only when connection errors happens, while a sibling metric in only emitted when the connection succeed.

https://github.com/twmb/franz-go/blob/a2d69ce0779058d0bd0fef5223ca81abf05b54fb/plugin/kotel/meter.go#L233-L249

Having 2 metrics means that both stop sending metrics at some point for an undefined period of time.

This is mostly not an issues for connects.count, as for working clusters it is emitted by kotel most of the time, but leaves wide gaps for connect_errors.count. From the monitoring tool, is impossible to establish during this gap if the metric stopped being received because there are no errors or because the instrumentation stopped working.

A possible solution for this issue but maintain the semantic, is to rely on a single metric with different attribute. For example, using messaging.kafka.connects.count with an outcome attribute with 2 values, success or failure.

Would you be open to accept a PR that changes the monitor to rely on messaging.kafka.connects.count as single metric with different attributes?

Thank you for reading this issue and for this library!

endorama avatar Jan 31 '24 16:01 endorama

I think adding attributes technically changes things? I don't know how monitoring otel works.

Does this same problem exist for write_errs and read_errs?

I think rather than change anything by default, another option could be introduced,

// WithMergedConnectsMeter merges the `messaging.kafka.connect_errors.count`
// meter into the `messaging.kafka.connect_errors.count` meter, adding an attribute
// "outcome" with the values "success" or "failure". This option may be used because
// ...
func WithMergedConnectsMeter() MeterOpt

twmb avatar Feb 07 '24 00:02 twmb

I think adding attributes technically changes things? I don't know how monitoring otel works.

I'm not sure I understand but let me try to clarify: a metric can have multiple attribute but would still be considered a single metric (source: OpenTelemetry Protocol data model).

Does this same problem exist for write_errs and read_errs?

I would say no because:

  • there are no "write" or "read" metrics to count successes.
  • write and read are 2 separate operations, and in the spirit of controlling metric cardinality it make sense to have them as 2 separate counters. (This opposed to connect and connectErrs where the operation is 1, connecting to the broker but represented by 2 metrics)

I think rather than change anything by default, another option could be introduced,

That would be fine for me and I can submit a PR so we can discuss the implementation. An alternative could be to consider this a breaking change and create a kotel/v2, duplicating the current plugin and applying the mentioned changes.

Let me know how would you prefer to move forward and thank you for looking into this.

endorama avatar Feb 09 '24 10:02 endorama

I'm not a big fan of v2, especially for only one small change. +1 to a PR

twmb avatar Feb 22 '24 20:02 twmb

Merged and tagged as plugin/kotel/v1.5.0, thanks!

twmb avatar Jun 08 '24 22:06 twmb