
Exporting the kafka_consumergroup_group_lag metric with NaN values makes sum() result in NaN

locmai opened this issue 3 years ago · 4 comments

Describe the bug
The exporter emits NaN values, which makes the Prometheus sum() over kafka_consumergroup_group_lag evaluate to NaN as well.
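For illustration, a single NaN series is enough to poison any aggregation over the metric (the exact query shape here is just an example):

sum(kafka_consumergroup_group_lag)

This returns NaN as soon as any series in the selected vector is NaN.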

Additional context
Hi, I came across several similar issues, for example https://github.com/prometheus/memcached_exporter/issues/38, where one of the Prometheus authors makes the same point himself.

I believe these lines are the source of the NaN values that are left behind, so could we safely remove them?

https://github.com/lightbend/kafka-lag-exporter/blob/97ee9d46b6cb2fd42674bb90da9638484150510d/src/main/scala/com/lightbend/kafkalagexporter/ConsumerGroupCollector.scala#L250-L258

Or we could mark them with an error label or something, so we could drop them with a relabel config.

locmai · Oct 11 '21 15:10

We started to observe this scenario too, going from version 0.6.5 to 0.6.8 :(

Is there any workaround available?

andreminelli · Jan 04 '22 12:01

By the way, maybe old client_ids that return NaN are being kept around? The image below shows client ids from a single partition of a topic; at that time we had 5 client processes "connected" to it. (The green and yellow lines are returning NaN.)

[Screenshot: per-client-id lag for one partition; the green and yellow series report NaN]

andreminelli · Jan 04 '22 12:01

@andreminelli one of our workarounds is to query with something like this:

sum(kafka_consumergroup_group_lag{} > 0)

So all of the NaN values get dropped by the > 0 comparison. We still get weird results in the dev environments, where too few data points remain once some turn into NaN, but the production graphs look great to us.
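A small variation, in case the zero-lag partitions matter too: in PromQL a comparison filter drops every sample for which the comparison is false, and a > or >= comparison against NaN is always false, so the following drops the NaN samples while keeping series whose lag is exactly 0 (just a sketch of the same workaround):

sum(kafka_consumergroup_group_lag >= 0)

With > 0, the zero-lag series disappear along with the NaN ones, which might be part of the odd-looking dev graphs.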

locmai · Jan 05 '22 03:01

@andreminelli one of our workarounds is to query with something like this:

sum(kafka_consumergroup_group_lag{} > 0)

Thanks, @locmai. One point I hadn't noted is that we use the kafka_consumergroup_group_max_lag_seconds metric rather than kafka_consumergroup_group_lag directly, and it seems to be affected whenever any topic has kafka_consumergroup_group_lag with NaN...
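If it helps as a stopgap, the same comparison trick should also hide the NaN samples of that metric at query time; this is only a sketch, and the aggregation and labels would need to match your dashboards:

kafka_consumergroup_group_max_lag_seconds >= 0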

I think my report is still related to this issue though.

andreminelli · Jan 05 '22 11:01