kafka-lag-exporter
Exporting the kafka_consumergroup_group_lag metric with NaN values makes sum() result in NaN
Describe the bug The exporter emits NaN values for kafka_consumergroup_group_lag, which makes a Prometheus sum() over that metric return NaN as well.
Additional context Hi, I came across several similar issues, e.g. https://github.com/prometheus/memcached_exporter/issues/38 , and the author of Prometheus weighed in on this himself.
I believe these lines are the source of the leftover NaN values; could we safely remove them:
https://github.com/lightbend/kafka-lag-exporter/blob/97ee9d46b6cb2fd42674bb90da9638484150510d/src/main/scala/com/lightbend/kafkalagexporter/ConsumerGroupCollector.scala#L250-L258
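For context, here is a minimal sketch of how I understand the problem (plain Scala, not the actual ConsumerGroupCollector code; partitionLag and the offsets are made up for illustration): once one partition reports Double.NaN, every aggregate built on top of it becomes NaN too.

```scala
object NaNLagSketch {
  // Hypothetical helper, not the real exporter logic: if the last consumed
  // offset can't be resolved, the lag is reported as Double.NaN instead of
  // the data point being skipped.
  def partitionLag(lastConsumedOffset: Option[Long], latestOffset: Long): Double =
    lastConsumedOffset match {
      case Some(offset) => (latestOffset - offset).toDouble
      case None         => Double.NaN // this sentinel ends up in the exported gauge
    }

  def main(args: Array[String]): Unit = {
    val lags = List(partitionLag(Some(88L), 100L), partitionLag(None, 100L))
    println(lags)     // List(12.0, NaN)
    println(lags.sum) // NaN -- one NaN poisons the whole sum, same as sum() in Prometheus
  }
}
```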
Or we could mark them with an error label or something, so we could drop them with a relabel config.
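To make that suggestion concrete (purely illustrative; error="true" is a hypothetical label the exporter does not set today), such samples could then be dropped with a metric_relabel_configs rule in the scrape config, or filtered at query time with something like:
sum(kafka_consumergroup_group_lag{error!="true"})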
We started to observe this scenario too, from version 0.6.5 to 0.6.8 :(
Is there any workaround available?
BTW, maybe old client_ids that return NaN are being kept? The image below shows client ids from a single partition of a topic; at that time we had 5 client processes "connected" to this topic. (The green and yellow lines are returning NaN.)
@andreminelli one of our workarounds is to query with something like this:
sum(kafka_consumergroup_group_lag{} > 0)
So all of the NaN values get dropped by the > 0 filter, since a comparison against NaN never matches. We still get some weird results in the dev environments, caused by the few data points that turn into NaN; the production graphs look great to us.
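One small caveat (just my assumption about other setups, we haven't needed this ourselves): > 0 also drops series whose lag is legitimately 0, so a >= 0 comparison can be used instead to keep those while still filtering out the NaN series:
sum(kafka_consumergroup_group_lag{} >= 0)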
Thanks, @locmai . One point I hadn't noted is that we use the kafka_consumergroup_group_max_lag_seconds metric, which seems to be affected by any topic having kafka_consumergroup_group_lag with NaN; we don't use kafka_consumergroup_group_lag directly...
I think my report is still related to this issue though.