Metrics not updated when a consumer group is not active
Hi @seglo.
I have the same problem as #36.
For example:
I have consumer groups that read from a topic.
When I stop reading from the topic, the chart shows no data.
When I run the consumer group, I see lag data on the chart.
This is very critical, because during this time the consumer lag is not monitored.
Any ideas?
Exporter version: 0.5.1

```
Poll interval: 10 seconds
Lookup table size: 8192
Prometheus metrics endpoint port: 8000
Admin client consumer group id: kafkalagexporter
Kafka client timeout: 10 seconds
Statically defined Clusters:
Cluster name: kafka_general
Cluster Kafka bootstrap brokers: broker-01:9092
Watchers:
Strimzi: false
2019-09-16 05:14:11,329 INFO c.l.k.KafkaClusterManager$ akka://kafka-lag-exporter/user - Cluster Added:
Cluster name: kafka_general
Cluster Kafka bootstrap brokers: broker-01:9092
2019-09-16 05:14:11,344 INFO c.l.k.ConsumerGroupCollector$ akka://kafka-lag-exporter/user/consumer-group-collector-kafka_general - Spawned ConsumerGroupCollector for cluster: kafka_general
2019-09-16 05:14:11,355 INFO c.l.k.ConsumerGroupCollector$ akka://kafka-lag-exporter/user/consumer-group-collector-kafka_general - Collecting offsets
2019-09-16 05:14:11,384 INFO o.a.k.c.admin.AdminClientConfig - AdminClientConfig values:
bootstrap.servers = [broker-01:9092]
client.dns.lookup = default
client.id =
connections.max.idle.ms = 300000
metadata.max.age.ms = 300000
metric.reporters = []
metrics.num.samples = 2
metrics.recording.level = INFO
metrics.sample.window.ms = 30000
receive.buffer.bytes = 65536
reconnect.backoff.max.ms = 1000
reconnect.backoff.ms = 50
request.timeout.ms = 10000
retries = 0
retry.backoff.ms = 1000
sasl.client.callback.handler.class = null
sasl.jaas.config = null
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.min.time.before.relogin = 60000
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
sasl.kerberos.ticket.renew.window.factor = 0.8
sasl.login.callback.handler.class = null
sasl.login.class = null
sasl.login.refresh.buffer.seconds = 300
sasl.login.refresh.min.period.seconds = 60
sasl.login.refresh.window.factor = 0.8
sasl.login.refresh.window.jitter = 0.05
sasl.mechanism = GSSAPI
security.protocol = PLAINTEXT
send.buffer.bytes = 131072
ssl.cipher.suites = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
ssl.endpoint.identification.algorithm = https
ssl.key.password = null
ssl.keymanager.algorithm = SunX509
ssl.keystore.location = null
ssl.keystore.password = null
ssl.keystore.type = JKS
ssl.protocol = TLS
ssl.provider = null
ssl.secure.random.implementation = null
ssl.trustmanager.algorithm = PKIX
ssl.truststore.location = null
ssl.truststore.password = null
ssl.truststore.type = JKS
2019-09-16 05:14:11,537 INFO o.a.kafka.common.utils.AppInfoParser - Kafka version: 2.2.1
2019-09-16 05:14:11,537 INFO o.a.kafka.common.utils.AppInfoParser - Kafka commitId: 55783d3133a5a49a
2019-09-16 05:14:12,105 INFO o.a.k.c.consumer.ConsumerConfig - ConsumerConfig values:
auto.commit.interval.ms = 5000
auto.offset.reset = latest
bootstrap.servers = [broker-01:9092]
check.crcs = true
client.dns.lookup = default
client.id =
connections.max.idle.ms = 540000
default.api.timeout.ms = 60000
enable.auto.commit = false
exclude.internal.topics = true
fetch.max.bytes = 52428800
fetch.max.wait.ms = 500
fetch.min.bytes = 1
group.id = kafkalagexporter
heartbeat.interval.ms = 3000
interceptor.classes = []
internal.leave.group.on.close = true
isolation.level = read_uncommitted
key.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer
max.partition.fetch.bytes = 1048576
max.poll.interval.ms = 300000
max.poll.records = 500
metadata.max.age.ms = 300000
metric.reporters = []
metrics.num.samples = 2
metrics.recording.level = INFO
metrics.sample.window.ms = 30000
partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor]
receive.buffer.bytes = 65536
reconnect.backoff.max.ms = 1000
reconnect.backoff.ms = 50
request.timeout.ms = 10000
retry.backoff.ms = 1000
sasl.client.callback.handler.class = null
sasl.jaas.config = null
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.min.time.before.relogin = 60000
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
sasl.kerberos.ticket.renew.window.factor = 0.8
sasl.login.callback.handler.class = null
sasl.login.class = null
sasl.login.refresh.buffer.seconds = 300
sasl.login.refresh.min.period.seconds = 60
sasl.login.refresh.window.factor = 0.8
sasl.login.refresh.window.jitter = 0.05
sasl.mechanism = GSSAPI
security.protocol = PLAINTEXT
send.buffer.bytes = 131072
session.timeout.ms = 10000
ssl.cipher.suites = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
ssl.endpoint.identification.algorithm = https
ssl.key.password = null
ssl.keymanager.algorithm = SunX509
ssl.keystore.location = null
ssl.keystore.password = null
ssl.keystore.type = JKS
ssl.protocol = TLS
ssl.provider = null
ssl.secure.random.implementation = null
ssl.trustmanager.algorithm = PKIX
ssl.truststore.location = null
ssl.truststore.password = null
ssl.truststore.type = JKS
value.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer
```
@efrikin When you say you stop reading, do you mean that as soon as the consumer group shuts down, the metrics are no longer reported?
As I described in #36, we only report data for groups that are returned by the Kafka AdminClient. Every poll interval we compare the groups returned in the last interval with those in the current interval, and unregister the metrics for any groups that no longer exist. This is done so we don't accumulate groups in an unbounded manner.
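Roughly, that per-interval diff looks like this (a sketch with illustrative names, not the exact collector code):

```scala
// Sketch of the per-interval eviction described above; names are
// illustrative and not the actual Kafka Lag Exporter internals.
def evictStaleGroups(
    lastGroups: Set[String],
    currentGroups: Set[String],
    unregister: String => Unit
): Unit =
  // Any group seen last interval but absent this interval is unregistered
  // immediately, which is why its metrics vanish as soon as the group stops.
  (lastGroups -- currentGroups).foreach(unregister)
```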
I can see how this could be a problem in some cases: even though a consumer group is no longer active, you may still want to know how far behind it is. In your use case, are the consumer groups shut down intentionally, or have they encountered an error? How long would you consider long enough for a group to be inactive before we should stop reporting its lag?
@seglo Thanks a lot for the answer.
> @efrikin When you say you stop reading, do you mean that as soon as the consumer group shuts down, the metrics are no longer reported?
Yes, of course. When the last consumer leaves the consumer group, the metrics are no longer reported and I see an empty chart (the Prometheus console returns no data).
> I can see how this could be a problem in some cases: even though a consumer group is no longer active, you may still want to know how far behind it is. In your use case, are the consumer groups shut down intentionally, or have they encountered an error?
My case is related to a production incident where a consumer group suddenly got stuck and we could not spot the ever-growing lag for that consumer group for several hours.
> How long would you consider long enough for a group to be inactive before we should stop reporting its lag?
I think this behavior should be exposed as a user configuration option, with a default of 5 minutes. This is similar to the Kafka broker property log.retention.hours, where users can tune the behavior to their needs.
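For example, a hypothetical setting like the one below (the key name is invented for illustration and is not an existing exporter option) could be read through Typesafe Config the way the exporter's other settings are:

```scala
import com.typesafe.config.ConfigFactory

// Hypothetical key; not an existing Kafka Lag Exporter setting.
val config = ConfigFactory.parseString(
  "kafka-lag-exporter.inactive-group-retention = 5 minutes"
)

// Typesafe Config parses human-readable durations such as "5 minutes".
val retention: java.time.Duration =
  config.getDuration("kafka-lag-exporter.inactive-group-retention")
```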
@efrikin Thanks for the reply.
> @efrikin When you say you stop reading, do you mean that as soon as the consumer group shuts down, the metrics are no longer reported?
> Yes, of course. When the last consumer leaves the consumer group, the metrics are no longer reported and I see an empty chart (the Prometheus console returns no data).
I see. I thought there was a longer grace period for the consumer group to stay active after the last member has left.
> How long would you consider long enough for a group to be inactive before we should stop reporting its lag?
> I think this behavior should be exposed as a user configuration option, with a default of 5 minutes. This is similar to the Kafka broker property log.retention.hours, where users can tune the behavior to their needs.
Yes. We could add a feature that retains group metadata for a configured interval of time after it has left the cache of the Consumer Group coordinator. If the group no longer exists in the next poll, we can continue calculating the lag based on the last consumed offsets. If the group becomes active again, we continue as normal; if it doesn't, then after that interval we remove it from the metrics endpoint. Perhaps a default of 30 minutes would be a good value to start with. The one caveat is that if Kafka Lag Exporter is started after a group is no longer active, it won't see that group until it becomes active again.
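Sketched roughly, assuming we keep a timestamped snapshot of each inactive group's last consumed offsets (all names here are illustrative, not actual exporter code):

```scala
import java.time.{Duration, Instant}

// Hypothetical record of a group that has left the coordinator cache.
final case class InactiveGroup(lastOffsets: Map[Int, Long], lastSeen: Instant)

def expireInactiveGroups(
    inactive: Map[String, InactiveGroup],
    retention: Duration,
    now: Instant,
    unregister: String => Unit
): Map[String, InactiveGroup] = {
  // Keep reporting lag from the last consumed offsets until the grace
  // period elapses, then unregister the group's metrics.
  val (keep, expire) = inactive.partition { case (_, g) =>
    Duration.between(g.lastSeen, now).compareTo(retention) < 0
  }
  expire.keys.foreach(unregister)
  keep
}
```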
@seglo Thanks for the reply.
> Yes. We could add a feature that retains group metadata for a configured interval of time after it has left the cache of the Consumer Group coordinator. If the group no longer exists in the next poll, we can continue calculating the lag based on the last consumed offsets. If the group becomes active again, we continue as normal; if it doesn't, then after that interval we remove it from the metrics endpoint.
This is good news.
> Perhaps a default of 30 minutes would be a good value to start with.
A default of 30 minutes is a great starting point.
> The one caveat is that if Kafka Lag Exporter is started after a group is no longer active, it won't see that group until it becomes active again.

This is not a problem. If the exporter is restarted, we will see it via other metrics.
Would it be possible to include this in the next release? I'd really appreciate that! Also, could you please let me know the date of the next release?
Thanks a lot!
@efrikin Thanks for clarifying. I will issue a release soon. There are several PRs in progress. I'll create a new issue for this one and work on it soon, unless someone else volunteers to do it first.
@seglo Thanks a lot. I really appreciate it!
Just to chime in here. I do agree that lag calculations may not be applicable beyond a certain time if there are no active consumer groups. However, the kafka_partition_*_offset metrics should be reported regardless of whether any consumer group is active. These metrics are not related to a consumer group but rather to producers, and we use them to ensure that new messages are arriving in the topic.
> These metrics are not related to a consumer group but rather to producers, and we use them to ensure that new messages are arriving in the topic.
The _offset metrics were originally exported because the data was already available while calculating group lag. That's why only partitions belonging to [active] groups are reported.
I understand the value you get from monitoring the latest offset of arbitrary partitions. It would require polling all topic partitions in a cluster, which could be many more partitions than desired, but if it were enabled through a feature flag I think it would be a fine addition.
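For illustration, the flag-gated poll could look something like this sketch using the plain Kafka consumer API (not the exporter's actual client code):

```scala
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

// Sketch: fetch the latest (end) offset of every partition in the cluster.
// This is the extra work the feature flag would opt into.
def latestOffsets(
    consumer: KafkaConsumer[Array[Byte], Array[Byte]]
): Map[TopicPartition, Long] = {
  val allPartitions = consumer
    .listTopics()
    .asScala
    .flatMap { case (topic, infos) =>
      infos.asScala.map(i => new TopicPartition(topic, i.partition()))
    }
    .toList
  consumer.endOffsets(allPartitions.asJava).asScala.map {
    case (tp, offset) => tp -> offset.longValue()
  }.toMap
}
```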