Metrics not updated when a consumer group is not active
Hi @seglo.
I have the same problem as #36.
For example:
I have consumer groups that read from a topic.
When I stop reading from the topic, the chart shows no data.
When I run the consumer group, I see lag data on the chart.
This is very critical, because during this time the consumer lag is not monitored.
Any ideas?
Exporter version: 0.5.1

```
Poll interval: 10 seconds
Lookup table size: 8192
Prometheus metrics endpoint port: 8000
Admin client consumer group id: kafkalagexporter
Kafka client timeout: 10 seconds
Statically defined Clusters:
Cluster name: kafka_general
Cluster Kafka bootstrap brokers: broker-01:9092
Watchers:
Strimzi: false
2019-09-16 05:14:11,329 INFO c.l.k.KafkaClusterManager$ akka://kafka-lag-exporter/user - Cluster Added:
Cluster name: kafka_general
Cluster Kafka bootstrap brokers: broker-01:9092
2019-09-16 05:14:11,344 INFO c.l.k.ConsumerGroupCollector$ akka://kafka-lag-exporter/user/consumer-group-collector-kafka_general - Spawned ConsumerGroupCollector for cluster: kafka_general
2019-09-16 05:14:11,355 INFO c.l.k.ConsumerGroupCollector$ akka://kafka-lag-exporter/user/consumer-group-collector-kafka_general - Collecting offsets
2019-09-16 05:14:11,384 INFO o.a.k.c.admin.AdminClientConfig - AdminClientConfig values:
bootstrap.servers = [broker-01:9092]
client.dns.lookup = default
client.id =
connections.max.idle.ms = 300000
metadata.max.age.ms = 300000
metric.reporters = []
metrics.num.samples = 2
metrics.recording.level = INFO
metrics.sample.window.ms = 30000
receive.buffer.bytes = 65536
reconnect.backoff.max.ms = 1000
reconnect.backoff.ms = 50
request.timeout.ms = 10000
retries = 0
retry.backoff.ms = 1000
sasl.client.callback.handler.class = null
sasl.jaas.config = null
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.min.time.before.relogin = 60000
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
sasl.kerberos.ticket.renew.window.factor = 0.8
sasl.login.callback.handler.class = null
sasl.login.class = null
sasl.login.refresh.buffer.seconds = 300
sasl.login.refresh.min.period.seconds = 60
sasl.login.refresh.window.factor = 0.8
sasl.login.refresh.window.jitter = 0.05
sasl.mechanism = GSSAPI
security.protocol = PLAINTEXT
send.buffer.bytes = 131072
ssl.cipher.suites = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
ssl.endpoint.identification.algorithm = https
ssl.key.password = null
ssl.keymanager.algorithm = SunX509
ssl.keystore.location = null
ssl.keystore.password = null
ssl.keystore.type = JKS
ssl.protocol = TLS
ssl.provider = null
ssl.secure.random.implementation = null
ssl.trustmanager.algorithm = PKIX
ssl.truststore.location = null
ssl.truststore.password = null
ssl.truststore.type = JKS
2019-09-16 05:14:11,537 INFO o.a.kafka.common.utils.AppInfoParser - Kafka version: 2.2.1
2019-09-16 05:14:11,537 INFO o.a.kafka.common.utils.AppInfoParser - Kafka commitId: 55783d3133a5a49a
2019-09-16 05:14:12,105 INFO o.a.k.c.consumer.ConsumerConfig - ConsumerConfig values:
auto.commit.interval.ms = 5000
auto.offset.reset = latest
bootstrap.servers = [broker-01:9092]
check.crcs = true
client.dns.lookup = default
client.id =
connections.max.idle.ms = 540000
default.api.timeout.ms = 60000
enable.auto.commit = false
exclude.internal.topics = true
fetch.max.bytes = 52428800
fetch.max.wait.ms = 500
fetch.min.bytes = 1
group.id = kafkalagexporter
heartbeat.interval.ms = 3000
interceptor.classes = []
internal.leave.group.on.close = true
isolation.level = read_uncommitted
key.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer
max.partition.fetch.bytes = 1048576
max.poll.interval.ms = 300000
max.poll.records = 500
metadata.max.age.ms = 300000
metric.reporters = []
metrics.num.samples = 2
metrics.recording.level = INFO
metrics.sample.window.ms = 30000
partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor]
receive.buffer.bytes = 65536
reconnect.backoff.max.ms = 1000
reconnect.backoff.ms = 50
request.timeout.ms = 10000
retry.backoff.ms = 1000
sasl.client.callback.handler.class = null
sasl.jaas.config = null
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.min.time.before.relogin = 60000
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
sasl.kerberos.ticket.renew.window.factor = 0.8
sasl.login.callback.handler.class = null
sasl.login.class = null
sasl.login.refresh.buffer.seconds = 300
sasl.login.refresh.min.period.seconds = 60
sasl.login.refresh.window.factor = 0.8
sasl.login.refresh.window.jitter = 0.05
sasl.mechanism = GSSAPI
security.protocol = PLAINTEXT
send.buffer.bytes = 131072
session.timeout.ms = 10000
ssl.cipher.suites = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
ssl.endpoint.identification.algorithm = https
ssl.key.password = null
ssl.keymanager.algorithm = SunX509
ssl.keystore.location = null
ssl.keystore.password = null
ssl.keystore.type = JKS
ssl.protocol = TLS
ssl.provider = null
ssl.secure.random.implementation = null
ssl.trustmanager.algorithm = PKIX
ssl.truststore.location = null
ssl.truststore.password = null
ssl.truststore.type = JKS
value.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer
```
@efrikin When you say you stop reading, do you mean that as soon as the consumer group shuts down, the metrics are no longer reported?
As I described in #36, we only report data for groups that are returned by the Kafka AdminClient. Every poll interval we compare the groups returned in the last interval with those in the current interval, and unregister the metrics for any groups that no longer exist. This is done so we don't accumulate groups in an unbounded manner.
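Roughly, that per-interval diff looks like this (a sketch with illustrative names, not the exact collector code):

```scala
// Sketch of the per-interval eviction described above; names are
// illustrative and not the actual Kafka Lag Exporter internals.
def evictStaleGroups(
    lastGroups: Set[String],
    currentGroups: Set[String],
    unregister: String => Unit
): Unit =
  // Any group seen last interval but absent this interval is unregistered
  // immediately, which is why its metrics vanish as soon as the group stops.
  (lastGroups -- currentGroups).foreach(unregister)
```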
I can see how this could be a problem in some cases: even though a consumer group is no longer active, you may still want to know how far behind it is. In your use case, are the consumer groups shut down intentionally, or have they encountered an error? How long would you consider long enough for a group to be inactive before we should stop reporting its lag?
@seglo Thanks a lot for the answer.
> @efrikin When you say you stop reading, do you mean that as soon as the consumer group shuts down, the metrics are no longer reported?
Yes, of course. When the last consumer leaves the consumer group, the metrics are no longer reported and I see an empty chart (the Prometheus console returns no data).
> I can see how this could be a problem in some cases: even though a consumer group is no longer active, you may still want to know how far behind it is. In your use case, are the consumer groups shut down intentionally, or have they encountered an error?
My case is related to a production incident where a consumer group suddenly got stuck and we could not spot the ever-growing lag for that consumer group for several hours.
> How long would you consider long enough for a group to be inactive before we should stop reporting its lag?
I think this behavior should be exposed as a user configuration option, with a default of 5 minutes. This is similar to the Kafka broker property log.retention.hours, where users can tune the behavior to their needs.
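For example, a hypothetical setting like the one below (the key name is invented for illustration and is not an existing exporter option) could be read through Typesafe Config the way the exporter's other settings are:

```scala
import com.typesafe.config.ConfigFactory

// Hypothetical key; not an existing Kafka Lag Exporter setting.
val config = ConfigFactory.parseString(
  "kafka-lag-exporter.inactive-group-retention = 5 minutes"
)

// Typesafe Config parses human-readable durations such as "5 minutes".
val retention: java.time.Duration =
  config.getDuration("kafka-lag-exporter.inactive-group-retention")
```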
@efrikin Thanks for the reply.
> @efrikin When you say you stop reading, do you mean that as soon as the consumer group shuts down, the metrics are no longer reported?
> Yes, of course. When the last consumer leaves the consumer group, the metrics are no longer reported and I see an empty chart (the Prometheus console returns no data).
I see. I thought there was a longer grace period for the consumer group to stay active after the last member has left.
> How long would you consider long enough for a group to be inactive before we should stop reporting its lag?
> I think this behavior should be exposed as a user configuration option, with a default of 5 minutes. This is similar to the Kafka broker property log.retention.hours, where users can tune the behavior to their needs.
Yes. We could add a feature that retains group metadata for a configured interval of time after it has left the cache of the Consumer Group coordinator. If the group no longer exists in the next poll, we can continue calculating the lag based on the last consumed offsets. If the group becomes active again, we continue as normal; if it doesn't, then after that interval we remove it from the metrics endpoint. Perhaps a default of 30 minutes would be a good value to start with. The one caveat is that if Kafka Lag Exporter is started after a group is no longer active, it won't see that group until it becomes active again.
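Sketched roughly, assuming we keep a timestamped snapshot of each inactive group's last consumed offsets (all names here are illustrative, not actual exporter code):

```scala
import java.time.{Duration, Instant}

// Hypothetical record of a group that has left the coordinator cache.
final case class InactiveGroup(lastOffsets: Map[Int, Long], lastSeen: Instant)

def expireInactiveGroups(
    inactive: Map[String, InactiveGroup],
    retention: Duration,
    now: Instant,
    unregister: String => Unit
): Map[String, InactiveGroup] = {
  // Keep reporting lag from the last consumed offsets until the grace
  // period elapses, then unregister the group's metrics.
  val (keep, expire) = inactive.partition { case (_, g) =>
    Duration.between(g.lastSeen, now).compareTo(retention) < 0
  }
  expire.keys.foreach(unregister)
  keep
}
```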
@seglo Thanks for the reply.
> Yes. We could add a feature that retains group metadata for a configured interval of time after it has left the cache of the Consumer Group coordinator. If the group no longer exists in the next poll, we can continue calculating the lag based on the last consumed offsets. If the group becomes active again, we continue as normal; if it doesn't, then after that interval we remove it from the metrics endpoint.
This is good news.
> Perhaps a default of 30 minutes would be a good value to start with.
A default of 30 minutes is a great starting point.
> The one caveat is that if Kafka Lag Exporter is started after a group is no longer active, it won't see that group until it becomes active again.

This is not a problem. If the exporter is restarted, we will see it via other metrics.
Would it be possible to include this in the next release? I'd really appreciate that! Also, could you please let me know the date of the next release?
Thanks a lot!
@efrikin Thanks for clarifying. I will issue a release soon. There are several PRs in progress. I'll create a new issue for this one and work on it soon, unless someone else volunteers to do it first.
@seglo Thanks a lot. I really appreciate it!
Just to chime in here. I do agree that lag calculations may not be applicable beyond a certain time if there are no active consumer groups. However, the kafka_partition_*_offset metrics should be reported regardless of whether any consumer group is active. These metrics are not related to a consumer group but rather to producers, and we use them to ensure that new messages are arriving in the topic.
> These metrics are not related to a consumer group but rather to producers, and we use them to ensure that new messages are arriving in the topic.
The _offset metrics were originally exported because the data was already available while calculating group lag. That's why only partitions belonging to [active] groups are reported.
I understand the value you get from monitoring the latest offset of arbitrary partitions. It would require polling all topic partitions in a cluster, which could be many more partitions than desired, but if it were enabled through a feature flag I think it would be a fine addition.
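For illustration, the flag-gated poll could look something like this sketch using the plain Kafka consumer API (not the exporter's actual client code):

```scala
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

// Sketch: fetch the latest (end) offset of every partition in the cluster.
// This is the extra work the feature flag would opt into.
def latestOffsets(
    consumer: KafkaConsumer[Array[Byte], Array[Byte]]
): Map[TopicPartition, Long] = {
  val allPartitions = consumer
    .listTopics()
    .asScala
    .flatMap { case (topic, infos) =>
      infos.asScala.map(i => new TopicPartition(topic, i.partition()))
    }
    .toList
  consumer.endOffsets(allPartitions.asJava).asScala.map {
    case (tp, offset) => tp -> offset.longValue()
  }.toMap
}
```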