kafka-lag-exporter

Missing data in prometheus scrapes

Open · ryan-dyer-sp opened this issue 4 years ago · 8 comments

Describe the bug
Not sure what may be happening, so reaching out. I've got a 1h graph of kafka_consumergroup_group_max_lag_seconds and kafka_consumergroup_group_max_lag. In the seconds graph there are chunks of data missing. The exporter is set to poll every 60 seconds with a 20s timeout, and the ServiceMonitor is set to scrape the exporter every 30s. As our HPA relies on the seconds metric, the missing data results in the HPA believing the value is 0, and it begins to scale down the clusters. Perhaps there's a way for us to work around this through the HPA, but I'm still concerned about why there appears to be missing data, and why it only affects some of the scraped metrics (seconds) and not all of them.

Is there any way to query the exporter for the metric window that it uses and see the times and metrics it knows about?

Using 0.6.4 of the exporter.
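
As a rough way to see which series the exporter is publishing at any given moment, one can pull its Prometheus endpoint directly and filter for the lag metrics. A minimal Scala sketch, assuming the default Prometheus port 8000 (adjust the URL for your deployment); the object name and filtering are illustrative only:

```scala
import scala.io.Source

// Illustrative only: dump the max-lag series the exporter is currently publishing.
// Assumes the exporter's Prometheus endpoint is reachable on port 8000 (the default);
// pass a different URL as the first argument if yours differs.
object DumpLagSeries {
  def main(args: Array[String]): Unit = {
    val endpoint = args.headOption.getOrElse("http://localhost:8000/metrics")
    val source   = Source.fromURL(endpoint)
    try {
      println(s"--- scrape at ${java.time.Instant.now} ---")
      source.getLines()
        .filter(_.startsWith("kafka_consumergroup_group_max_lag")) // matches both the seconds and offset variants
        .foreach(println)
    } finally source.close()
  }
}
```

Running this from a cron job alongside the ServiceMonitor scrapes would show whether the seconds series is actually absent from the endpoint during the gaps, or only missing in Prometheus.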

ryan-dyer-sp · Oct 15 '20 18:10

Additional info: I've enabled debug logging on the app, and every time we get a gap in the metrics we see the following in the logs:

MetaData(26)
2020-10-15 20:09:53,058 INFO  c.l.k.ConsumerGroupCollector$ akka://kafka-lag-exporter/user/consumer-group-collector-develop-apps - Collecting offsets
2020-10-15 20:09:53,058 DEBUG o.a.k.clients.admin.KafkaAdminClient akka://kafka-lag-exporter/user/consumer-group-collector-develop-apps - [AdminClient clientId=adminclient-1] Queueing Call(callName=findAllBrokers, deadlineMs=1602792613058) with a timeout 20000 ms from now.
2020-10-15 20:09:53,058 DEBUG o.a.k.clients.admin.KafkaAdminClient  - [AdminClient clientId=adminclient-1] Sending MetadataRequestData(topics=[], allowAutoTopicCreation=true, includeClusterAuthorizedOperations=false, includeTopicAuthorizedOperations=false) to b-1.kafka:9094 (id:
2020-10-15 20:09:53,060 DEBUG o.a.k.clients.admin.KafkaAdminClient  - [AdminClient clientId=adminclient-1] Queueing Call(callName=listConsumerGroups, deadlineMs=1602792613058) with a timeout 19998 ms from now.
2020-10-15 20:09:53,060 DEBUG o.a.k.clients.admin.KafkaAdminClient  - [AdminClient clientId=adminclient-1] Queueing Call(callName=listConsumerGroups, deadlineMs=1602792613058) with a timeout 19998 ms from now.
2020-10-15 20:09:53,060 DEBUG o.a.k.clients.admin.KafkaAdminClient  - [AdminClient clientId=adminclient-1] Queueing Call(callName=listConsumerGroups, deadlineMs=1602792613058) with a timeout 19998 ms from now.
2020-10-15 20:09:53,060 DEBUG o.a.k.clients.admin.KafkaAdminClient  - [AdminClient clientId=adminclient-1] Sending ListGroupsRequestData() to b-1.kafka:9094 (id: 1 rack: use1-az6). correlationId=178
2020-10-15 20:09:53,060 DEBUG o.a.k.clients.admin.KafkaAdminClient  - [AdminClient clientId=adminclient-1] Sending ListGroupsRequestData() to b-3.kafka:9094 (id: 3 rack: use1-az1). correlationId=179
2020-10-15 20:09:53,060 DEBUG o.a.k.clients.admin.KafkaAdminClient  - [AdminClient clientId=adminclient-1] Sending ListGroupsRequestData() to b-2.kafka:9094 (id: 2 rack: use1-az4). correlationId=180
2020-10-15 20:09:53,064 DEBUG o.a.k.clients.admin.KafkaAdminClient  - [AdminClient clientId=adminclient-1] Queueing Call(callName=findCoordinator, deadlineMs=1602792613064) with a timeout 20000 ms from now.
2020-10-15 20:09:53,065 DEBUG o.a.k.clients.admin.KafkaAdminClient  - [AdminClient clientId=adminclient-1] Sending FindCoordinatorRequestData(key='logstash', keyType=0) to b-1.kafka:9094 (id: 1 rack: use1-az6). correlationId=181
2020-10-15 20:09:53,066 DEBUG o.a.k.clients.admin.KafkaAdminClient  - [AdminClient clientId=adminclient-1] Queueing Call(callName=describeConsumerGroups, deadlineMs=1602792613064) with a timeout 19998 ms from now.
2020-10-15 20:09:53,066 DEBUG o.a.k.clients.admin.KafkaAdminClient  - [AdminClient clientId=adminclient-1] Sending DescribeGroupsRequestData(groups=[logstash], includeAuthorizedOperations=false) to b-1.kafka:9094 (id: 1 rack: use1-az6). correlationId=182
2020-10-15 20:09:53,069 DEBUG o.a.k.clients.admin.KafkaAdminClient  - [AdminClient clientId=adminclient-1] Queueing Call(callName=findCoordinator, deadlineMs=1602792613069) with a timeout 20000 ms from now.
2020-10-15 20:09:53,070 DEBUG o.a.k.clients.admin.KafkaAdminClient  - [AdminClient clientId=adminclient-1] Sending FindCoordinatorRequestData(key='logstash', keyType=0) to b-2.kafka:9094 (id: 2 rack: use1-az4). correlationId=183
2020-10-15 20:09:53,071 DEBUG o.a.k.clients.admin.KafkaAdminClient  - [AdminClient clientId=adminclient-1] Queueing Call(callName=listConsumerGroupOffsets, deadlineMs=1602792613069) with a timeout 19998 ms from now.
2020-10-15 20:09:53,071 DEBUG o.a.k.clients.admin.KafkaAdminClient  - [AdminClient clientId=adminclient-1] Sending OffsetFetchRequestData(groupId='logstash', topics=null, requireStable=false) to b-1.kafka:9094 (id: 1 rack: use1-az6). correlationId=184
2020-10-15 20:09:53,071 DEBUG o.apache.kafka.clients.NetworkClient  - [AdminClient clientId=adminclient-1] Using older server API v6 to send OFFSET_FETCH {group_id=logstash,topics=null,_tagged_fields={}} with correlation id 184 to node 1
2020-10-15 20:09:53,073 DEBUG c.l.k.ConsumerGroupCollector$ akka://kafka-lag-exporter/user/consumer-group-collector-develop-apps - Received Offsets Snapshot:

Timestamp: 1602792593069
Groups: logstash
Earliest Offsets:
  Topic                                                           Partition  Earliest

Latest Offsets:
  Topic                                                           Partition  Offset

Last Group Offsets:
  Group                                                           Topic                                                           Partition  Offset


2020-10-15 20:09:53,073 INFO  c.l.k.ConsumerGroupCollector$ akka://kafka-lag-exporter/user/consumer-group-collector-develop-apps - Updating lookup tables
2020-10-15 20:09:53,073 INFO  c.l.k.ConsumerGroupCollector$ akka://kafka-lag-exporter/user/consumer-group-collector-develop-apps - Reporting offsets
2020-10-15 20:09:53,073 INFO  c.l.k.ConsumerGroupCollector$ akka://kafka-lag-exporter/user/consumer-group-collector-develop-apps - Clearing evicted metrics
2020-10-15 20:09:53,074 INFO  c.l.k.ConsumerGroupCollector$ akka://kafka-lag-exporter/user/consumer-group-collector-develop-apps - Polling in 60 seconds
2020-10-15 20:09:53,074 DEBUG c.l.k.ConsumerGroupCollector$ akka://kafka-lag-exporter/user/consumer-group-collector-develop-apps - Received Meta data:
MetaData(15)

This also appears to be isolated to AWS MSK 2.4.1+, as our 2.2.1 clusters don't appear to experience this issue.

ryan-dyer-sp · Oct 15 '20 20:10

@seglo Any thoughts on this?

ryan-dyer-sp · Oct 19 '20 13:10

@seglo Any insight on this?

ryan-dyer-sp · Oct 23 '20 15:10

When groups, topics, or partitions are not available, they are explicitly removed from the Prometheus endpoint on the subsequent scrape. This is done because there's nothing like a TTL on Prometheus metrics, so if they're not removed, the endpoint can accumulate a lot of stale data over time for groups, topics, and partitions that may no longer exist.
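
To illustrate why that eviction exists (this is only a sketch using the Prometheus Java simpleclient, not the exporter's actual code, and the label set is simplified): a registered gauge keeps serving the last value for every label combination until it is explicitly removed.

```scala
import io.prometheus.client.{CollectorRegistry, Gauge}

object EvictionSketch {
  val registry = new CollectorRegistry()

  // Simplified stand-in for the exporter's per-group max lag metric.
  val maxLagSeconds: Gauge = Gauge
    .build("kafka_consumergroup_group_max_lag_seconds", "Max lag in seconds per consumer group")
    .labelNames("cluster_name", "group")
    .register(registry)

  def report(cluster: String, group: String, value: Double): Unit =
    maxLagSeconds.labels(cluster, group).set(value)

  // Without an explicit remove, a group that no longer exists would keep
  // exporting its last value forever, because simpleclient metrics have no TTL.
  def evict(cluster: String, group: String): Unit =
    maxLagSeconds.remove(cluster, group)
}
```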

According to your logs, it looks like the offset snapshot is completely empty. This has a direct impact on the derived metrics. The offset snapshot depends on consumer groups existing, so the exporter knows which groups to fetch details for. If for some reason the admin client is not returning consumer groups for a time, that would explain why the snapshot is empty.

How frequently do the consumers for your affected topics consume? Do they run as a batch, or are they long-running? One thing you can do to troubleshoot outside of Kafka Lag Exporter is to set up a cron job that polls consumer groups (using the kafka-consumer-groups.sh script) and see whether any are missing at the same times the gaps are observed.
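
In the same spirit as the suggested kafka-consumer-groups.sh cron job, here is a rough Scala sketch using the Kafka AdminClient that logs once a minute whether the group is visible at all. The bootstrap address and group name are taken from the logs above; treat the rest as illustrative.

```scala
import java.util.Properties
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig}
import scala.jdk.CollectionConverters._ // Scala 2.13; use scala.collection.JavaConverters on 2.12

object GroupPresenceCheck {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "b-1.kafka:9094") // broker address from the logs above
    val admin = AdminClient.create(props)
    try {
      while (true) {
        // Same listConsumerGroups call the exporter issues on every poll.
        val groups  = admin.listConsumerGroups().all().get().asScala.map(_.groupId()).toSet
        val present = groups.contains("logstash")
        println(s"${java.time.Instant.now} total-groups=${groups.size} logstash-present=$present")
        Thread.sleep(60000) // match the exporter's 60 second poll interval
      }
    } finally admin.close()
  }
}
```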

seglo · Oct 25 '20 15:10

@seglo Ugh, sorry for the delay in responding to this. The consumer is logstash which is constantly scraping data from the MSK and sending it to ES. The kafka-lag-exporter is set to poll the MSK every 60 seconds. I tried performing my own scrape of the offsets from the cluster on a regular basis and never saw blank data.

ryan-dyer-sp · Nov 10 '20 20:11

The consumer is logstash which is constantly scraping data from the MSK and sending it to ES.

Are you sure that logstash is keeping the consumer group active indefinitely? Is logstash a long-lived process, or does it run as a batch?

seglo · Nov 16 '20 17:11

Logstash is a long-lived process that runs indefinitely. Assuming the offsets are all caught up, it polls the MSK every few seconds (<5) to determine whether the offset has changed and whether to process that change.

ryan-dyer-sp · Dec 22 '20 13:12

I see. For some reason the request to get partition metadata for the logstash group is returning nothing (or it's being filtered out by the allow/deny lists in Kafka Lag Exporter). You could try adding a cron task that dumps partition info using the bin/kafka-consumer-groups.sh script that comes with an Apache Kafka release, to see if you observe the same gaps (lack of data).
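
Complementing the group-presence check above, here is a sketch of the same offset lookup the exporter performs (the listConsumerGroupOffsets call visible in the debug logs). Again, the broker address and group name come from this thread; everything else is illustrative.

```scala
import java.util.Properties
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig}
import scala.jdk.CollectionConverters._ // Scala 2.13

object DumpGroupOffsets {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "b-1.kafka:9094") // broker address from the logs above
    val admin = AdminClient.create(props)
    try {
      // The same admin call that backs the "Last Group Offsets" section of the snapshot.
      val offsets = admin
        .listConsumerGroupOffsets("logstash")
        .partitionsToOffsetAndMetadata()
        .get()
        .asScala
      if (offsets.isEmpty)
        println(s"${java.time.Instant.now} no partition offsets returned for group 'logstash'")
      else
        offsets.foreach { case (tp, om) =>
          println(s"${java.time.Instant.now} ${tp.topic()}-${tp.partition()} offset=${om.offset()}")
        }
    } finally admin.close()
  }
}
```

An empty result here at the same time as a metrics gap would point at the broker or admin-client side rather than the exporter itself.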

Kafka Lag Exporter will explicitly remove metrics for partitions when no partition metadata is returned on a subsequent scrape where it existed in the scrape before. We've sketched a way to work around this in #63. Is this something you might be interested in working on?

seglo · Dec 22 '20 16:12

As #63 is now closed, I guess this one may be closed too?

sverrehu · Oct 28 '22 13:10

@sverrehu Thanks for reviewing some of these open issues.

Yes. Many of these corner cases spurred ideas for new features, but unfortunately it's not something I can allocate a lot of time to. Everyone's welcome to pick up the work and submit a PR, of course.

seglo · Oct 30 '22 20:10