
Polling the HTTP /lag endpoint causes strange results

Open alexchowle opened this issue 10 months ago • 1 comments

This may be my limited understanding, but it's worth writing down to see whether I'm misguided or have found an issue. My setup is one Burrow instance consuming a single Kafka cluster. I have a polling script that calls the /lag endpoint periodically and forwards the retrieved JSON payload to a time-series DB for visualisation and alerting.

After a period of ~36 hours I start to see a large increase in the calculated current_lag value per Topic Partition. I can find no natural explanation for this lag in terms of an increase of messages produced, or a slowdown in the Consumers within the Group. What I have found is:

  • Restarting the poller makes the current_lag reset to zero.
  • Starting a second poller alongside causes different current_lag values to be reported from the first poller.

To be clear: I have no fancy logic in the pollers - they simply perform an HTTP GET on v3/kafka/$CLUSTER/consumer/$CONSUMER_GROUP/lag and report the current_lag per Partition.
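
For illustration, a minimal sketch of what a poller like this amounts to. This is not the actual script. The base URL (http://localhost:8000), the function names, and the assumed response shape (a top-level "status" object containing a "partitions" list whose entries carry "topic", "partition" and "current_lag") are assumptions for the example and should be checked against the payload your Burrow instance actually returns.

```python
import json
import urllib.request

# Assumption: Burrow's HTTP server is reachable here; adjust to your setup.
BURROW_URL = "http://localhost:8000"


def lag_url(base, cluster, group):
    """Build the URL of the /lag endpoint for one consumer group."""
    return f"{base.rstrip('/')}/v3/kafka/{cluster}/consumer/{group}/lag"


def fetch_lag(base, cluster, group, timeout=10.0):
    """Perform a plain HTTP GET and return the decoded JSON payload."""
    with urllib.request.urlopen(lag_url(base, cluster, group), timeout=timeout) as resp:
        return json.load(resp)


def current_lag_by_partition(payload):
    """Map (topic, partition) -> current_lag.

    Assumes the payload has the shape {"status": {"partitions": [...]}};
    a payload without that structure yields an empty mapping.
    """
    partitions = payload.get("status", {}).get("partitions") or []
    return {(p["topic"], p["partition"]): p.get("current_lag") for p in partitions}
```

Each polling cycle would call fetch_lag(BURROW_URL, cluster, group) and forward current_lag_by_partition(...) to the time-series DB. Nothing is kept between calls, so the poller itself holds no state that a restart could reset.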

Have I misconfigured something?

This is Burrow v1.8.0

alexchowle, Nov 28 '24 13:11