Burrow showing abnormally high consumer Lag

Open Vikash08Mishra opened this issue 3 years ago • 3 comments

We occasionally notice Burrow consumer lag metrics (Burrow partition lag) suddenly spiking very high, to tens of millions, within a one-minute interval. The lag then comes back down to a normal value within another minute or so.

[image]

The above is observed across services and even across different clusters. We do not expect a data load large enough to raise lag that much within a minute, nor are our consumers currently capable of processing tens of millions of records in a minute to bring the lag back down. Other users have observed a similar issue in the past (https://stackoverflow.com/questions/51534532/kafka-consumer-lag-monitoring-with-linkedin-burrow-jumps-intermittently).

  • I notice this issue when Kafka restarts and also when Burrow itself restarts (Burrow runs inside a container), but not every Kafka or Burrow restart reproduces it. So I suspect the issue is related to how Burrow interprets lag. Below is my theory, based on Burrow's design and the way it interprets offsets. Design reference: https://engineering.linkedin.com/apache-kafka/burrow-kafka-consumer-monitoring-reinvented
  • Burrow stores both the latest partition offset and the consumer offset in its offset storage module. It keeps updating that data and periodically publishes lag based on the difference between the two offsets.
  • When Burrow restarts, that storage is lost. If, after starting, it fetches the latest partition offset first and the periodic interval fires before it has fetched the consumer offset, a metric is emitted with a very high partition offset and a consumer offset of zero. The difference between the two is then very large, which explains the sudden high lag. By the time the next metric is emitted, the consumer offset has also been fetched, so the lag (current partition offset - consumer offset) is normal again. This explains the sudden drop in consumer lag within a minute or so.
  • There is also a chance that, after a Burrow restart, both the consumer offset and the partition offset are fetched before Burrow emits metrics. Hence not every restart results in a sudden spike in consumer lag.
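
The theory in the bullets above can be sketched as a small Go simulation. This is only an illustration under assumptions: `partitionState` and `naiveLag` are hypothetical names, not Burrow's actual types or functions, and the offsets are made-up values. It shows how a plain difference produces a huge lag for one interval when the consumer offset has not been fetched yet.

```go
package main

import "fmt"

// partitionState is a hypothetical, simplified stand-in for what an offset
// store might hold for one partition. It is not Burrow's actual type.
type partitionState struct {
	brokerOffset   int64 // latest (log-end) offset of the partition
	consumerOffset int64 // last committed consumer offset; zero until fetched
}

// naiveLag computes lag as a plain difference, so a consumer offset that
// has not been fetched yet is treated as zero (the behaviour assumed in
// the theory, not confirmed against Burrow's code).
func naiveLag(s partitionState) int64 {
	return s.brokerOffset - s.consumerOffset
}

func main() {
	// Right after a restart: broker offset fetched, consumer offset not yet.
	s := partitionState{brokerOffset: 50000000}
	fmt.Println("first interval lag:", naiveLag(s))

	// By the next interval the committed consumer offset has arrived.
	s.consumerOffset = 49999200
	fmt.Println("second interval lag:", naiveLag(s))
}
```

With these illustrative numbers, the first interval reports a lag equal to the whole partition offset, and the second reports the real backlog of a few hundred messages.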

Can someone help me confirm whether the above understanding of the issue is correct, or offer any other explanation or pointers on why it may be happening?

Burrow version: 1.3.8; the same was observed in 1.3.6 as well. Kafka version: observed with multiple Kafka versions (2.6, 2.7.0, 2.8.1).

Vikash08Mishra, Feb 04 '22 13:02

Hi team, can someone help confirm whether the above theory of how Burrow interprets metrics is correct, or share any other pointers explaining the sudden, very high consumer lag reported by Burrow? Thanks.

Vikash08Mishra, Feb 28 '22 05:02

We also observe the same behaviour in Burrow when it restarts: a temporary large spike in lag.

Without looking at the code, I have basically the same theory as you for the cause. I wonder if better behaviour would be for Burrow to not emit any lag metrics until it's had a chance to poll at least once for consumer offsets.
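
The suggestion above, holding back lag metrics until a consumer offset has been polled at least once, might look roughly like the following Go sketch. It assumes a hypothetical `partitionState` with a `consumerSeen` flag and a hypothetical `lagIfReady` helper; neither is taken from Burrow's code.

```go
package main

import "fmt"

// partitionState is a hypothetical, simplified model, not Burrow's actual type.
type partitionState struct {
	brokerOffset   int64 // latest (log-end) offset of the partition
	consumerOffset int64 // last committed consumer offset
	consumerSeen   bool  // set once at least one consumer offset has been polled
}

// lagIfReady returns the lag and true only when a consumer offset is known.
// Until then it returns false so the caller can skip emitting the metric.
func lagIfReady(s partitionState) (int64, bool) {
	if !s.consumerSeen {
		return 0, false
	}
	return s.brokerOffset - s.consumerOffset, true
}

func main() {
	// Right after a restart only the broker offset is known.
	s := partitionState{brokerOffset: 50000000}
	if _, ok := lagIfReady(s); !ok {
		fmt.Println("consumer offset not seen yet, skipping metric")
	}

	// Once the consumer offset has been polled, the metric can be emitted.
	s.consumerOffset, s.consumerSeen = 49999200, true
	if lag, ok := lagIfReady(s); ok {
		fmt.Println("lag:", lag)
	}
}
```

The trade-off in this sketch is a short gap in the metric after a restart instead of a false spike, which is usually easier to handle in alerting rules.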

andpol, Mar 23 '22 21:03

I've tried out the configuration introduced in this PR, https://github.com/linkedin/Burrow/pull/488, which enables consuming the newest offsets first (the OffsetNewest starting position in the Kafka client) and back-fills historical data in parallel. It really helped in small clusters, but the problem still persists in big ones.

Sipleman, Dec 12 '22 04:12