
Improve lag evaluation rules

rconn01 opened this issue 7 years ago • 21 comments

I'm opening this issue to start a conversation around possible improvements to the lag detection rules.

Currently Burrow does a great job of detecting lag when a consumer commits at a defined interval. This gives Burrow a defined window, and all the rules work pretty well. If a consumer writes its offsets every second, Burrow will always have a window of 10 seconds. If a consumer only writes its offsets when it consumes data, Burrow starts to create fairly random windows, which can lead to unpredictable results.

Detecting a stopped consumer in a reasonable time

If there is a low throughput topic and a consumer has built a window (time between first commit and last) of 2 hours, it will take at least that long for a stop to be detected. Current lag will keep increasing, but no commits in the window are changing, so Burrow never detects a stop.

Would it make sense to have a rule that applies logic against current lag to help improve this case? Something like: if over the last X checks current lag is increasing or staying the same (non-zero), consider the consumer stopped.

False positive of a stopped consumer

If there is a low throughput topic, and the consumer on that topic only writes offsets when it consumes, Burrow can give a false positive of lag. Let's say that a topic only gets data at night. The last offsets it consumed give the group a window (last commit time minus first) of 20 minutes. The consumer then goes for hours without getting any data. Burrow doesn't report the group in error because current lag is zero. Depending on the timing of events, the consumer can be reported as lagging when the first message comes in: Burrow updates its offset from the broker before the consumer commits, and rule 4 marks the consumer as stopped since current lag is no longer zero. The lag immediately clears after the first commit comes in.

Match minDistance to offsetRefresh

Would it make sense to remove the configuration for minDistance and have it always match the offsetRefresh interval? If consumers are writing faster than the refresh interval, they will always look like they are making progress at some point in the window, and lag might never show. This is best shown with an example.

In this example 10 messages a second are being written, and the consumer is consuming at a rate of 1 record per second, writing offsets every 2 seconds. Burrow is refreshing offsets every 10 seconds.

| Time | T0 | T5 | T10 | T12 | T14 | T20 |
|---|---|---|---|---|---|---|
| Head | 0 | 50 | 100 | 120 | 140 | 200 |
| Committed offset | 0 | 4 | 10 | 12 | 14 | 20 |
| Burrow head | 0 | 0 | 100 | 100 | 100 | 200 |
| Burrow lag | 0 | 0 | 90 | 88 | 86 | 180 |
| Actual lag | 0 | 46 | 90 | 108 | 126 | 180 |

Head = latest offset on the partition log
Committed offset = the most recent commit the consumer has written
Burrow head = the offset that Burrow thinks is the head of the partition
Burrow lag = the lag that Burrow calculated
Actual lag = the amount that the consumer is actually lagging

With this example you can see it will look like the consumer is always making progress in the window, even though it is lagging.

I think an improvement might be to make sure that minDistance is >= the offset refresh interval. This way the window can only advance once per offset refresh.

rconn01 avatar Dec 06 '17 23:12 rconn01

So, on the first one, I understand the situation you're describing. However, there's no state stored between evaluations at present. There was a request for something like this in #223, but it's a non-trivial amount of work.

On the second one, the case you're describing is unlikely to happen. Since the lag would have been zero in the period leading up to the new data, even if there is suddenly lag it will be considered OK by Burrow. Only when all the zero-lag offset commits roll out of the window will the partition be marked as lagging (even if current lag is positive).

On the last one, definitely there are issues when the consumer is committing offsets faster than we are fetching offsets from the broker. However, you're talking about configurations in two different places. offset-refresh is set per-cluster, whereas min-distance is set on the storage module. There isn't a clear answer for which cluster value to use, and it's very easy to end up with unintended results here. While it would be reasonable to warn of this configuration somewhere in the app, doing so would create a fragile link between the cluster modules and the storage modules.

toddpalino avatar Dec 12 '17 20:12 toddpalino

For the first one, I'm not sure it's exactly the same issue that's reported there, but it might be the same work to make both of them possible. I think I was talking about the opposite of that issue: that issue seems to be about lag alerting too quickly, while I was referring to rule 4 taking too long to kick in because of the slow rate of offset commits. CurrentLag could be increasing for hours, but because of the offset window time, a stopped alert will never be produced by Burrow. To solve this, I was thinking a window of current lag could be built, similar to how a window is built with the consumer lag. Rule 3 could then be applied to the current lag window as well. This way, if currentLag is increasing over the last 10 checks, you get a stopped alert, and the notifier check interval drives the time to alert rather than just the consumer write interval, which has stopped. Thoughts?


For the second one: it's true it will be considered OK with 0s in the window. The issue is rule 4 wins first. Here is an example.

T1: Consumer writes an offset (lag is 0)
T2: Consumer writes an offset (lag is 0)
...
T10: Consumer writes an offset (lag is 0)

At this point a window size of 10 minutes is built up. Let's assume the partition stops receiving data for 20 minutes.

T39: Head minus tail is 10 minutes. Time now minus last offset is 19 minutes. This would trigger rule 4, but since current lag is 0, rule 4 is not triggered.

T40: Data is sent to the partition, and broker offsets are refreshed. The consumer has not written its offset yet. This leaves currentLag at 1, which now triggers rule 4: head minus tail is 10 minutes, which is less than time now minus last offset time (20 minutes), and current lag is 1.

This lag goes away at the next notifier interval, but the timing is enough to trigger false positives.


For the third: I agree it would be a little strange to tie the two configs together as they are today. Would it make sense to enforce it automatically, or maybe introduce a new config at the cluster level? Since the consumer has a reference to the cluster, maybe it gets enforced at the consumer level. The consumer could read the offset, but only write to the storage tier at a minimum frequency that aligns with the refresh interval.

rconn01 avatar Dec 20 '17 15:12 rconn01

@toddpalino any thoughts here on what might be the best way to improve some of these?

We see these issues specifically with the open source MirrorMaker product that ships with Kafka. Since MirrorMaker only commits on timed intervals for partitions that have data, these timing issues show up pretty frequently, leading to many false positives. Is there a recommended set of configs that should be used with MirrorMaker, or are there other things that can be done to improve the monitoring of consumer lag from MirrorMaker?

rconn01 avatar Jan 18 '18 14:01 rconn01

I'm surprised that you're seeing a lot of false positives with mirror maker, as we use it all over the place, with a variety of topics (many thousands) without any real problems. Let me share our basic configs here:

[general]
pidfile="logs/burrow.pid"

[logging]
filename="logs/burrow.log"
level="info"
maxsize=100
maxbackups=100
maxage=7
use-compression=true

[tls.riddler]
certfile="SOMEFILE"
keyfile="SOMEFILE"
cafile="SOMEFILE"

[client-profile.default]
client-id="burrow-i001"
tls="riddler"

[zookeeper]
servers=["zk-hostname.linkedin.com:12345"]
timeout=30
root-path="/burrow-test-tpalino"

[httpserver.default]
address=":28765"
timeout=300

[storage.default]
class-name="inmemory"
workers=20
intervals=15
expire-group=604800
min-distance=1

[cluster.metrics]
class-name="kafka"
client-profile="default"
servers=["kafka-hostname.linkedin.com:9021"]
topic-refresh=120
offset-refresh=30

[consumer.metrics]
class-name="kafka"
cluster="metrics"
client-profile="default"
servers=["kafka-hostname.linkedin.com:9021"]
offsets-topic="__consumer_offsets"
group-blacklist="^((console-consumer-|python-kafka-consumer-|quick-|orca-client-).*|.*(.linkedin.com|-mn[1-9]|-ld[1-9])-[0-9]+)$"
group-whitelist=""
start-latest=true

[consumer.metrics-zk]
class-name="kafka_zk"
cluster="metrics"
servers=["zk-hostname.linkedin.com:12345"]
zookeeper-path="/kafka-metrics"
zookeeper-timeout=30
group-blacklist="^((console-consumer-|python-kafka-consumer-|quick-|orca-client-).*|.*(.linkedin.com|-mn[1-9]|-ld[1-9])-[0-9]+)$"
group-whitelist=""

Now, having said that, we also push the results out to another alerting system (with counts for partitions in various states), and use that system for the actual alerting. It waits for a few cycles of error before alerting.

toddpalino avatar Jan 19 '18 17:01 toddpalino

I'm going to guess your topics have constant throughput? We see this on topics that have low throughput, or data at specific times of the day. Specifically we see scenario number 2 explained above.

We have a good number of topics that only get data at night. While the topic is getting data, offsets are written on the MirrorMaker offset commit interval. This ends up building a window of, say, 100 seconds. The topic then stops getting data for the night. Once the topic starts getting data again the next night, based on timing, the broker head can be checked before MirrorMaker writes its offset, which gives a current lag greater than 0. Since current lag is greater than 0 and the window is small from when we were getting data, we get a false positive alert.

If the topic had constant data on it I don't think the false positive would happen.

rconn01 avatar Jan 19 '18 21:01 rconn01

Our topics range all over, from no data at all to rapid fire. That said, I think we're starting to see some of this now with a recent deployment internally. Since we were not seeing it until just now, I need to review the changes we picked up to see if it's something specific.

toddpalino avatar Jan 24 '18 21:01 toddpalino

Alright, so specifically, #292 caused the problem here (which, coincidentally, was the fix for #290, which was also an issue you reported). We hadn't seen it internally because we were still running a pre-release branch from the end of November. It's very difficult to cover both cases here - you have a consumer that has not committed offsets in a while, and then suddenly there is new data. We don't know if the consumer has stopped committing offsets because it is broken, or if it's just slow getting to the new data that has come in. And without adding some sort of pause per-partition, where we say "OK, we got one bad check on this partition, but let's not call it a problem until we get a second bad check," it's hard to prevent a false alert here.

That solution becomes notifier-driven, which isn't bad, but it gets complicated. It's the evaluator that would have to store the state between checks, and the evaluator only has a cache of the last result, which wouldn't be helpful here (we either return the last result if it's cached or reevaluate - we'd need to cache for longer, which significantly changes the evaluation). I'm not certain there's a way to reliably detect this in a single evaluation cycle.

toddpalino avatar Jan 24 '18 22:01 toddpalino

After having chewed this over a bit while on a ground hold at SFO trying to get home, I think the window of CurrentLag may be the solution to the problem as well. We can then reason that the window of CurrentLag is a fixed time interval (number of intervals times the refresh interval). So if the CurrentLag is ever 0, then a partition would not be stopped. It would slow down detection of stopped partitions, but not by an overwhelming amount, and I think it's the best compromise.

toddpalino avatar Jan 24 '18 23:01 toddpalino

Yeah I was just typing up a response proposing that again. The best solution I have come across is keeping a window of current lag, because like you say we always have a defined window size since the check is being driven by a timer instead of the clients. I think the biggest win with a current lag window is that it decouples the logic from what the clients are doing.

I also wonder if it makes sense to only have a window of current lag instead of a window driven by the client. I think the window with current lag helps solve the 3 issues I described here, since most of them are timing issues. I also think the window with current lag would still be able to detect the same cases that are detected today with the client window, but obviously there are many use cases that would need to be thought through, and I probably don't know the half of them.

rconn01 avatar Jan 24 '18 23:01 rconn01

Keeping a window of current lag is a little more difficult, however, as it requires the consumer information to be iterated over and updated for each broker offset (there's no mapping of topic-partitions to consumer groups at this time). Right now I'm trying to work through if this can be accomplished with a history of broker offsets, similar to how we store a history of consumer offsets. I think it can be done, but it requires a little more finesse with the logic.

toddpalino avatar Jan 25 '18 01:01 toddpalino

(Apologies for the frequent comments - I like to talk out problems and solutions, as it both helps me reason through them and lets me get perspective from others)

If we store a history of broker offsets, I believe we can work out the second problem you described, the one that's plaguing us right now. When evaluating the consumer's status for a partition, we can calculate the lag of the latest committed offset against the entire window of broker offsets. If it was 0 at any point over that window, we consider the partition to be in a good state. Since the window of broker offsets will be a short period of time, this gives us a buffer before we declare the consumer to be stopped. I think this is a good idea, and I'm going to work on the code right now (since I've had to revert our deployment internally because of the false alerts).

So let's go back to the first issue you described - detecting stopped consumers faster. Given the history of broker offsets, we could now say that if the lag at each point in the broker offset history is non-zero, and the consumer hasn't committed offsets over that entire history, the consumer is stopped. But is it? We don't really know when the consumer is supposed to commit offsets. If the broker window is 5 minutes (10 offsets at 30 second intervals) and the consumer has a 10 minute offset commit interval, this would be unreasonable. I'm not yet convinced that there's a good way to detect stopped consumers faster.

toddpalino avatar Jan 25 '18 01:01 toddpalino

Keep the comments coming, it helps me understand as well. Also, sorry for the delays here and the long-winded response. I've been trying to wrap my head around more of the code base.

I want to make sure I understand the difference between what you are describing as the current lag window and the broker history. To me they feel like the same thing. Specifically this statement: "similar to how we store a history of consumer offsets". Isn't that history part of the lag window today?

If I understand correctly I think what you are saying is:

CurrentLag -- This would be calculated on every broker offset refresh. When the broker offset is refreshed, current lag would be calculated based on the last offset that we have seen from the consumer. The problem is, there is (currently) no easy way to see which consumer offsets need to be checked to be able to build the currentLag window for each consumer.

BrokerHistory -- For each topicPartition a list of head offsets is kept. When the rules are evaluated, that history is used to evaluate a new rule. The new rule takes the last offset seen from the consumer and compares it against every broker offset in the history. So instead of having a moving window of current lag, one is calculated by the notifier at that point in time, based on the last offset seen from the client.

Assuming I have the above correct: I agree what you have proposed will help solve the current issue we are describing. I think it might introduce an issue of its own, though: it might prevent rule 3 (lag continually increasing) from firing.

Here's a quick example

| Time | Broker offset | Client offset | Lag | Lag to broker history |
|---|---|---|---|---|
| 1 | 5 | 3 | 2 | 0 |
| 2 | 10 | 6 | 4 | 0 |
| 3 | 15 | 9 | 6 | 0 |
| 4 | 20 | 12 | 8 | 0 |
| 5 | 25 | 15 | 10 | 0 |
| 6 | 30 | 18 | 12 | 0 |
| 7 | 35 | 21 | 14 | 5 |
| 8 | 40 | 24 | 16 | 10 |
| 9 | 45 | 27 | 18 | 15 |
| 10 | 50 | 30 | 20 | 20 |

The lag is always increasing, but since the current offset calculated against the broker history results in a 0, it will end up being OK. I haven't thought through all the cases, but I think in most cases the 0 will eventually drop out, so that might be fine, as it would just delay when the issue is detected, but I wanted to call it out.

This is where I think the current lag window helps, since the lag is calculated at the point of time when the offset was fetched, instead of later on.

I suppose the lag window and the brokerHistory could be used together for the new rule you are describing. For each offset in the brokerHistory, use the most recent consumer offset without going past the timestamp of the broker offset.


As for detecting a stopped consumer faster, you bring up an interesting point about how we know whether a consumer is stopped or just hasn't committed. I think at some point, if a consumer hasn't committed and there is data, it has to be considered stopped; otherwise we can never know accurately. In my mind, with the example you have above, if a consumer hasn't committed in 10 minutes and there is data, I would think something is wrong with that consumer group and it needs investigation. If consumers are committing at intervals that large, does Burrow as a whole serve much value? It seems alerts would be extremely delayed. I obviously don't know all the use cases, but I think what you describe would be a reasonable trade-off for detecting stopped consumers faster.


The idea in my head was to basically flip everything. So for each group there would be a window of currentLag and a single point for the client offset. On a timed interval the broker offsets are fetched and currentLag is calculated for each consumer, based on the last received offset. All of the current rules are applied against this window just like they are applied against the client window today. This would allow there to be a defined window size for all consumers. Unfortunately, it sounds like this might be pretty hard to pull off with the way the code is currently structured.

rconn01 avatar Jan 25 '18 16:01 rconn01

My plan was not to check the broker history of offsets except for potentially stopped partitions, so the increasing lag rule would never come into play here.

As far as long offset commit intervals goes, it's application dependent. The offset commit interval is essentially an indication of how long you're willing to have a consumer problem - the longer the interval, the longer you're willing to have a problem before seeing it. Burrow works just fine regardless of the interval, it's just that it takes longer to detect problems if you're taking longer to commit offsets. That was one of the designed features - to not have to tightly couple Burrow's configuration to the client configuration, and to be able to handle different types of consumers.

While the idea of flipping the offsets around seems interesting, I think it ultimately causes more problems than it fixes. The lag windows become tightly tied to the broker refresh intervals, which means you now need to worry about when the consumers are committing offsets. The alerting becomes less about the state of the consumer and more about the lag itself, which is what Burrow tries to avoid. What you're proposing works for a fixed type of consumer client, but seems to fail in the general case.

toddpalino avatar Jan 25 '18 19:01 toddpalino

Makes sense, in that case I'm good with the proposal of broker history to solve the case of false positives that we are seeing.

As for the coupling of the consumers -- totally agree Burrow should be unaware of the consumer. Curious how you deal with the other two issues I described: detecting a stopped consumer faster, and consumers committing faster than the broker refresh rate. Specifically for the min-distance issue, the only thing I've been able to do is couple the min-distance config to how fast consumers are committing. Since you have such a large deployment, I'm wondering how you have avoided running into either of those issues.

rconn01 avatar Jan 25 '18 23:01 rconn01

On the speed of detecting a stopped consumer, we just suck it up. If the consumer has chosen long offset commit times, it's on them, and we clearly call out how that impacts detection windows.

As far as consumers committing faster than the broker refresh window, we've been extending the min-distance config to account for this now that we're seeing it (up until recently, we didn't have a consumer we cared that much about that did this). That seems to cover it, so it's really just a matter of making sure our configs for Burrow are sane. Since we control them, it's not hard.

toddpalino avatar Jan 29 '18 22:01 toddpalino

I can't tell if this is a related issue or not, but I'm seeing stuff like the following when I stop a consumer:

kafka_burrow_partition_current_offset{cluster="stage",group="test-burrow",partition="0",topic="TESTdelete_station"} 4
kafka_burrow_partition_current_offset{cluster="stage",group="test-burrow",partition="1",topic="TESTdelete_station"} 4
kafka_burrow_partition_current_offset{cluster="stage",group="test-burrow",partition="2",topic="TESTdelete_station"} 5
kafka_burrow_partition_current_offset{cluster="stage",group="test-burrow",partition="3",topic="TESTdelete_station"} 5
kafka_burrow_partition_current_offset{cluster="stage",group="test-burrow",partition="4",topic="TESTdelete_station"} 4
kafka_burrow_partition_lag{cluster="stage",group="test-burrow",partition="0",topic="TESTdelete_station"} 0
kafka_burrow_partition_lag{cluster="stage",group="test-burrow",partition="1",topic="TESTdelete_station"} 0
kafka_burrow_partition_lag{cluster="stage",group="test-burrow",partition="2",topic="TESTdelete_station"} 0
kafka_burrow_partition_lag{cluster="stage",group="test-burrow",partition="3",topic="TESTdelete_station"} 0
kafka_burrow_partition_lag{cluster="stage",group="test-burrow",partition="4",topic="TESTdelete_station"} 0
kafka_burrow_partition_max_offset{cluster="stage",group="test-burrow",partition="0",topic="TESTdelete_station"} 0
kafka_burrow_partition_max_offset{cluster="stage",group="test-burrow",partition="1",topic="TESTdelete_station"} 0
kafka_burrow_partition_max_offset{cluster="stage",group="test-burrow",partition="2",topic="TESTdelete_station"} 0
kafka_burrow_partition_max_offset{cluster="stage",group="test-burrow",partition="3",topic="TESTdelete_station"} 0
kafka_burrow_partition_max_offset{cluster="stage",group="test-burrow",partition="4",topic="TESTdelete_station"} 0
kafka_burrow_topic_partition_offset{cluster="stage",partition="0",topic="TESTdelete_station"} 6
kafka_burrow_topic_partition_offset{cluster="stage",partition="1",topic="TESTdelete_station"} 5
kafka_burrow_topic_partition_offset{cluster="stage",partition="2",topic="TESTdelete_station"} 5
kafka_burrow_topic_partition_offset{cluster="stage",partition="3",topic="TESTdelete_station"} 6
kafka_burrow_topic_partition_offset{cluster="stage",partition="4",topic="TESTdelete_station"} 6
kafka_burrow_total_lag{cluster="stage",group="test-burrow"} 6

(courtesy of burrow_exporter). This is a trivial example, but it shows the problem. The current offsets for the partitions of TESTdelete_station are [6, 5, 5, 6, 6]. The current offsets for my consumer group test-burrow are [4, 4, 5, 5, 4]. Subtract one from the other and you should get lags of [2, 1, 0, 1, 2], but this reports [0, 0, 0, 0, 0]. kafka_burrow_total_lag is correct, but kafka_burrow_partition_lag is not. This seems very strange to me (especially since the arithmetic to compute kafka_burrow_partition_lag is so simple!).

MadDataScience avatar May 21 '18 23:05 MadDataScience

Update: it turns out burrow_exporter has not been updated to use v3 of the Burrow API (it's amazing it worked as well as it did!). I updated it so it would populate kafka_burrow_partition_lag from current-lag, but it's still strange to me that the actual lags reported in Burrow (apart from current-lag) seem to be incorrect.

MadDataScience avatar May 23 '18 16:05 MadDataScience

I may also be seeing a similar issue. Using Burrow with the Prometheus exporter, sum(kafka_burrow_total_lag) by (group) doesn't match sum(kafka_burrow_partition_lag) by (group). Not for all topics/consumers, though.

jacum avatar Jun 18 '18 20:06 jacum

We hit the first issue (takes too long to report a stopped consumer) yesterday, due to a consumer which kept failing and only got a handful of commits in over a number of hours. This led to a 5 hour evaluation window, which is way larger than we'd ever expect for any of our consumers.

I understand that this could be valid consumer behaviour if you only want to commit every 30 minutes, but in our case the commits were not evenly spaced out; they were sporadic. The two typical reasons for sporadic commits are:

  • intermittent errors, or
  • a low traffic partition (committing only when needed)

I think in both of those cases it's reasonable to shorten the window for considering the consumer stopped, because neither of those behaviours is OK when there is data to consume.

What if we took the minimum duration between any two commits in the window, and multiplied that by the number of commits in the window? That way if you commit regularly, there's no change. But the more unevenly your consumer commits, the shorter the window becomes. It'd never go shorter than num_samples * min_frequency, which would give you some control over the lower bound.

If committing is driven by volume more than traffic, this makes sense - the minimum time between commits is likely to be in line with how your consumer acts when there is a constant supply of new messages, whereas large times between commits are likely to be caused by errors or quiet periods.

timbertson-zd avatar Nov 22 '18 00:11 timbertson-zd

Detecting a stopped consumer in a reasonable time is problematic for us in Kafka Streams applications with low/spiky throughput input topics. Kafka Streams applications do not allow setting enable.auto.commit to true on the consumer, in order to provide their transactional guarantees (src). That means there is no trivial way to guarantee regular commit intervals apart from explicitly generating more events in the input topics (read: keepalives).

The solution proposed above (updating consumer lag status based on the broker offset history instead of only the consumer group offset) would be really helpful for monitoring Streams applications. Is there anything planned in this direction?

bfncs avatar May 22 '19 07:05 bfncs

I can't tell if this is a related issue or not, but I'm seeing stuff like the following when I stop a consumer:

kafka_burrow_partition_current_offset{cluster="stage",group="test-burrow",partition="0",topic="TESTdelete_station"} 4
kafka_burrow_partition_current_offset{cluster="stage",group="test-burrow",partition="1",topic="TESTdelete_station"} 4
kafka_burrow_partition_current_offset{cluster="stage",group="test-burrow",partition="2",topic="TESTdelete_station"} 5
kafka_burrow_partition_current_offset{cluster="stage",group="test-burrow",partition="3",topic="TESTdelete_station"} 5
kafka_burrow_partition_current_offset{cluster="stage",group="test-burrow",partition="4",topic="TESTdelete_station"} 4
kafka_burrow_partition_lag{cluster="stage",group="test-burrow",partition="0",topic="TESTdelete_station"} 0
kafka_burrow_partition_lag{cluster="stage",group="test-burrow",partition="1",topic="TESTdelete_station"} 0
kafka_burrow_partition_lag{cluster="stage",group="test-burrow",partition="2",topic="TESTdelete_station"} 0
kafka_burrow_partition_lag{cluster="stage",group="test-burrow",partition="3",topic="TESTdelete_station"} 0
kafka_burrow_partition_lag{cluster="stage",group="test-burrow",partition="4",topic="TESTdelete_station"} 0
kafka_burrow_partition_max_offset{cluster="stage",group="test-burrow",partition="0",topic="TESTdelete_station"} 0
kafka_burrow_partition_max_offset{cluster="stage",group="test-burrow",partition="1",topic="TESTdelete_station"} 0
kafka_burrow_partition_max_offset{cluster="stage",group="test-burrow",partition="2",topic="TESTdelete_station"} 0
kafka_burrow_partition_max_offset{cluster="stage",group="test-burrow",partition="3",topic="TESTdelete_station"} 0
kafka_burrow_partition_max_offset{cluster="stage",group="test-burrow",partition="4",topic="TESTdelete_station"} 0
kafka_burrow_topic_partition_offset{cluster="stage",partition="0",topic="TESTdelete_station"} 6
kafka_burrow_topic_partition_offset{cluster="stage",partition="1",topic="TESTdelete_station"} 5
kafka_burrow_topic_partition_offset{cluster="stage",partition="2",topic="TESTdelete_station"} 5
kafka_burrow_topic_partition_offset{cluster="stage",partition="3",topic="TESTdelete_station"} 6
kafka_burrow_topic_partition_offset{cluster="stage",partition="4",topic="TESTdelete_station"} 6
kafka_burrow_total_lag{cluster="stage",group="test-burrow"} 6

(courtesy of burrow_exporter). This is a trivial example, but it shows the problem. The current offsets for the partitions of TESTdelete_station are [6, 5, 5, 6, 6]. The current offsets for my consumer group test-burrow are [4, 4, 5, 5, 4]. Subtract one from the other and you should get lags of [2, 1, 0, 1, 2], but this reports [0, 0, 0, 0, 0]. kafka_burrow_total_lag is correct, but kafka_burrow_partition_lag is not. This seems very strange to me (especially since the arithmetic to compute kafka_burrow_partition_lag is so simple!).

I solved this problem by updating Burrow to 1.5.

zhouyaxiong avatar Jul 20 '22 02:07 zhouyaxiong