
Consumer status/lag showing NOTFOUND

Open carterdanko opened this issue 7 years ago • 9 comments

Using the API endpoint /v3/kafka/CLUSTER/consumer I am able to get a list of consumers; however, when I pass any of those into either /v3/kafka/CLUSTER/consumer/ENTITY_FROM_CALL/status or /v3/kafka/CLUSTER/consumer/ENTITY_FROM_CALL/lag, it returns "cluster or consumer group not found". I am also able to get a list of topics and then see the offsets for a particular topic.
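For reference, the calls involved can be sketched like this. This is a minimal sketch, not part of the original report: the base address comes from the httpserver config below, and the group name is hypothetical.

```python
from urllib.parse import quote

# Assumed Burrow address, taken from the [httpserver.default] config below.
BASE = "http://localhost:8081/v3/kafka"

def consumer_list_url(cluster):
    """URL that lists the consumer groups Burrow knows for a cluster."""
    return f"{BASE}/{quote(cluster)}/consumer"

def consumer_detail_url(cluster, group, endpoint="status"):
    """URL for a group's detail; endpoint is 'status' or 'lag'."""
    return f"{BASE}/{quote(cluster)}/consumer/{quote(group)}/{endpoint}"

# The working call, then one of the failing ones (group name is made up):
print(consumer_list_url("dv_kafka"))
print(consumer_detail_url("dv_kafka", "some-group", "lag"))
```

The puzzle in this issue is that the first URL returns the group while the second returns "not found" for the very same name.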

I am running Kafka 0.10.0.0; here is the config I am using.

[general]
pidfile="burrow.pid"
stdout-logfile="burrow.out"

[logging]
filename="logs/burrow.log"
level="info"
maxsize=100
maxbackups=30
maxage=10
use-localtime=false
use-compression=true

[zookeeper]
servers=[ "dv-kafka-01:2181", "dv-kafka-02:2181", "dv-kafka-03:2181" ]
timeout=6
root-path="/burrow_v1"

[client-profile.dv_kafka]
client-id="burrow-lagchecker"
kafka-version="0.10.0.0"

[cluster.dv_kafka]
class-name="kafka"
servers=[ "dv-kafka-01:9092", "dv-kafka-02:9092", "dv-kafka-03:9092", "dv-kafka-04:9092", "dv-kafka-05:9092", "dv-kafka-06:9092", "dv-kafka-07:9092", "dv-kafka-08:9092", "dv-kafka-09:9092", "dv-kafka-10:9092" ]
client-profile="dv_kafka"
topic-refresh=120
offset-refresh=30

[consumer.dv_kafka]
class-name="kafka"
cluster="dv_kafka"
servers=[ "dv-kafka-01:9092", "dv-kafka-02:9092", "dv-kafka-03:9092", "dv-kafka-04:9092", "dv-kafka-05:9092", "dv-kafka-06:9092", "dv-kafka-07:9092", "dv-kafka-08:9092", "dv-kafka-09:9092", "dv-kafka-10:9092" ]
client-profile="dv_kafka"
group-blacklist="^(console-consumer-|python-kafka-consumer-|quick-).*$"
group-whitelist=""
offsets-topic="__consumer_offsets"

[httpserver.default]
address=":8081"

carterdanko avatar Jan 09 '18 00:01 carterdanko

Having the exact same issue. Does anyone have input on this?

Odinodin avatar Jan 24 '18 09:01 Odinodin

Usually when you see something like this (a group exists, and then returns 404 on fetching status) it's because all of the offsets for the group are older than the expire-group time. Burrow does lazy expiration of this data only when the detail for the group is fetched (calling the consumer endpoint with either status, lag, or no qualifier, or by a notifier checking it).
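The expiration window described above is controlled in the storage module's config; a sketch, assuming the in-memory storage class and its key names (verify against your Burrow version):

```toml
[storage.default]
class-name="inmemory"
# Offsets older than this many seconds are lazily expired
# when the group's detail is fetched (default is one week).
expire-group=604800
```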

It's hard to check for this within Burrow (as there's no way to fetch the data that doesn't cause it to be expired), but you can turn on debug logging (either by starting it with debug logging or by changing the log level using the API endpoint). I suggest restarting Burrow with debug logging enabled via config. This will log detail about every single offset that is read. You can then go through the logs and look for the offsets for your group and see what their timestamps are. You'll also see a "purge expired consumer" message when the expiration is triggered.
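Changing the log level at runtime goes through Burrow's admin endpoint. The sketch below only builds the request without sending it; the /v3/admin/loglevel path and the {"level": ...} payload are assumptions about the API, so check them against your Burrow version before relying on this.

```python
import json
from urllib import request

def loglevel_request(base="http://localhost:8081", level="debug"):
    """Build (but do not send) a POST that raises Burrow's log level at runtime."""
    body = json.dumps({"level": level}).encode()
    return request.Request(f"{base}/v3/admin/loglevel", data=body,
                           headers={"Content-Type": "application/json"},
                           method="POST")

req = loglevel_request()
print(req.get_method(), req.full_url, req.data.decode())
```

Sending it would be `urllib.request.urlopen(req)`; restarting with `level="debug"` in the [logging] config achieves the same thing persistently.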

toddpalino avatar Jan 31 '18 15:01 toddpalino

@Odinodin Someone else I work with found the cause: Burrow was still reading all of the offsets from __consumer_offsets and had not finished when I was querying the lag/status endpoints for a consumer. That seems to explain why listing the consumers worked but Burrow did not yet know the metadata for those consumers; it was still working through that topic gathering everything.

carterdanko avatar Mar 20 '18 13:03 carterdanko

Hi, I have the same issue here. I'm trying to monitor Kafka consumers on our staging machine; below is the config file:

[general]
pidfile="./burrow.pid"
stdout-logfile="burrow.out"

[logging]
filename="log/burrow.log"
level="debug"
maxsize=100
maxbackups=30
maxage=10
use-localtime=false
use-compression=true

[zookeeper]
servers=[ "example.com:2181" ]
timeout=6

[client-profile.stage]
kafka-version="0.10.2"

[client-profile.production]
kafka-version="0.10.2"

[cluster.stage]
class-name="kafka"
servers=[ "example.com:9092" ]
client-profile="stage"
topic-refresh=120
offset-refresh=30
zookeeper-offset=true

[consumer.stage]
class-name="kafka"
cluster="stage"
servers=[ "example.com:9092" ]
client-profile="stage"
offsets-topic="__consumer_offsets"
#group-blacklist="^(console-consumer-|python-kafka-consumer-|quick-).*$"
group-blacklist=".*"
group-whitelist=""

[httpserver.default]
address=":18002"

I can also see all the consumers but can't get the lag for a consumer. Below is a snippet of the log:

{"level":"debug","ts":1524773696.0555763,"msg":"dropped","type":"module","coordinator":"storage","class":"inmemory","name":"default","worker":7,"cluster":"stage","consumer":"ca_raw","topic":"ca_raw","partition":1,"topic_partition_count":0,"offset":24,"timestamp":1484519101275,"owner":"","request":"StorageSetConsumerOffset","reason":"old offset"}
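That "old offset" drop is exactly the lazy expiration described earlier in the thread: the log line's own fields show how stale the committed offset was. A small sketch decoding the line quoted above (note "ts" is in seconds, "timestamp" in milliseconds):

```python
import json

# The "dropped ... old offset" line from the log above, verbatim:
line = ('{"level":"debug","ts":1524773696.0555763,"msg":"dropped","type":"module",'
        '"coordinator":"storage","class":"inmemory","name":"default","worker":7,'
        '"cluster":"stage","consumer":"ca_raw","topic":"ca_raw","partition":1,'
        '"topic_partition_count":0,"offset":24,"timestamp":1484519101275,'
        '"owner":"","request":"StorageSetConsumerOffset","reason":"old offset"}')

entry = json.loads(line)
# "ts" is when Burrow logged the drop (seconds since epoch); "timestamp" is
# when the offset was committed (milliseconds). The gap is the offset's age.
age_days = (entry["ts"] - entry["timestamp"] / 1000) / 86400
print(f"offset for {entry['consumer']} was {age_days:.0f} days old when dropped")
# → roughly 466 days: far past any reasonable expire-group window,
#   so every stored offset is discarded and the group reads as NOTFOUND.
```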

Also, this is the result I got from hitting the /lag endpoint:

{
  "error": false,
  "message": "consumer status returned",
  "request": {
    "host": "MacBook-Pro.local",
    "url": "/v3/kafka/stage/consumer/ca_raw/lag"
  },
  "status": {
    "cluster": "stage",
    "complete": 1,
    "group": "ca_raw",
    "maxlag": null,
    "partition_count": 0,
    "partitions": [],
    "status": "NOTFOUND",
    "totallag": 0
  }
}
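The telltale combination in a response like the one above is status "NOTFOUND" with zero partitions: Burrow answered, but holds no offsets for the group. A small sketch of detecting that case (the payload is the one quoted above, rebuilt as a dict):

```python
# The /lag response quoted above, as Python data.
response = {
    "error": False,
    "message": "consumer status returned",
    "request": {"host": "MacBook-Pro.local",
                "url": "/v3/kafka/stage/consumer/ca_raw/lag"},
    "status": {"cluster": "stage", "complete": 1, "group": "ca_raw",
               "maxlag": None, "partition_count": 0, "partitions": [],
               "status": "NOTFOUND", "totallag": 0},
}

def group_missing(resp):
    """True when Burrow answered but has no partition data for the group,
    i.e. the symptom discussed in this issue."""
    status = resp.get("status", {})
    return status.get("status") == "NOTFOUND" and status.get("partition_count") == 0

print(group_missing(response))  # → True
```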

Let me know if you need more information, thx in advance!

Enkri avatar Apr 26 '18 17:04 Enkri

Same issue here. Burrow runs just fine for a couple of days and then this exact problem starts happening. Restarting solves the issue temporarily for us, but a fix would be great. Edit: now even restarting doesn't solve the issue anymore :/

IngaFeick avatar Jul 02 '18 13:07 IngaFeick

I'm also seeing this issue for offsets that should be fairly recent

dtboctor avatar Jul 12 '19 22:07 dtboctor

Just found the same issue after Burrow restarted. If I log into Kafka, the consumers are working as expected and their lags are being worked down, but Burrow is unable to refresh and get the group info.

mtbbiker avatar Nov 21 '20 13:11 mtbbiker

Any update or workaround for this issue? It's been happening most of the time.

javiersoto15 avatar Mar 24 '21 16:03 javiersoto15

We've faced the same issue. The Kafka consumers are active and working fine, but they are not listed in Burrow's consumer list.

debMan avatar May 23 '21 06:05 debMan