Burrow sees consumer groups, and then doesn't

Open edgan opened this issue 5 years ago • 9 comments

I am running Burrow 1.1.0 with Kafka 0.11 and ZooKeeper 3.4.5. I can run a "curl http://127.0.0.1:8000/v3/kafka/local/consumer" right after I start Burrow and see all the groups. But if I run a "curl http://127.0.0.1:8000/v3/kafka/local/consumer/namehere/status", I get back:

{"error":false,"message":"consumer status returned","status":{"cluster":"local","group":"namehere","status":"NOTFOUND","complete":1,"partitions":[],"partition_count":0,"maxlag":null,"totallag":0},"request":{"url":"/v3/kafka/local/consumer/namehere/status","host":"hostname-01"}}

Then if I run "curl http://127.0.0.1:8000/v3/kafka/local/consumer" again, I get a list without that one.

Even weirder, this works fine in a staging environment but fails in prod, even though both run the same versions of Kafka, Burrow, and ZooKeeper.

edgan avatar Sep 21 '18 20:09 edgan

Logs:

{"level":"info","ts":1537562399.1103215,"msg":"stopping","type":"coordinator","name":"consumer"}
{"level":"info","ts":1537562399.1103935,"msg":"stopping","type":"module","coordinator":"consumer","class":"kafka","name":"local"}
{"level":"info","ts":1537562399.11174,"msg":"stopping","type":"module","coordinator":"consumer","class":"kafka_zk","name":"local_zk"}
{"level":"info","ts":1537562399.1142273,"msg":"Recv loop terminated: err=EOF","type":"module","coordinator":"consumer","class":"kafka_zk","name":"local_zk"}
{"level":"info","ts":1537562399.114262,"msg":"Send loop terminated: err=<nil>","type":"module","coordinator":"consumer","class":"kafka_zk","name":"local_zk"}
{"level":"info","ts":1537562399.1172094,"msg":"stopping","type":"coordinator","name":"cluster"}
{"level":"info","ts":1537562399.1172383,"msg":"stopping","type":"module","coordinator":"cluster","class":"kafka","name":"local"}
{"level":"info","ts":1537562399.1172647,"msg":"stopping","type":"coordinator","name":"notifier"}
{"level":"info","ts":1537562399.1172776,"msg":"shutdown","type":"coordinator","name":"httpserver"}
{"level":"info","ts":1537562399.1173916,"msg":"stopping","type":"coordinator","name":"evaluator"}
{"level":"info","ts":1537562399.1174622,"msg":"stopping","type":"module","coordinator":"evaluator","class":"caching","name":"default"}
{"level":"info","ts":1537562399.11998,"msg":"stopping","type":"coordinator","name":"storage"}
{"level":"info","ts":1537562399.120022,"msg":"stopping","type":"module","coordinator":"storage","class":"inmemory","name":"default"}
{"level":"info","ts":1537562399.120106,"msg":"stopping","type":"coordinator","name":"zookeeper"}
{"level":"info","ts":1537562399.1233048,"msg":"Recv loop terminated: err=EOF","type":"coordinator","name":"zookeeper"}
{"level":"info","ts":1537562399.1233966,"msg":"Send loop terminated: err=<nil>","type":"coordinator","name":"zookeeper"}
{"level":"info","ts":1537562399.1533234,"msg":"Started Burrow"}
{"level":"info","ts":1537562399.153498,"msg":"configuring","type":"coordinator","name":"zookeeper"}
{"level":"info","ts":1537562399.1536365,"msg":"configuring","type":"coordinator","name":"storage"}
{"level":"info","ts":1537562399.1537273,"msg":"configuring","type":"module","coordinator":"storage","class":"inmemory","name":"default"}
{"level":"info","ts":1537562399.1539078,"msg":"configuring","type":"coordinator","name":"evaluator"}
{"level":"info","ts":1537562399.1539931,"msg":"configuring","type":"module","coordinator":"evaluator","class":"caching","name":"default"}
{"level":"info","ts":1537562399.1540968,"msg":"configuring","type":"coordinator","name":"httpserver"}
{"level":"info","ts":1537562399.1546085,"msg":"configuring","type":"coordinator","name":"notifier"}
{"level":"info","ts":1537562399.154642,"msg":"configuring","type":"coordinator","name":"cluster"}
{"level":"info","ts":1537562399.1547153,"msg":"configuring","type":"module","coordinator":"cluster","class":"kafka","name":"local"}
{"level":"info","ts":1537562399.1548803,"msg":"configuring","type":"coordinator","name":"consumer"}
{"level":"info","ts":1537562399.1549742,"msg":"configuring","type":"module","coordinator":"consumer","class":"kafka_zk","name":"local_zk"}
{"level":"info","ts":1537562399.1552308,"msg":"configuring","type":"module","coordinator":"consumer","class":"kafka","name":"local"}
{"level":"info","ts":1537562399.1554272,"msg":"starting","type":"coordinator","name":"zookeeper"}
{"level":"info","ts":1537562399.1559339,"msg":"Connected to 10.2.30.121:2181","type":"coordinator","name":"zookeeper"}
{"level":"info","ts":1537562399.1586473,"msg":"Authenticated: id=100470660053855640, timeout=6000","type":"coordinator","name":"zookeeper"}
{"level":"info","ts":1537562399.1586776,"msg":"Re-submitting `0` credentials after reconnect","type":"coordinator","name":"zookeeper"}
{"level":"info","ts":1537562399.1622052,"msg":"starting","type":"coordinator","name":"storage"}
{"level":"info","ts":1537562399.1622314,"msg":"starting","type":"module","coordinator":"storage","class":"inmemory","name":"default"}
{"level":"info","ts":1537562399.162305,"msg":"starting","type":"coordinator","name":"evaluator"}
{"level":"info","ts":1537562399.1623225,"msg":"starting","type":"module","coordinator":"evaluator","class":"caching","name":"default"}
{"level":"info","ts":1537562399.1623378,"msg":"starting","type":"coordinator","name":"httpserver"}
{"level":"info","ts":1537562399.1625447,"msg":"started listener","type":"coordinator","name":"httpserver","listener":"[::]:8000"}
{"level":"info","ts":1537562399.162597,"msg":"starting","type":"coordinator","name":"notifier"}
{"level":"info","ts":1537562399.162623,"msg":"starting","type":"coordinator","name":"cluster"}
{"level":"info","ts":1537562399.1626327,"msg":"starting","type":"module","coordinator":"cluster","class":"kafka","name":"local"}
{"level":"info","ts":1537562399.1970122,"msg":"starting","type":"coordinator","name":"consumer"}
{"level":"info","ts":1537562399.197053,"msg":"starting","type":"module","coordinator":"consumer","class":"kafka_zk","name":"local_zk"}
{"level":"info","ts":1537562399.1983898,"msg":"Connected to 10.2.30.114:2181","type":"module","coordinator":"consumer","class":"kafka_zk","name":"local_zk"}
{"level":"info","ts":1537562399.2010047,"msg":"Authenticated: id=172350794107511546, timeout=30000","type":"module","coordinator":"consumer","class":"kafka_zk","name":"local_zk"}
{"level":"info","ts":1537562399.201041,"msg":"Re-submitting `0` credentials after reconnect","type":"module","coordinator":"consumer","class":"kafka_zk","name":"local_zk"}
{"level":"info","ts":1537562399.2022557,"msg":"starting","type":"module","coordinator":"consumer","class":"kafka","name":"local"}
{"level":"info","ts":1537562399.20591,"msg":"starting consumers","type":"module","coordinator":"consumer","class":"kafka","name":"local","topic":"__consumer_offsets","count":50}
{"level":"warn","ts":1537562399.2137485,"msg":"failed to read offset","type":"module","coordinator":"consumer","class":"kafka_zk","name":"local_zk","group":"secor_backup","topic":"connection","partition":0,"error":"zk: node does not exist"}
{"level":"warn","ts":1537562399.2176504,"msg":"failed to read offset","type":"module","coordinator":"consumer","class":"kafka_zk","name":"local_zk","group":"secor_backup","topic":"namehere4","partition":1,"error":"zk: node does not exist"}
{"level":"warn","ts":1537562399.2189863,"msg":"failed to read offset","type":"module","coordinator":"consumer","class":"kafka_zk","name":"local_zk","group":"secor_backup","topic":"namehere4","partition":2,"error":"zk: node does not exist"}
{"level":"warn","ts":1537562399.232973,"msg":"failed to read offset","type":"module","coordinator":"consumer","class":"kafka_zk","name":"local_zk","group":"secor_backup","topic":"namehere4","partition":6,"error":"zk: node does not exist"}
{"level":"info","ts":1537562399.2663305,"msg":"starting evaluations","type":"coordinator","name":"notifier"}
{"level":"info","ts":1537562418.18032,"msg":"cluster or consumer not found","type":"module","coordinator":"evaluator","class":"caching","name":"default","cluster":"local","consumer":"namehere","showall":false}
{"level":"info","ts":1537562438.8329,"msg":"cluster or consumer not found","type":"module","coordinator":"evaluator","class":"caching","name":"default","cluster":"local","consumer":"namehere3","showall":false}
{"level":"info","ts":1537562450.3536875,"msg":"cluster or consumer not found","type":"module","coordinator":"evaluator","class":"caching","name":"default","cluster":"local","consumer":"namehere2","showall":false}
{"level":"info","ts":1537562456.9158769,"msg":"cluster or consumer not found","type":"module","coordinator":"evaluator","class":"caching","name":"default","cluster":"local","consumer":"namehere","showall":false}

edgan avatar Sep 21 '18 20:09 edgan

I have tried playing with the whitelist/blacklist, with no effect. I have also diffed the staging and prod configuration files, and the only difference is the IP addresses. burrow.toml:

pidfile="/run/burrow/burrow.pid"
stdout-logfile="burrow.out"
client-id="burrow-lagchecker"

[logging]
filename="/var/log/burrow/burrow.log"
level="info"
maxsize=100
maxbackups=30
maxage=10
use-localtime=false
use-compression=true

[zookeeper]
servers=[ "10.2.30.200:2181","10.2.30.114:2181","10.2.30.121:2181" ]
timeout=6
lock-path="/burrow/notifier"
root-path="/burrow"

[client-profile.test]
client-id="burrow-test"
kafka-version="0.11.0"

[cluster.local]
class-name="kafka"
servers=[ "10.2.30.200:9092","10.2.30.114:9092","10.2.30.121:9092" ]
client-profile="test"
topic-refresh=120
offset-refresh=30

[consumer.local]
class-name="kafka"
cluster="local"
servers=[ "10.2.30.200:9092","10.2.30.114:9092","10.2.30.121:9092" ]
client-profile="test"
group-blacklist="^(console-consumer-|python-kafka-consumer-|quick-|ureplicator-).*$"
group-whitelist=""

[consumer.local_zk]
class-name="kafka_zk"
cluster="local"
servers=[ "10.2.30.200:2181","10.2.30.114:2181","10.2.30.121:2181" ]
zookeeper-path="/kafka1"
zookeeper-timeout=30
group-blacklist="^(console-consumer-|python-kafka-consumer-|quick-|ureplicator-).*$"
group-whitelist=""

[httpserver.default]
address=":8000"

[storage.default]
class-name="inmemory"
workers=20
intervals=15
expire-group=604800
min-distance=1

edgan avatar Sep 21 '18 20:09 edgan

Seeing this exact behavior.

zapient avatar Feb 11 '19 05:02 zapient

I'm getting the same behavior: Burrow just stopped detecting a few consumers/topics, although they are actively committing offsets. I've even tried reinstalling it entirely, and nothing changed. Burrow is also not reading any consumer ID/"owner".

@toddpalino - do you know what could be happening?

lmallonee avatar Mar 18 '19 20:03 lmallonee

We have been seeing this behavior as well. I finally had some time to do some research. We are using Confluent 5.1.2 and I have burrow configured as kafka-version="2.1.0". I also have the latest code from Burrow.

In my research, I found that the metadata request from sarama was not returning data at least some of the time. In fact, it failed every time while I was debugging the code, but it must succeed at least some of the time for Burrow to work at all. From what I can tell, this request is used to populate the topics in the in-memory storage. I also noticed that sarama checks the Kafka version to decide which metadata request version to send. See the code here.

I figured I would test to see if the lower version would return data for me, so I set my burrow client profile configuration to kafka-version="0.11.0.2". I also set the start-latest configuration to true because I only want to know about active consumers on restart/startup. I don't know if it makes a difference for others, but I did want to mention it as something I changed.

Between those two changes, all of my active consumer groups are reporting properly and are not disappearing from Burrow.
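As a rough sketch, the two changes would look something like this in burrow.toml. The section names and server addresses are borrowed from the burrow.toml posted earlier in this thread and are only placeholders, and I am assuming start-latest goes in the Kafka consumer module section rather than in the client profile:

```toml
# Client profile: report an older Kafka version so that sarama picks the
# older metadata request version (value taken from the comment above).
[client-profile.test]
client-id="burrow-test"
kafka-version="0.11.0.2"

# Kafka consumer module: begin reading __consumer_offsets from the latest
# offset, so only groups that commit after Burrow starts are tracked.
# Section name and servers are placeholders copied from the earlier config.
[consumer.local]
class-name="kafka"
cluster="local"
servers=[ "10.2.30.200:9092","10.2.30.114:9092","10.2.30.121:9092" ]
client-profile="test"
start-latest=true
```

Whether the version change alone is enough probably depends on the broker and sarama versions involved, so treat this as a starting point, not a confirmed fix.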

KyleCruz avatar Jun 12 '19 04:06 KyleCruz

Kafka Version: 0.10.2

I'm getting the below error:

{"level":"warn","ts":1563462481.399035,"msg":"failed to get zk lock","type":"coordinator","name":"notifier","error":"zk: zookeeper is closing"}
{"level":"error","ts":1563462497.3998742,"msg":"failed to list groups","type":"module","coordinator":"consumer","class":"kafka_zk","name":"local_zk","error":"zk: node does not exist"}

nutanix-bigbasket avatar Jul 18 '19 15:07 nutanix-bigbasket

Kafka 0.12

{"level":"info","ts":1564469520.822827,"msg":"starting","type":"module","coordinator":"consumer","class":"kafka_zk","name":"local_zk"}
{"level":"info","ts":1564469520.825842,"msg":"Connected to [::1]:2180","type":"module","coordinator":"consumer","class":"kafka_zk","name":"local_zk"}
{"level":"info","ts":1564469520.827925,"msg":"Authenticated: id=72057722009747468, timeout=30000","type":"module","coordinator":"consumer","class":"kafka_zk","name":"local_zk"}
{"level":"info","ts":1564469520.827997,"msg":"Re-submitting `0` credentials after reconnect","type":"module","coordinator":"consumer","class":"kafka_zk","name":"local_zk"}
{"level":"error","ts":1564469520.829778,"msg":"failed to list groups","type":"module","coordinator":"consumer","class":"kafka_zk","name":"local_zk","error":"**zk: node does not exist**"}

adivardhan avatar Jul 30 '19 06:07 adivardhan

I see this behavior when using kafka-console-consumer.sh, but kafka-consumer-perf-test.sh works.

knnnoppy avatar Oct 24 '19 10:10 knnnoppy

Any new ideas or solutions for this issue?

mtbbiker avatar Nov 21 '20 13:11 mtbbiker