
Rebalancing problem with Faust Streaming consumer

Open arcanjo45 opened this issue 2 years ago • 1 comments

Checklist

  • [x] I have included information about relevant versions
  • [x] I have verified that the issue persists when using the master branch of Faust.

Steps to reproduce

Hello everyone, I hope this finds you well!

I'm facing an odd situation when using Faust Streaming in my consumer app. I have a Kafka consumer that connects to a Kafka instance running on GCP in our dev environment. Sometimes, when the Kafka instance restarts or goes down due to a lack of resources, my consumer tries to rebalance and gets stuck in a loop, logging the following messages:

[2023-12-20 10:23:47,912] [11] [INFO] Discovered coordinator 2 for group myapp-dev-processor 
[2023-12-20 10:23:47,912] [11] [INFO] (Re-)joining group myapp-dev-processor 
[2023-12-20 10:23:47,915] [11] [WARNING] Marking the coordinator dead (node 2)for group myapp-dev-processor. 

This happens frequently, but only in our dev environment. We are investigating the root cause and how to handle it, so that if it occurs in prod we have a way to act fast. We know the consumer connects to our Kafka instance successfully, but then this warning appears and the consumer stays stuck in an endless loop. We searched for error logs on our Kafka instances but found nothing, so we think this may be a problem within the library.
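To act fast if this shows up in prod, one option is to detect the loop from the consumer's own log output. The following is only a minimal sketch, assuming the log format shown above; the function names, the threshold, and the window length are hypothetical choices and not part of Faust or aiokafka.

```python
import re
from collections import deque
from datetime import datetime, timedelta

# Matches the leading "[2023-12-20 10:23:47,915]" timestamp of each log line.
LINE_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})\]")
DEAD_MARKER = "Marking the coordinator dead"


def parse_timestamp(line):
    """Return the timestamp at the start of a log line, or None if absent."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    base = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
    return base + timedelta(milliseconds=int(match.group(2)))


def looks_stuck(lines, threshold=5, window_seconds=60.0):
    """Return True if `threshold` coordinator-dead warnings fall within one window.

    Both defaults are arbitrary (an assumption); tune them to your environment.
    """
    window = timedelta(seconds=window_seconds)
    recent = deque()
    for line in lines:
        if DEAD_MARKER not in line:
            continue
        ts = parse_timestamp(line)
        if ts is None:
            continue
        recent.append(ts)
        # Drop warnings that are older than the sliding window.
        while recent and ts - recent[0] > window:
            recent.popleft()
        if len(recent) >= threshold:
            return True
    return False
```

A check like this could feed an alert or a liveness probe that restarts only the consumer, rather than redeploying everything.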

The only fix we have found so far is to redeploy both the Kafka instance on GCP and our consumer project, which we can't do in production if this situation occurs. I don't have detailed knowledge of the project, but the problem seems related to the topic the app creates to handle task distribution, because the topic involved is not the one we consume data from but the one created by the app.

Does anyone have any idea why this may be happening? We searched the project repo, ChatGPT, and Stack Overflow, but without any luck.

Expected behavior

Consumer should rebalance partitions normally.

Actual behavior

Full traceback

Versions

faust-aioeventlet==0.6

faust-streaming==0.10.14

confluent-kafka==2.1.1

Python 3.12

arcanjo45 avatar Dec 20 '23 11:12 arcanjo45

Looks like this is a warning from aiokafka: https://github.com/aio-libs/aiokafka/blob/c4b604062192d005cdcefb79eb6dbc717764c700/aiokafka/consumer/group_coordinator.py#L531-L535

There could be a variety of reasons this is happening. Can you share more detail about this error? Can you try tweaking the connection idle, session, and request timeouts?
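As a rough starting point for the timeout tweaks, the sketch below builds keyword arguments for `faust.App` using the `broker_session_timeout`, `broker_heartbeat_interval`, and `broker_request_timeout` settings (all in seconds). The helper name, the default values, and the two sanity checks are assumptions made for illustration: the one-third heartbeat ratio follows the usual Kafka guidance, and keeping the session timeout below the request timeout is a conservative choice. No connection idle setting is included because I'm not certain Faust exposes one.

```python
def consumer_timeouts(session_timeout=60.0, heartbeat_interval=None, request_timeout=90.0):
    """Build Faust broker timeout settings (hypothetical helper, values in seconds)."""
    if heartbeat_interval is None:
        # Kafka guidance: heartbeat no higher than one third of the session timeout.
        heartbeat_interval = session_timeout / 3
    if heartbeat_interval > session_timeout / 3:
        raise ValueError("heartbeat interval should be at most 1/3 of the session timeout")
    if session_timeout >= request_timeout:
        # Conservative assumption: keep the session timeout below the request timeout.
        raise ValueError("session timeout should be below the request timeout")
    return {
        "broker_session_timeout": session_timeout,
        "broker_heartbeat_interval": heartbeat_interval,
        "broker_request_timeout": request_timeout,
    }


# Usage (requires faust-streaming; the broker URL is a placeholder):
#   import faust
#   app = faust.App(
#       "myapp-dev-processor",
#       broker="kafka://localhost:9092",
#       **consumer_timeouts(session_timeout=30.0, request_timeout=60.0),
#   )
```

Changing one value at a time and watching whether the "Marking the coordinator dead" warning still repeats should make it easier to tell which timeout, if any, is involved.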

dada-engineer avatar Apr 26 '25 18:04 dada-engineer