
Partitions get revoked and assigned multiple times on another consumer when one consumer shuts down with the cooperative-sticky assignment strategy

Open pratikthakkar24 opened this issue 3 years ago • 4 comments

Read the FAQ first: https://github.com/edenhill/librdkafka/wiki/FAQ

Do NOT create issues for questions, use the discussion forum: https://github.com/edenhill/librdkafka/discussions

Description

<I have a single topic with 3 partitions and 3 consumers. Every consumer registers my rebalance callback via "rebalance_cb" and calls subscribe(), so I am not manually assigning any partitions to the consumers. I am using cooperative-sticky as the partition assignment strategy.

Now let's assume that, after the producer (a single producer writing to all partitions) and the consumers start up, Kafka has assigned the partitions as follows:

  • Partition 0 --> Consumer 1
  • Partition 1 --> Consumer 2
  • Partition 2 --> Consumer 3

Now when I shut down Consumer 1, cooperative-sticky should ideally assign partition 0 to one of the other two consumers. That does happen in my case.
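To make the expected behavior concrete, here is a minimal toy model in Python of what a sticky reassignment should look like when one member leaves. It is only a sketch of the idea, not librdkafka's actual assignor; the function names `reassign_after_leave` and `revoked_from` and the consumer labels are made up for illustration.

```python
def reassign_after_leave(assignment, leaving):
    """Toy model: only the leaving member's partitions move; others keep theirs."""
    remaining = {m: list(p) for m, p in assignment.items() if m != leaving}
    for partition in sorted(assignment[leaving]):
        # Hand each orphaned partition to the member that currently owns the fewest.
        target = min(sorted(remaining), key=lambda m: len(remaining[m]))
        remaining[target].append(partition)
    return remaining


def revoked_from(member, before, after):
    """Partitions a surviving member owned before but no longer owns after."""
    return set(before[member]) - set(after[member])


# Hypothetical labels matching the scenario described above.
before = {"consumer-1": [0], "consumer-2": [1], "consumer-3": [2]}
after = reassign_after_leave(before, "consumer-1")

print(after)
print(revoked_from("consumer-2", before, after))
```

In this model the surviving consumers never lose a partition they already own, which is the behavior I expected from cooperative-sticky.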

Let's say Partition 0 gets assigned to Consumer 2. But after it is assigned, both partitions (0 and 1) get revoked from Consumer 2 and are then assigned to Consumer 2 again. This revoke-and-assign cycle for both partitions repeats approximately 7 to 8 times, after which no further revocations or assignments happen.

While this revocation and assignment of partitions 0 and 1 on Consumer 2 is ongoing, that consumer also keeps consuming messages. Some of the messages consumed during this process are received a second time, as duplicates, by that consumer. I suppose this is because committing those messages to the broker failed while the partitions were being revoked and reassigned. (I commit manually after each message.)
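A small sketch of why failed commits would show up as duplicates, assuming the consumer resumes a re-assigned partition from its last committed offset and that the committed offset means "next offset to read" (the usual Kafka convention). The function name and the example offsets are hypothetical, not taken from my logs.

```python
def redelivered_offsets(last_committed, next_to_consume):
    """Offsets read a second time if consumption restarts at the committed position."""
    return list(range(last_committed, next_to_consume))


# Hypothetical numbers: offsets 0..9 were consumed (position is 10),
# but the last commit that reached the broker recorded position 7.
duplicates = redelivered_offsets(last_committed=7, next_to_consume=10)
print(duplicates)
```

Every offset between the last successful commit and the consumer's in-memory position is delivered again once the partition is re-assigned.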

My question is whether this repeated revocation and assignment of both partitions on Consumer 2 (upon shutdown of Consumer 1) is normal.

I am not able to reproduce the behavior every time I shut down Consumer 1; sometimes the repeated revocation and assignment happens and sometimes it doesn't.

Kindly shed some light.>

How to reproduce

<your steps how to reproduce goes here, or remove section if not relevant> IMPORTANT: Always try to reproduce the issue on the latest released version (see https://github.com/edenhill/librdkafka/releases), if it can't be reproduced on the latest version the issue has been fixed.

Checklist

IMPORTANT: We will close issues where the checklist has not been completed.

Please provide the following information:

  • [1.9.0.1] librdkafka version (release number or git tag): <REPLACE with e.g., v0.10.5 or a git sha. NOT "latest" or "current">
  • [ ] Apache Kafka version: <REPLACE with e.g., 0.10.2.3>
  • [2.8] librdkafka client configuration: <REPLACE with e.g., message.timeout.ms=123, auto.reset.offset=earliest, ..>
  • [Windows Server 2012 R2] Operating system: <REPLACE with e.g., Centos 5 (x64)>
  • [ ] Provide logs (with debug=.. as necessary) from librdkafka
  • [ ] Provide broker log excerpts
  • [ ] Critical issue

pratikthakkar24 Jun 29 '22 14:06

I observe the same behaviour with v1.9.0. Partitions that should stay assigned are revoked when one consumer closes.

mensfeld Sep 02 '22 08:09

@pratikthakkar24 I was able to figure out the reason on my side and was able to reproduce it.

Maybe it's going to be helpful to you:

I was seeing this because, during the rebalance but before the rebalance callback itself ran, I was calling rd_kafka_commit. It still affected the process even though it was called with the async flag. Once I removed that call completely, cooperative-sticky worked as expected.

ref https://github.com/karafka/karafka/pull/1050/files#diff-a9d9ccd2fc27b8adc58d028232236b82563c12e7b46ded7eca808ea9a1ec7baeL124

mensfeld Oct 07 '22 16:10

The problem is described here https://github.com/confluentinc/librdkafka/issues/4059
In short: manual offset commits during a rebalance lead to a group generation ID error and trigger a rebalance of the consumer that sent the commit. Use auto commit instead.
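A rough way to picture the generation ID error is the toy coordinator below. It assumes only that each rebalance bumps the group generation and that a commit carrying an older generation is rejected. The class and method names are invented for illustration; `ILLEGAL_GENERATION` is the Kafka protocol error name for a stale generation, but this is not the broker's real logic.

```python
class ToyCoordinator:
    """Toy group coordinator: tracks one generation counter and committed offsets."""

    def __init__(self):
        self.generation = 1
        self.committed = {}

    def rebalance(self):
        # Every completed rebalance starts a new group generation.
        self.generation += 1
        return self.generation

    def commit(self, member_generation, partition, offset):
        # A commit tagged with a stale generation is rejected, not stored.
        if member_generation != self.generation:
            return "ILLEGAL_GENERATION"
        self.committed[partition] = offset
        return "OK"


coordinator = ToyCoordinator()
member_generation = coordinator.generation

first = coordinator.commit(member_generation, partition=1, offset=42)

# The group rebalances while the member still holds the old generation.
coordinator.rebalance()
second = coordinator.commit(member_generation, partition=1, offset=43)

print(first, second, coordinator.committed)
```

In the real client the rejected commit is what makes the consumer rejoin, and the rejoin is itself another rebalance, which would match the repeated revoke and assign cycles reported above.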

mironovdm Dec 25 '22 00:12

Closing since, based on the previous comment, the issue is captured in https://github.com/confluentinc/librdkafka/issues/4059

If this is indeed different, please reopen or make a new issue.

nhaq-confluent Feb 27 '24 13:02