sarama

Kafka partitions leaderless issue (during metadata refresh) with topics that have been deleted

Open mohilkhare21 opened this issue 1 year ago • 7 comments

Description

We are seeing an issue with the Sarama Kafka Go client: when topics are deleted, the client gets stuck in a metadata refresh loop (because of the retry mechanism) and keeps flooding the logs with the following:

SARAMA-DEBUG 03:18:52 client/metadata fetching metadata for [list of topics] from broker
SARAMA-DEBUG 03:18:53 client/metadata found some partitions to be leaderless
SARAMA-DEBUG 03:18:53 client/metadata retrying after 2000ms... (40 attempts remaining)

It can be reproduced very easily by deleting a Kafka topic while the client is still running. The only way to recover is to restart the client. However, in production we can't afford to restart clients that are in the middle of processing other messages.

Versions
Sarama: v1.41.2 (client: sk.V2_8_0_0)
Kafka: 2.8
Go: 1.19
Configuration

Our Sarama client configuration snippet:

clientConfig.Version = sk.V2_8_0_0
clientConfig.Admin.Timeout = 30 * time.Second
clientConfig.Net.KeepAlive = 4 * time.Minute
clientConfig.Net.SASL.Enable = false
clientConfig.Net.SASL.Handshake = false
clientConfig.Consumer.Return.Errors = true
clientConfig.Producer.RequiredAcks = sk.WaitForLocal
clientConfig.Producer.Retry.Max = 20
clientConfig.Producer.Retry.Backoff = 200 * time.Millisecond
clientConfig.ClientID = msgBusConfig.ClientID
clientConfig.Metadata.Retry.Backoff = time.Millisecond * 2000
clientConfig.Metadata.Retry.Max = 50
clientConfig.Metadata.RefreshFrequency = 5 * time.Minute
clientConfig.Metadata.Full = false // prevents fetching the metadata of all topics
clientConfig.Consumer.Group.Rebalance.Retry.Max = 10
clientConfig.Consumer.Group.Rebalance.Retry.Backoff = 5 * time.Second
Additional Context

We looked at the source code to understand how RefreshMetadata and the retry mechanism behave. It appears that the client maintains an array of topics that drives its refreshMetadata cycle. However, it currently doesn't seem to remove a topic from that internal array when the topic is no longer available on the broker.

We think that "func updateMetadata" needs to handle "ErrInvalidTopic" differently: apart from returning the error, it should also remove that particular topic from its topics list.

mohilkhare21 avatar Oct 18 '23 18:10 mohilkhare21

I had a quick play around, and I think this behavior is straightforward to reproduce. I used Kafka 3.6.0 (in ZK mode, with auto.create.topics.enable=false) and the consumer group example from Sarama v1.41.3. Steps:

  1. Create two topics (topic1 and topic2)
  2. Start the sample: ./consumegroup -brokers localhost:9092 -topics topic1,topic2 -group testgroup -verbose
  3. Delete topic2

@mohilkhare21 - is this a reasonable approximation of what your application does?

From my initial investigation, I disagree with the proposal that Sarama should handle ErrInvalidTopic by removing the topic from the list of topics that the client makes metadata requests for. I think this would prevent Sarama from starting to consume from the topic were it to be re-created, which is the current behavior.

While verbose, the detail in the debug logging does seem reasonable. Tracking when metadata requests are made, and their results, is helpful in debugging other kinds of problems, although I appreciate it is unneeded in this particular case.

Do you need debug-level logging to be enabled all the time? It looks like you are specifically registering a debug logger - could this be disabled by default and only switched on when you have a specific problem?
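One way to do that, sketched below using only the standard library: a logger that drops messages until it is switched on at runtime. The type switchableLogger and the helper captureDemo are hypothetical names made up for this example. The assumption is that a value with Print, Printf, and Println methods can be registered as the client's debug logger (for example by assigning it to sarama.DebugLogger, if your Sarama version exposes that variable).

```go
package main

import (
	"bytes"
	"fmt"
	"log"
	"sync/atomic"
)

// switchableLogger forwards to an underlying *log.Logger only while enabled.
// It is a hypothetical helper, not part of Sarama.
type switchableLogger struct {
	enabled atomic.Bool // safe to flip from another goroutine
	out     *log.Logger
}

func (l *switchableLogger) SetEnabled(on bool) { l.enabled.Store(on) }

func (l *switchableLogger) Print(v ...interface{}) {
	if l.enabled.Load() {
		l.out.Print(v...)
	}
}

func (l *switchableLogger) Printf(format string, v ...interface{}) {
	if l.enabled.Load() {
		l.out.Printf(format, v...)
	}
}

func (l *switchableLogger) Println(v ...interface{}) {
	if l.enabled.Load() {
		l.out.Println(v...)
	}
}

// captureDemo logs one line while disabled and one while enabled, and
// returns whatever actually reached the underlying writer.
func captureDemo() string {
	var buf bytes.Buffer
	dbg := &switchableLogger{out: log.New(&buf, "SARAMA-DEBUG ", 0)}
	// Assumption: a real application would register dbg with the client
	// library at this point instead of calling it directly.
	dbg.Println("dropped while disabled")
	dbg.SetEnabled(true)
	dbg.Println("client/metadata retrying after 2000ms...")
	return buf.String()
}

func main() {
	fmt.Print(captureDemo())
}
```

SetEnabled could be wired to a signal handler or an admin endpoint, so that debug output can be turned on for a specific investigation without restarting the client.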

prestona avatar Oct 22 '23 20:10 prestona

Hello @prestona ,

Thanks for looking into this issue. Yes, you are right with the use case that we have here in our environment.

I have a question regarding "I think this would prevent Sarama from starting to consume from the topic were it to be re-created" .

When a topic, whether the same or different, is created, will Sarama repopulate its list of topics?

Thanks

mohilkhare21 avatar Oct 24 '23 20:10 mohilkhare21

While I was playing around with ./consumegroup -brokers localhost:9092 -topics topic1,topic2 -group testgroup -verbose, I saw two behaviors:

If topic1 exists, but topic2 doesn't exist at the time the example app is run - then I get debug log lines similar to the ones that you are seeing. When the "attempts remaining" hits zero, client.Consume returns an error and the example code panics here.

If both topics exist when the app is started, but I subsequently delete topic2 - then the app will run indefinitely. It generates the same debug logs for the metadata, but when the attempts remaining hits zero it gets reset back up to the value in Config.Metadata.Retry.Max. If I then re-create a topic called topic2 the debug logging stops, and the example app starts to consume from the newly created topic2.

prestona avatar Oct 30 '23 13:10 prestona

That's right, @prestona. But in our case, after deleting a topic we don't re-create it, so we keep getting the logs you described.

mohilkhare21 avatar Oct 31 '23 23:10 mohilkhare21

Thank you for taking the time to raise this issue. However, it has not had any activity on it in the past 90 days and will be closed in 30 days if no updates occur. Please check if the main branch has already resolved the issue since it was raised. If you believe the issue is still valid and you would like input from the maintainers then please comment to ask for it to be reviewed.

github-actions[bot] avatar Feb 02 '24 04:02 github-actions[bot]

@mohilkhare21 I'm not sure that there's much more we can do in Sarama for this.

As noted above, when you start a client and ask it to create/join a consumer group by calling Consume(...), you provide the list of topics that you want the consumer group to consume from. At startup it checks that those topics do indeed exist and then creates+joins the group. It implicitly expects that those topics will exist for the lifetime of the consumer group.

If you want to be able to delete topic(s) that you'll never re-create and which are part of active consumer groups, you'd really need those consumer groups to be restarted and only call Consume(...) with the topic(s) that are still going to exist.
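A rough sketch of that restart step, written against stand-ins rather than the real API: consumeFunc plays the role of the consumer group's Consume(...) call, and listTopics plays the role of whatever the application uses to ask the cluster which topics exist. Both, along with existingOnly and consumeSurvivors, are hypothetical names for illustration only.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// consumeFunc stands in for a consumer group's Consume call (hypothetical
// signature, simplified for this sketch).
type consumeFunc func(ctx context.Context, topics []string) error

// existingOnly keeps the wanted topics that appear in existing,
// preserving the order of wanted.
func existingOnly(wanted, existing []string) []string {
	known := make(map[string]struct{}, len(existing))
	for _, t := range existing {
		known[t] = struct{}{}
	}
	var keep []string
	for _, t := range wanted {
		if _, ok := known[t]; ok {
			keep = append(keep, t)
		}
	}
	return keep
}

// consumeSurvivors looks up which topics currently exist and starts
// consuming only from the wanted topics that are still present.
func consumeSurvivors(ctx context.Context, wanted []string,
	listTopics func() ([]string, error), consume consumeFunc) ([]string, error) {
	existing, err := listTopics()
	if err != nil {
		return nil, fmt.Errorf("listing topics: %w", err)
	}
	topics := existingOnly(wanted, existing)
	if len(topics) == 0 {
		return nil, errors.New("none of the requested topics exist")
	}
	return topics, consume(ctx, topics)
}

func main() {
	// Fake cluster state: topic2 has been deleted.
	listTopics := func() ([]string, error) { return []string{"topic1"}, nil }
	consume := func(ctx context.Context, topics []string) error {
		fmt.Println("consuming from", topics)
		return nil
	}
	topics, err := consumeSurvivors(context.Background(),
		[]string{"topic1", "topic2"}, listTopics, consume)
	fmt.Println(topics, err)
}
```

In a real application the previous session would first be stopped (for example by cancelling its context) before calling something like consumeSurvivors again with the reduced topic list.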

dnwe avatar Feb 11 '24 14:02 dnwe

Thank you for taking the time to raise this issue. However, it has not had any activity on it in the past 90 days and will be closed in 30 days if no updates occur. Please check if the main branch has already resolved the issue since it was raised. If you believe the issue is still valid and you would like input from the maintainers then please comment to ask for it to be reviewed.

github-actions[bot] avatar May 17 '24 22:05 github-actions[bot]