
AsyncProducer stops flushing retry buffers for specific partitions

Open · ericayin opened this issue on Mar 12, 2022

Versions

Please specify real version numbers or git SHAs, not just "Latest" since that changes fairly regularly.

Sarama | Kafka | Go
1.32.0 | 3.0.0 | 1.17.6
Configuration

What configuration values are you using for Sarama and Kafka?

config.Version = sarama.V3_0_0_0

config.Net.DialTimeout = 10 * time.Second
config.Net.ReadTimeout = 10 * time.Second
config.Net.WriteTimeout = 10 * time.Second

config.Producer.Flush.Bytes = 262144
config.Producer.Flush.Frequency = 2500 * time.Millisecond 
config.Producer.Return.Successes = true
config.Producer.Return.Errors = true
config.Producer.Compression = sarama.CompressionSnappy
config.Producer.Retry.Backoff = 500 * time.Millisecond

config.Metadata.Full = false
config.Metadata.RefreshFrequency = 5 * time.Minute
config.Metadata.Timeout = 5 * time.Second
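
For context, a minimal sketch of how a config like this might be wired into an AsyncProducer (the broker address below is a placeholder, not taken from the report):

// Hypothetical wiring; NewConfig and NewAsyncProducer are the real sarama APIs,
// the broker address is a placeholder.
config := sarama.NewConfig()
// ... apply the values listed above ...
producer, err := sarama.NewAsyncProducer([]string{"broker-0:9092"}, config)
if err != nil {
	log.Fatalf("failed to start async producer: %v", err)
}
defer producer.AsyncClose()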
Logs
Logs (where the issue occurred, excluding logs for other partitions):

15:27:22.853	producer/broker/2 state change to [retrying] on topic/22 because kafka server: Tried to send a message to a replica that is not the leader for some partition. Your metadata is out of date.
15:27:22.853	producer/leader/topic/22 state change to [retrying-1]
15:27:22.853	producer/broker/2 state change to [closed] on topic/22
15:27:22.853	producer/leader/topic/22 abandoning broker 2
15:27:22.854	producer/leader/topic/22 selected broker 0
15:27:22.854	producer/broker/0 state change to [open] on topic/22
15:27:22.854	producer/leader/topic/22 state change to [flushing-1]
15:27:22.854	producer/leader/topic/22 state change to [normal]
15:27:22.854	producer/broker/0 state change to [retrying] on topic/22 because kafka server: Tried to send a message to a replica that is not the leader for some partition. Your metadata is out of date.
15:27:22.973	producer/leader/topic/22 state change to [retrying-2]
15:27:22.973	producer/leader/topic/22 abandoning broker 0
15:27:22.973	producer/broker/0 state change to [closed] on topic/22
15:27:23.856	producer/leader/topic/22 selected broker 0
15:27:23.856	producer/broker/0 state change to [open] on topic/22
15:27:23.856	producer/leader/topic/22 state change to [flushing-2]
15:27:23.856	producer/leader/topic/22 state change to [normal]
15:27:23.856	producer/broker/0 state change to [retrying] on topic/22 because kafka server: Tried to send a message to a replica that is not the leader for some partition. Your metadata is out of date.
15:27:23.856	producer/leader/topic/22 state change to [retrying-3]
15:27:23.856	producer/leader/topic/22 abandoning broker 0
15:27:23.856	producer/broker/0 state change to [closed] on topic/22
15:27:23.857	producer/leader/topic/22 selected broker 0
15:27:23.857	producer/broker/0 state change to [open] on topic/22
15:27:23.857	producer/leader/topic/22 state change to [flushing-3]
15:27:23.857	producer/leader/topic/22 state change to [normal]
15:27:23.857	producer/broker/0 state change to [retrying] on topic/22 because kafka server: Tried to send a message to a replica that is not the leader for some partition. Your metadata is out of date.
15:27:23.857	producer/leader/topic/22 state change to [retrying-3]
15:27:23.857	producer/leader/topic/22 abandoning broker 0
15:27:23.857	producer/broker/0 state change to [closed] on topic/22
15:27:23.857	producer/broker/0 state change to [retrying] on topic/22 because kafka server: Tried to send a message to a replica that is not the leader for some partition. Your metadata is out of date.
15:27:24.858	producer/leader/topic/22 selected broker 0
15:27:24.858	producer/broker/0 state change to [open] on topic/22
15:27:24.858	producer/leader/topic/22 state change to [flushing-3]
15:27:24.858	producer/leader/topic/22 state change to [normal]
15:27:24.858	producer/leader/topic/22 state change to [retrying-1]
15:27:24.858	producer/leader/topic/22 abandoning broker 0
15:27:24.858	producer/leader/topic/22 selected broker 0
15:27:24.859	producer/broker/0 state change to [open] on topic/22
Logs (expected behavior, excluding logs for other partitions):

15:27:22.615	producer/broker/2 state change to [retrying] on topic/22 because kafka server: Tried to send a message to a replica that is not the leader for some partition. Your metadata is out of date.
15:27:22.615	producer/leader/topic/22 state change to [retrying-1]
15:27:22.616	producer/leader/topic/22 abandoning broker 2
15:27:22.616	producer/broker/2 state change to [closed] on topic/22
15:27:23.618	producer/leader/topic/22 selected broker 0
15:27:23.618	producer/broker/0 state change to [open] on topic/22
15:27:23.618	producer/leader/topic/22 state change to [flushing-1]
15:27:23.618	producer/leader/topic/22 state change to [normal]
15:27:23.740	producer/broker/0 state change to [retrying] on topic/22 because kafka server: Tried to send a message to a replica that is not the leader for some partition. Your metadata is out of date.
15:27:23.740	producer/leader/topic/22 state change to [retrying-2]
15:27:23.741	producer/broker/0 state change to [closed] on topic/22
15:27:23.741	producer/leader/topic/22 abandoning broker 0
15:27:23.862	producer/broker/2 state change to [open] on topic/22
15:27:23.862	producer/leader/topic/22 selected broker 2
15:27:23.862	producer/leader/topic/22 state change to [flushing-2]
15:27:23.862	producer/leader/topic/22 state change to [normal]
15:27:23.862	producer/leader/topic/22 state change to [retrying-3]
15:27:23.862	producer/leader/topic/22 abandoning broker 2
15:27:23.984	producer/leader/topic/22 selected broker 2
15:27:23.984	producer/broker/2 state change to [open] on topic/22
15:27:23.984	producer/leader/topic/22 state change to [flushing-3]
15:27:23.984	producer/leader/topic/22 state change to [normal]
15:27:23.984	producer/leader/topic/22 state change to [retrying-3]
15:27:24.096	producer/leader/topic/22 abandoning broker 2
15:27:24.985	producer/leader/topic/22 selected broker 0
15:27:24.985	producer/broker/0 state change to [open] on topic/22
15:27:24.985	producer/leader/topic/22 state change to [flushing-3]
15:27:24.985	producer/leader/topic/22 state change to [normal]
15:28:20.017	producer/broker/0 state change to [retrying] on topic/22 because kafka server: Tried to send a message to a replica that is not the leader for some partition. Your metadata is out of date.
15:28:20.017	producer/leader/topic/22 state change to [retrying-1]
15:28:20.017	producer/broker/0 state change to [closed] on topic/22
15:28:20.018	producer/leader/topic/22 abandoning broker 0
15:28:20.018	producer/leader/topic/22 selected broker 2
15:28:20.018	producer/broker/2 state change to [open] on topic/22
15:28:20.018	producer/leader/topic/22 state change to [flushing-1]
15:28:20.018	producer/leader/topic/22 state change to [normal]
Problem Description

I have an application that uses the AsyncProducer to write to a single topic. I was testing rolling restarts of our Kafka cluster, which is run by Strimzi, so I'm not really able to control the pace of the roll. I noticed an issue where during the rolls, some instances of the application appear to stop successfully writing to certain partitions, but continue to write to other partitions without any issues. I've waited up to 30 minutes to see if messages would eventually be sent.
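
For reference, since Return.Successes and Return.Errors are both enabled in the config above, an AsyncProducer application has to drain both channels while feeding Input(). A minimal sketch of that usage pattern (the topic name and message source are hypothetical, not from the report):

// Hypothetical send/drain loop; Input(), Successes() and Errors() are the
// real AsyncProducer channels, everything else is a placeholder.
go func() {
	for range producer.Successes() {
		// count or ack successful deliveries
	}
}()
go func() {
	for err := range producer.Errors() {
		log.Printf("delivery failed: %v", err)
	}
}()
for _, value := range pending { // pending [][]byte is a hypothetical source
	producer.Input() <- &sarama.ProducerMessage{
		Topic: "topic",
		Value: sarama.ByteEncoder(value),
	}
}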

My guess is that the retry buffers for those partitions don't get flushed for some reason, so all subsequent messages pile up in the buffer. I also saw growing memory usage, consistent with the suspected buffering, and when I ran Go's memory profiler, it confirmed that the growing memory was held by the allocations where each message's value byte array is initialized.
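
(For anyone wanting to reproduce that profiling step: exposing the heap profile only takes a blank import of net/http/pprof; the port below is arbitrary.)

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof handlers
)

func init() {
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
}

The growing allocations then show up under go tool pprof http://localhost:6060/debug/pprof/heap.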

Has anybody else run into this issue? Any insight or suggestions would be appreciated, thanks!

ericayin · Mar 12 '22

We have experienced the same issue during the rolling restart of our clusters. For now we have rolled back Sarama to v1.30.1, which solved the issue (v1.31 introduced multiple issues, v1.32 solved some of them). Probably related to https://github.com/Shopify/sarama/issues/2150.
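
(For anyone applying the same workaround, the pin with Go modules is just:

go get github.com/Shopify/sarama@v1.30.1
)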

dethi · Mar 14 '22

@dethi Thanks for following up. I tried downgrading and have stopped seeing the issue, so it does seem related to changes made in v1.31 and beyond.

I might be misreading, but #2150 sounds specific to the sync producer and how the expectation field is used? For what it's worth, there wasn't any synchronous blocking that stood out when I previously looked at our execution traces. Also, we were still able to write to a different partition on the same broker as the partitions that seemed stuck.

ericayin · Mar 14 '22

The SyncProducer uses the AsyncProducer under the hood so this is likely a duplicate of #2150.

When the issue occurs, do you also see a record with null key and value in your target topic?
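
(A quick way to check for such records, sketched with sarama's partition consumer; the broker address, topic name, and partition number below are placeholders:)

consumer, err := sarama.NewConsumer([]string{"broker-0:9092"}, sarama.NewConfig())
if err != nil {
	log.Fatal(err)
}
pc, err := consumer.ConsumePartition("topic", 22, sarama.OffsetOldest)
if err != nil {
	log.Fatal(err)
}
for msg := range pc.Messages() {
	if msg.Key == nil && msg.Value == nil {
		log.Printf("null key/value record at offset %d", msg.Offset)
	}
}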

slaunay · Mar 19 '22

Thank you for taking the time to raise this issue. However, it has not had any activity on it in the past 90 days and will be closed in 30 days if no updates occur. Please check if the main branch has already resolved the issue since it was raised. If you believe the issue is still valid and you would like input from the maintainers then please comment to ask for it to be reviewed.

github-actions[bot] · Aug 18 '23

As per @slaunay above, this is believed to have been fixed in v1.33.0 and newer.

dnwe · Aug 18 '23