AsyncProducer stops flushing retry buffers for specific partitions
Versions
Sarama | Kafka | Go |
---|---|---|
1.32.0 | 3.0.0 | 1.17.6 |
Configuration
What configuration values are you using for Sarama and Kafka?
config.Version = sarama.V3_0_0_0
config.Net.DialTimeout = 10 * time.Second
config.Net.ReadTimeout = 10 * time.Second
config.Net.WriteTimeout = 10 * time.Second
config.Producer.Flush.Bytes = 262144
config.Producer.Flush.Frequency = 2500 * time.Millisecond
config.Producer.Return.Successes = true
config.Producer.Return.Errors = true
config.Producer.Compression = sarama.CompressionSnappy
config.Producer.Retry.Backoff = 500 * time.Millisecond
config.Metadata.Full = false
config.Metadata.RefreshFrequency = 5 * time.Minute
config.Metadata.Timeout = 5 * time.Second
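For reference, a minimal sketch of how a producer could be built from these settings (the package name, the newProducer helper, and the broker list are placeholders, not taken from the actual application):

```go
package kafkaproducer

import (
	"time"

	"github.com/Shopify/sarama"
)

// newProducer is a hypothetical constructor showing how the configuration
// listed above maps onto sarama.Config before building the AsyncProducer.
func newProducer(brokers []string) (sarama.AsyncProducer, error) {
	config := sarama.NewConfig()
	config.Version = sarama.V3_0_0_0
	config.Net.DialTimeout = 10 * time.Second
	config.Net.ReadTimeout = 10 * time.Second
	config.Net.WriteTimeout = 10 * time.Second
	config.Producer.Flush.Bytes = 262144
	config.Producer.Flush.Frequency = 2500 * time.Millisecond
	config.Producer.Return.Successes = true
	config.Producer.Return.Errors = true
	config.Producer.Compression = sarama.CompressionSnappy
	config.Producer.Retry.Backoff = 500 * time.Millisecond
	config.Metadata.Full = false
	config.Metadata.RefreshFrequency = 5 * time.Minute
	config.Metadata.Timeout = 5 * time.Second

	return sarama.NewAsyncProducer(brokers, config)
}
```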
Logs
Logs (where the issue occurred, excluding logs for other partitions):
15:27:22.853 producer/broker/2 state change to [retrying] on topic/22 because kafka server: Tried to send a message to a replica that is not the leader for some partition. Your metadata is out of date.
15:27:22.853 producer/leader/topic/22 state change to [retrying-1]
15:27:22.853 producer/broker/2 state change to [closed] on topic/22
15:27:22.853 producer/leader/topic/22 abandoning broker 2
15:27:22.854 producer/leader/topic/22 selected broker 0
15:27:22.854 producer/broker/0 state change to [open] on topic/22
15:27:22.854 producer/leader/topic/22 state change to [flushing-1]
15:27:22.854 producer/leader/topic/22 state change to [normal]
15:27:22.854 producer/broker/0 state change to [retrying] on topic/22 because kafka server: Tried to send a message to a replica that is not the leader for some partition. Your metadata is out of date.
15:27:22.973 producer/leader/topic/22 state change to [retrying-2]
15:27:22.973 producer/leader/topic/22 abandoning broker 0
15:27:22.973 producer/broker/0 state change to [closed] on topic/22
15:27:23.856 producer/leader/topic/22 selected broker 0
15:27:23.856 producer/broker/0 state change to [open] on topic/22
15:27:23.856 producer/leader/topic/22 state change to [flushing-2]
15:27:23.856 producer/leader/topic/22 state change to [normal]
15:27:23.856 producer/broker/0 state change to [retrying] on topic/22 because kafka server: Tried to send a message to a replica that is not the leader for some partition. Your metadata is out of date.
15:27:23.856 producer/leader/topic/22 state change to [retrying-3]
15:27:23.856 producer/leader/topic/22 abandoning broker 0
15:27:23.856 producer/broker/0 state change to [closed] on topic/22
15:27:23.857 producer/leader/topic/22 selected broker 0
15:27:23.857 producer/broker/0 state change to [open] on topic/22
15:27:23.857 producer/leader/topic/22 state change to [flushing-3]
15:27:23.857 producer/leader/topic/22 state change to [normal]
15:27:23.857 producer/broker/0 state change to [retrying] on topic/22 because kafka server: Tried to send a message to a replica that is not the leader for some partition. Your metadata is out of date.
15:27:23.857 producer/leader/topic/22 state change to [retrying-3]
15:27:23.857 producer/leader/topic/22 abandoning broker 0
15:27:23.857 producer/broker/0 state change to [closed] on topic/22
15:27:23.857 producer/broker/0 state change to [retrying] on topic/22 because kafka server: Tried to send a message to a replica that is not the leader for some partition. Your metadata is out of date.
15:27:24.858 producer/leader/topic/22 selected broker 0
15:27:24.858 producer/broker/0 state change to [open] on topic/22
15:27:24.858 producer/leader/topic/22 state change to [flushing-3]
15:27:24.858 producer/leader/topic/22 state change to [normal]
15:27:24.858 producer/leader/topic/22 state change to [retrying-1]
15:27:24.858 producer/leader/topic/22 abandoning broker 0
15:27:24.858 producer/leader/topic/22 selected broker 0
15:27:24.859 producer/broker/0 state change to [open] on topic/22
Logs (expected behavior, excluding logs for other partitions):
15:27:22.615 producer/broker/2 state change to [retrying] on topic/22 because kafka server: Tried to send a message to a replica that is not the leader for some partition. Your metadata is out of date.
15:27:22.615 producer/leader/topic/22 state change to [retrying-1]
15:27:22.616 producer/leader/topic/22 abandoning broker 2
15:27:22.616 producer/broker/2 state change to [closed] on topic/22
15:27:23.618 producer/leader/topic/22 selected broker 0
15:27:23.618 producer/broker/0 state change to [open] on topic/22
15:27:23.618 producer/leader/topic/22 state change to [flushing-1]
15:27:23.618 producer/leader/topic/22 state change to [normal]
15:27:23.740 producer/broker/0 state change to [retrying] on topic/22 because kafka server: Tried to send a message to a replica that is not the leader for some partition. Your metadata is out of date.
15:27:23.740 producer/leader/topic/22 state change to [retrying-2]
15:27:23.741 producer/broker/0 state change to [closed] on topic/22
15:27:23.741 producer/leader/topic/22 abandoning broker 0
15:27:23.862 producer/broker/2 state change to [open] on topic/22
15:27:23.862 producer/leader/topic/22 selected broker 2
15:27:23.862 producer/leader/topic/22 state change to [flushing-2]
15:27:23.862 producer/leader/topic/22 state change to [normal]
15:27:23.862 producer/leader/topic/22 state change to [retrying-3]
15:27:23.862 producer/leader/topic/22 abandoning broker 2
15:27:23.984 producer/leader/topic/22 selected broker 2
15:27:23.984 producer/broker/2 state change to [open] on topic/22
15:27:23.984 producer/leader/topic/22 state change to [flushing-3]
15:27:23.984 producer/leader/topic/22 state change to [normal]
15:27:23.984 producer/leader/topic/22 state change to [retrying-3]
15:27:24.096 producer/leader/topic/22 abandoning broker 2
15:27:24.985 producer/leader/topic/22 selected broker 0
15:27:24.985 producer/broker/0 state change to [open] on topic/22
15:27:24.985 producer/leader/topic/22 state change to [flushing-3]
15:27:24.985 producer/leader/topic/22 state change to [normal]
15:28:20.017 producer/broker/0 state change to [retrying] on topic/22 because kafka server: Tried to send a message to a replica that is not the leader for some partition. Your metadata is out of date.
15:28:20.017 producer/leader/topic/22 state change to [retrying-1]
15:28:20.017 producer/broker/0 state change to [closed] on topic/22
15:28:20.018 producer/leader/topic/22 abandoning broker 0
15:28:20.018 producer/leader/topic/22 selected broker 2
15:28:20.018 producer/broker/2 state change to [open] on topic/22
15:28:20.018 producer/leader/topic/22 state change to [flushing-1]
15:28:20.018 producer/leader/topic/22 state change to [normal]
Problem Description
I have an application that uses the AsyncProducer to write to a single topic. I was testing rolling restarts of our Kafka cluster, which is run by Strimzi, so I'm not really able to control the pace of the roll. I noticed that during the rolls, some instances of the application appear to stop successfully writing to certain partitions while continuing to write to other partitions without any issues. I've waited up to 30 minutes to see if the messages would eventually be sent.
My guess is that the retry buffers for those partitions aren't getting flushed for some reason, so all subsequent messages pile up in them. I also saw memory usage growing, consistent with that buffering, and Go's memory profiler attributed the growth to the site where each message's value byte slice is allocated.
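As a purely illustrative way to make such a stall visible, a watchdog along these lines (monitorAcks and its thresholds are hypothetical; it assumes the same imports as the sketch above plus the standard library log package, and that Return.Successes and Return.Errors are enabled as in the config) can track the last successful ack per partition:

```go
// monitorAcks is a hypothetical watchdog: it records the last successful
// ack per partition from the Successes channel and periodically logs
// partitions that have stopped acking while others keep flowing.
func monitorAcks(producer sarama.AsyncProducer) {
	lastAck := make(map[int32]time.Time)
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case msg, ok := <-producer.Successes():
			if !ok {
				return
			}
			lastAck[msg.Partition] = time.Now()
		case perr, ok := <-producer.Errors():
			if !ok {
				return
			}
			log.Printf("produce error on partition %d: %v", perr.Msg.Partition, perr.Err)
		case <-ticker.C:
			for p, t := range lastAck {
				if time.Since(t) > time.Minute {
					log.Printf("partition %d: no successful ack for %s", p, time.Since(t))
				}
			}
		}
	}
}
```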
Has anybody else run into this issue? Any insight or suggestions would be appreciated, thanks!
We have experienced the same issue during rolling restarts of our clusters. For now we have rolled Sarama back to v1.30.1, which resolved the issue (v1.31 introduced multiple issues; v1.32 fixed some of them). Probably related to https://github.com/Shopify/sarama/issues/2150.
@dethi Thanks for following up. I tried downgrading and have stopped seeing the issue, so it does seem related to changes made in v1.31 and beyond.
I might be misreading, but #2150 sounds specific to the sync producer and how the expectation field is used? For what it's worth, though, there wasn't any sync blocking that stood out when I previously looked at our execution traces. Also, we were still able to write to other partitions on the same broker as the ones that seemed stuck.
The SyncProducer uses the AsyncProducer under the hood, so this is likely a duplicate of #2150.
When the issue occurs, do you also see a record with a null key and value in your target topic?
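For anyone who wants to spot-check that, a rough sketch of scanning the affected partition with a plain consumer (the broker address, topic name, and partition number are placeholders):

```go
package main

import (
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	// Placeholder broker address; use the cluster's bootstrap servers.
	consumer, err := sarama.NewConsumer([]string{"localhost:9092"}, sarama.NewConfig())
	if err != nil {
		log.Fatal(err)
	}
	defer consumer.Close()

	// Placeholder topic name and partition number (22 matches the logs above).
	pc, err := consumer.ConsumePartition("topic", 22, sarama.OffsetOldest)
	if err != nil {
		log.Fatal(err)
	}
	defer pc.Close()

	// Report any record whose key and value are both null.
	for msg := range pc.Messages() {
		if msg.Key == nil && msg.Value == nil {
			log.Printf("null key/value record at partition %d offset %d", msg.Partition, msg.Offset)
		}
	}
}
```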
Thank you for taking the time to raise this issue. However, it has not had any activity on it in the past 90 days and will be closed in 30 days if no updates occur. Please check if the main branch has already resolved the issue since it was raised. If you believe the issue is still valid and you would like input from the maintainers then please comment to ask for it to be reviewed.
As per @slaunay above, this is believed to have been fixed in v1.33.0 and newer.