librdkafka
producer continually refreshing client metadata after broker disconnection
Description
We are producing into a topic with 84 partitions over 8 brokers. We are seeing an issue where a recoverable error condition (difficult to reproduce, but an example would be a transport failure causing a broker connection to fail) triggers the underlying librdkafka library to continually refresh the metadata for this topic at the retry.backoff.max.ms interval, even though the metadata requests are successful and all partitions have leaders.
For example, here is the error condition occurring, triggered by a new broker connection breaking just ~8s after entering the UP state:
```
%7|1704816898.865|CONNECT|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Connected to ipv4#10.255.253.3:443
%7|1704816898.865|STATE|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Broker changed state CONNECT -> SSL_HANDSHAKE
%7|1704816898.865|TOPPAR|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: produce_topic [51] 6 message(s) queued but broker not up
%7|1704816898.886|TOPPAR|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: produce_topic [51] 6 message(s) queued but broker not up
%7|1704816898.888|TOPPAR|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: produce_topic [51] 6 message(s) queued but broker not up
%7|1704816898.907|TOPPAR|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: produce_topic [51] 6 message(s) queued but broker not up
%7|1704816898.907|CONNECTED|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Connected (#13)
%7|1704816898.907|STATE|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Broker changed state SSL_HANDSHAKE -> APIVERSION_QUERY
%7|1704816898.907|TOPPAR|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: produce_topic [51] 6 message(s) queued but broker not up
%7|1704816898.926|STATE|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Broker changed state APIVERSION_QUERY -> UP
%7|1704816898.926|METADATA|producer.host#producer-2| [thrd:ssl://broker-6:4]: Skipping metadata refresh of 1 topic(s): connected: already being requested
%7|1704816906.827|FAIL|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Disconnected (after 7900ms in state UP) (_TRANSPORT)
%6|1704816906.827|FAIL|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Disconnected (after 7900ms in state UP)
%7|1704816906.827|STATE|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Broker changed state UP -> DOWN
%7|1704816906.827|REQERR|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: ProduceRequest failed: Local: Broker transport failure: explicit actions Refresh,MsgPossiblyPersisted
%7|1704816906.827|BROKERUA|producer.host#producer-2| [thrd:ssl://broker-6:4]: produce_topic [51]: broker unavailable: produce: Local: Broker transport failure
%7|1704816906.827|FASTQUERY|producer.host#producer-2| [thrd:ssl://broker-6:4]: Starting fast leader query
%7|1704816906.827|METADATA|producer.host#producer-2| [thrd:ssl://broker-6:4]: Requesting metadata for 1/1 topics: broker down
%7|1704816906.827|METADATA|producer.host#producer-2| [thrd:main]: Requesting metadata for 1/1 topics: partition leader query
%7|1704816906.827|METADATA|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-5:443/5: Request metadata for 1 topic(s): broker down
%7|1704816906.827|METADATA|producer.host#producer-2| [thrd:main]: ssl://broker-5:443/5: Request metadata for 1 topic(s): partition leader query
%7|1704816906.827|STATE|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Broker changed state DOWN -> INIT
%7|1704816906.827|TOPPAR|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: produce_topic [51] 4 message(s) queued but broker not up
%7|1704816906.827|STATE|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Broker changed state INIT -> TRY_CONNECT
%7|1704816906.827|CONNECT|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: broker in state TRY_CONNECT connecting
%7|1704816906.827|STATE|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Broker changed state TRY_CONNECT -> CONNECT
%7|1704816906.843|METADATA|producer.host#producer-2| [thrd:main]: ssl://broker-5:443/5: ===== Received metadata (for 1 requested topics): broker down =====
%7|1704816906.858|METADATA|producer.host#producer-2| [thrd:main]: ssl://broker-5:443/5: 1/1 requested topic(s) seen in metadata
%7|1704816906.859|CONNECT|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Connected to ipv4#10.255.253.3:443
%7|1704816906.859|STATE|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Broker changed state CONNECT -> SSL_HANDSHAKE
%7|1704816906.859|TOPPAR|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: produce_topic [51] 4 message(s) queued but broker not up
%7|1704816906.878|TOPPAR|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: produce_topic [51] 4 message(s) queued but broker not up
%7|1704816906.880|TOPPAR|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: produce_topic [51] 4 message(s) queued but broker not up
%7|1704816906.897|TOPPAR|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: produce_topic [51] 4 message(s) queued but broker not up
%7|1704816906.897|CONNECTED|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Connected (#14)
%7|1704816906.897|STATE|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Broker changed state SSL_HANDSHAKE -> APIVERSION_QUERY
%7|1704816906.897|TOPPAR|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: produce_topic [51] 4 message(s) queued but broker not up
%7|1704816906.912|STATE|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Broker changed state APIVERSION_QUERY -> UP
%7|1704816906.912|METADATA|producer.host#producer-2| [thrd:ssl://broker-6:4]: Skipping metadata refresh of 1 topic(s): connected: already being requested
%7|1704816906.927|METADATA|producer.host#producer-2| [thrd:main]: Requesting metadata for 1/1 topics: partition leader query
```
And here we can see that even though every one of these metadata requests returns successfully (I've truncated the responses for readability) and all partitions have leaders, the client keeps refreshing the metadata as if something were broken. Note that during this the producer works fine and messages are being delivered fine, but we have noticed on our servers that the number of metadata requests from our clients is huge.
```
%7|1704816906.843|METADATA|producer.host#producer-2| [thrd:main]: ssl://broker-5:443/5: 1/1 requested topic(s) seen in metadata
%7|1704816906.858|METADATA|producer.host#producer-2| [thrd:main]: ssl://broker-5:443/5: ===== Received metadata (for 1 requested topics): partition leader query =====
%7|1704816906.858|METADATA|producer.host#producer-2| [thrd:main]: ssl://broker-5:443/5: 1/1 requested topic(s) seen in metadata
%7|1704816906.927|METADATA|producer.host#producer-2| [thrd:main]: Requesting metadata for 1/1 topics: partition leader query <-- 100ms later
%7|1704816906.943|METADATA|producer.host#producer-2| [thrd:main]: ssl://broker-5:443/5: 1/1 requested topic(s) seen in metadata
%7|1704816907.115|METADATA|producer.host#producer-2| [thrd:main]: Requesting metadata for 1/1 topics: partition leader query <-- 200ms later
%7|1704816907.131|METADATA|producer.host#producer-2| [thrd:main]: ssl://broker-5:443/5: 1/1 requested topic(s) seen in metadata
%7|1704816907.475|METADATA|producer.host#producer-2| [thrd:main]: Requesting metadata for 1/1 topics: partition leader query <-- 400ms later
%7|1704816907.491|METADATA|producer.host#producer-2| [thrd:main]: ssl://broker-5:443/5: 1/1 requested topic(s) seen in metadata
%7|1704816908.227|METADATA|producer.host#producer-2| [thrd:main]: Requesting metadata for 1/1 topics: partition leader query <-- 800ms later
%7|1704816908.243|METADATA|producer.host#producer-2| [thrd:main]: ssl://broker-5:443/5: 1/1 requested topic(s) seen in metadata
%7|1704816909.227|METADATA|producer.host#producer-2| [thrd:main]: Requesting metadata for 1/1 topics: partition leader query <-- 1s later
```
It looks to me like there could be some sort of race condition around broker disconnects/reconnects and metadata refreshing?
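For what it's worth, the intervals annotated above (100 ms, 200 ms, 400 ms, 800 ms, then 1 s) look like an exponential backoff that caps at retry.backoff.max.ms and then keeps firing at the cap. A rough sketch of that schedule as I read the timestamps (an illustration only, not librdkafka's actual code):

```c
/* Illustration of the metadata refresh cadence implied by the timestamps
 * above; this is my reading of the logs, not librdkafka's implementation. */
#include <stdio.h>

int main(void) {
    int backoff_ms = 100;                   /* first gap between leader queries seen above */
    const int retry_backoff_max_ms = 1000;  /* librdkafka default for retry.backoff.max.ms */

    for (int i = 0; i < 8; i++) {
        printf("next partition leader query in %d ms\n", backoff_ms);
        backoff_ms *= 2;
        if (backoff_ms > retry_backoff_max_ms)
            backoff_ms = retry_backoff_max_ms;  /* once capped, it repeats every 1 s */
    }
    return 0;
}
```

The problem is that this loop never stops once the metadata comes back healthy, so the producer keeps hitting the cluster at the cap indefinitely.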
Checklist
IMPORTANT: We will close issues where the checklist has not been completed.
Please provide the following information:
- [x] librdkafka version (release number or git tag): 2.3.0
- [x] Apache Kafka version: 3.4.0
- [x] librdkafka client configuration: `message.max.bytes=25165824;socket.timeout.ms=10000;socket.keepalive.enable=true;debug=metadata,broker,topic`
- [x] Operating system: centos7
- [x] Provide logs (with `debug=..` as necessary) from librdkafka
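For reference, that configuration is applied roughly like this through the plain C API (a minimal sketch; the broker address is a placeholder, and SSL settings and error handling are omitted):

```c
#include <librdkafka/rdkafka.h>
#include <stdio.h>

/* Minimal producer setup mirroring the configuration listed above.
 * The broker address is a placeholder; SSL settings and proper error
 * handling are omitted for brevity. */
static rd_kafka_t *make_producer(void) {
    char errstr[512];
    rd_kafka_conf_t *conf = rd_kafka_conf_new();

    rd_kafka_conf_set(conf, "bootstrap.servers", "broker-1:443", errstr, sizeof(errstr));
    rd_kafka_conf_set(conf, "message.max.bytes", "25165824", errstr, sizeof(errstr));
    rd_kafka_conf_set(conf, "socket.timeout.ms", "10000", errstr, sizeof(errstr));
    rd_kafka_conf_set(conf, "socket.keepalive.enable", "true", errstr, sizeof(errstr));
    rd_kafka_conf_set(conf, "debug", "metadata,broker,topic", errstr, sizeof(errstr));

    rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_PRODUCER, conf, errstr, sizeof(errstr));
    if (!rk)
        fprintf(stderr, "failed to create producer: %s\n", errstr);
    /* on success the conf object is owned by rk */
    return rk;
}
```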
Reverting to 2.2.0 has fixed the issue, so I assume this is a regression introduced in 2.3.0.
I can confirm this as well; combined with the very low default value of retry.backoff.max.ms (1s), the latest version of librdkafka can easily turn into a DDoS cannon the moment a retriable error is encountered.
Wouldn't increasing retry.backoff.max.ms mitigate this?
ehh... if by mitigate you mean "reduce the impact of the bug on the kafka cluster" - it doesn't actually solve the issue at hand.
Yeah, that's a viable workaround in the interim, but this behavior is clearly a bug, and there is a lot of software out there that isn't going to increase that value. It's really common for people to run thousands of Kafka clients/producers, and one day they're going to upgrade and everything will work fine until one of their connections has an intermittent error in the middle of the night and their cluster gets melted :/
@jpiper sorry, this is what I meant. I am a maintainer of the Ruby bindings and I usually do my best to ship intermediate fixes when stuff like this occurs, which is why I asked. I may consider changing the defaults for my lib until a patch is released.
// edit - I see the default for librdkafka is 1 second (1000 ms), so as long as it's not altered, it should be ok.
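At the librdkafka level the interim mitigation is just another config entry; a minimal sketch in plain C, where 5000 ms is only an example value, not a recommendation:

```c
#include <librdkafka/rdkafka.h>

/* Interim mitigation sketch: raise the backoff cap so a stuck refresh loop
 * hits the cluster less often. 5000 ms is an arbitrary example value; it
 * only spreads the extra metadata requests out, it does not remove them. */
static void relax_retry_backoff(rd_kafka_conf_t *conf) {
    char errstr[512];
    rd_kafka_conf_set(conf, "retry.backoff.max.ms", "5000", errstr, sizeof(errstr));
}
```

Bindings generally pass these options through by name, so the same override should be possible from Ruby and friends.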
... until one of their connections has an intermittent error in the middle of the night and their cluster gets melted :/
I feel seen
There is a problem in v2.3.0: when metadata is refreshed without any leader changes, the fast metadata refresh is not stopped. We'll include the fix in the next release.
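For anyone following along, my reading of that is roughly the following (an illustrative sketch of the intended behaviour, not librdkafka's actual internals): the fast leader query should be cancelled once a metadata response shows that every partition has a leader, and in v2.3.0 that cancellation was skipped when the refresh reported no leader changes.

```c
/* Illustrative sketch only; the names and structure here are invented for
 * this comment and do not correspond to librdkafka's internal code. */
struct topic_metadata {
    int partition_cnt;
    int leaderless_cnt;   /* partitions currently without a leader */
};

static int fast_leader_query_active;

static void on_metadata_response(const struct topic_metadata *md) {
    if (md->leaderless_cnt == 0) {
        /* All partitions have leaders: the fast, exponentially backed-off
         * refresh should stop here even if nothing changed compared to the
         * previous metadata. In v2.3.0 it kept going in that "no leader
         * changes" case, which is the behaviour reported above. */
        fast_leader_query_active = 0;
    }
}
```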