
Client connection failures during Kafka rolling upgrade when broker listener configuration changes

Open DDDFiish opened this issue 3 months ago • 2 comments

Description

During a rolling upgrade of a multi-node Kafka cluster, we change the broker listener configuration in several steps, restarting the brokers one by one after each change. Once the final listener configuration is applied and the brokers have been restarted, Kafka clients using librdkafka experience connection failures until the client process is restarted.
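For reference, the clients are plain librdkafka clients connecting to the SASL_PLAINTEXT listener on port 9092. Below is a minimal sketch of that kind of setup; the hostname, credentials and SASL mechanism are placeholders, not our exact configuration:

    #include <librdkafka/rdkafka.h>
    #include <stdio.h>

    int main(void) {
            char errstr[512];
            rd_kafka_conf_t *conf = rd_kafka_conf_new();

            /* Placeholder values for illustration only; return codes of
             * rd_kafka_conf_set() omitted for brevity. */
            rd_kafka_conf_set(conf, "bootstrap.servers", "khazad13:9092",
                              errstr, sizeof(errstr));
            rd_kafka_conf_set(conf, "security.protocol", "SASL_PLAINTEXT",
                              errstr, sizeof(errstr));
            rd_kafka_conf_set(conf, "sasl.mechanisms", "PLAIN",
                              errstr, sizeof(errstr));
            rd_kafka_conf_set(conf, "sasl.username", "client-user",
                              errstr, sizeof(errstr));
            rd_kafka_conf_set(conf, "sasl.password", "client-password",
                              errstr, sizeof(errstr));

            rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_PRODUCER, conf,
                                          errstr, sizeof(errstr));
            if (!rk) {
                    fprintf(stderr, "rd_kafka_new failed: %s\n", errstr);
                    return 1;
            }

            /* ... normal produce/consume activity ... */

            rd_kafka_destroy(rk);
            return 0;
    }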

Environment

  • Kafka version: 3.9.0
  • librdkafka version: below 2.10.0

Upgrade Steps

We apply the following configuration changes step by step, restarting brokers after each change:

  1. listener.security.protocol.map=BROKER:SASL_PLAINTEXT,CONTROLLER:SASL_PLAINTEXT,SASL_PLAINTEXT:SASL_PLAINTEXT
    listeners=SASL_PLAINTEXT://<hostname>:9092,BROKER://<hostname>:9094
    inter.broker.listener.name=SASL_PLAINTEXT
    
  2. listener.security.protocol.map=BROKER:SASL_PLAINTEXT,CONTROLLER:SASL_PLAINTEXT,SASL_PLAINTEXT:SASL_PLAINTEXT
    listeners=SASL_PLAINTEXT://<hostname>:9092,BROKER://<hostname>:9094
    inter.broker.listener.name=BROKER
    
  3. listener.security.protocol.map=BROKER:SASL_PLAINTEXT,CONTROLLER:SASL_PLAINTEXT,SASL_PLAINTEXT:SASL_PLAINTEXT
    listeners=BROKER://<hostname>:9092,SASL_PLAINTEXT://<hostname>:9094
    inter.broker.listener.name=BROKER
    
  4. listener.security.protocol.map=BROKER:SASL_PLAINTEXT,CONTROLLER:SASL_PLAINTEXT,SASL_PLAINTEXT:SASL_PLAINTEXT
    listeners=BROKER://<hostname>:9092
    inter.broker.listener.name=BROKER
    

After step 4, when a broker is restarted, clients start reporting connection errors such as:

BrokerTransportFailure (Local: Broker transport failure): sasl_plaintext://khazad13:9092/167843919: Connection setup timed out in state CONNECT (after 30029ms in state CONNECT, 1 identical error(s) suppressed)
Connect to ipv4#[10.1.24.76:9094] failed: Connection refused (after 0ms in state CONNECT, 6 identical error(s) suppressed)

The issue appears to be that, after the listener change, the client keeps trying to connect to the previous listener address (in the log above, khazad13 on port 9094, i.e. ipv4#[10.1.24.76:9094]), even though that listener no longer exists after step 4. The errors persist until the client process itself is restarted, after which everything works fine.
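If it helps to confirm this, the broker addresses the cluster currently advertises can be compared against what the client is dialing by dumping cluster metadata from the same handle (provided it can still reach at least one broker). A minimal sketch, assuming rk is an existing client handle:

    #include <librdkafka/rdkafka.h>
    #include <stdio.h>

    /* Print the broker id/host/port list advertised by the cluster. */
    static void dump_advertised_brokers(rd_kafka_t *rk) {
            const struct rd_kafka_metadata *md;
            rd_kafka_resp_err_t err =
                    rd_kafka_metadata(rk, 0 /* cached topics only */, NULL,
                                      &md, 10000 /* timeout ms */);
            if (err) {
                    fprintf(stderr, "metadata request failed: %s\n",
                            rd_kafka_err2str(err));
                    return;
            }
            for (int i = 0; i < md->broker_cnt; i++)
                    printf("broker %d: %s:%d\n", (int)md->brokers[i].id,
                           md->brokers[i].host, md->brokers[i].port);
            rd_kafka_metadata_destroy(md);
    }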

Observations

  • The issue occurs with older versions of librdkafka (e.g. 2.8.0).
  • With librdkafka 2.10.0, the issue still happens during the upgrade, but after all brokers are restarted, the clients recover without requiring a restart.
  • According to the changelog, there are fixes related to broker identification and removal of unavailable brokers (#4557, #4970). Is this behavior expected? Has the issue been fully resolved in 2.10.0 or later?

Questions

  • Is this client-side connection failure expected during rolling upgrade with listener changes?
  • Is a client restart the only workaround for older librdkafka versions? (A sketch of a programmatic equivalent follows this list.)
  • Is this issue considered resolved in recent librdkafka versions, or are there recommended best practices for Kafka upgrades involving listener changes?
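
On the workaround question above: a programmatic equivalent of restarting the client process would be to detect the persistent failure from the error callback and recreate the handle, which re-resolves bootstrap.servers. A rough sketch; the flag handling and grace-period policy are illustrative, not something librdkafka prescribes:

    #include <librdkafka/rdkafka.h>
    #include <stdio.h>

    /* Set from the error callback when librdkafka reports that all
     * brokers are down; the application polls this flag. */
    static volatile int all_brokers_down = 0;

    static void error_cb(rd_kafka_t *rk, int err, const char *reason,
                         void *opaque) {
            (void)rk;
            (void)opaque;
            if (err == RD_KAFKA_RESP_ERR__ALL_BROKERS_DOWN)
                    all_brokers_down = 1;
            fprintf(stderr, "error: %s: %s\n",
                    rd_kafka_err2str((rd_kafka_resp_err_t)err), reason);
    }

    /* Register with rd_kafka_conf_set_error_cb(conf, error_cb) before
     * creating the handle. If the flag stays set past an
     * application-chosen grace period, destroy the handle with
     * rd_kafka_destroy() and build a new one from a fresh conf. */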

Thanks in advance for your help!

DDDFiish · Aug 25 '25

The same issue occurs after a Kafka pod is restarted, whether gracefully or not.

casablancaml · Nov 19 '25

@casablancaml @DDDFiish I have tried to add a fix that force-invalidates the DNS cache when the relevant event is received: https://github.com/confluentinc/librdkafka/pull/5267.

sinhashubham95 · Dec 08 '25