
Client connection failures during Kafka rolling upgrade when broker listener configuration changes

Open DDDFiish opened this issue 3 months ago • 2 comments

Description

During a rolling upgrade of a multi-node Kafka cluster, we change the broker listener configuration in several steps, restarting the brokers one by one after each change. Once the final listener configuration is applied and the brokers have been restarted, Kafka clients using librdkafka experience connection failures until the client process is restarted.
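For reference, the clients are plain librdkafka clients connecting to the SASL_PLAINTEXT listener on port 9092. Below is a minimal sketch of that kind of setup; the hostname, credentials and SASL mechanism are placeholders, not our exact configuration:

    #include <librdkafka/rdkafka.h>
    #include <stdio.h>

    int main(void) {
            char errstr[512];
            rd_kafka_conf_t *conf = rd_kafka_conf_new();

            /* Placeholder values for illustration only; return codes of
             * rd_kafka_conf_set() omitted for brevity. */
            rd_kafka_conf_set(conf, "bootstrap.servers", "khazad13:9092",
                              errstr, sizeof(errstr));
            rd_kafka_conf_set(conf, "security.protocol", "SASL_PLAINTEXT",
                              errstr, sizeof(errstr));
            rd_kafka_conf_set(conf, "sasl.mechanisms", "PLAIN",
                              errstr, sizeof(errstr));
            rd_kafka_conf_set(conf, "sasl.username", "client-user",
                              errstr, sizeof(errstr));
            rd_kafka_conf_set(conf, "sasl.password", "client-password",
                              errstr, sizeof(errstr));

            rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_PRODUCER, conf,
                                          errstr, sizeof(errstr));
            if (!rk) {
                    fprintf(stderr, "rd_kafka_new failed: %s\n", errstr);
                    return 1;
            }

            /* ... normal produce/consume activity ... */

            rd_kafka_destroy(rk);
            return 0;
    }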

Environment

  • Kafka version: 3.9.0
  • librdkafka version: below 2.10.0

Upgrade Steps

We apply the following configuration changes step by step, restarting brokers after each change:

  1. listener.security.protocol.map=BROKER:SASL_PLAINTEXT,CONTROLLER:SASL_PLAINTEXT,SASL_PLAINTEXT:SASL_PLAINTEXT
    listeners=SASL_PLAINTEXT://<hostname>:9092,BROKER://<hostname>:9094
    inter.broker.listener.name=SASL_PLAINTEXT
    
  2. listener.security.protocol.map=BROKER:SASL_PLAINTEXT,CONTROLLER:SASL_PLAINTEXT,SASL_PLAINTEXT:SASL_PLAINTEXT
    listeners=SASL_PLAINTEXT://<hostname>:9092,BROKER://<hostname>:9094
    inter.broker.listener.name=BROKER
    
  3. listener.security.protocol.map=BROKER:SASL_PLAINTEXT,CONTROLLER:SASL_PLAINTEXT,SASL_PLAINTEXT:SASL_PLAINTEXT
    listeners=BROKER://<hostname>:9092,SASL_PLAINTEXT://<hostname>:9094
    inter.broker.listener.name=BROKER
    
  4. listener.security.protocol.map=BROKER:SASL_PLAINTEXT,CONTROLLER:SASL_PLAINTEXT,SASL_PLAINTEXT:SASL_PLAINTEXT
    listeners=BROKER://<hostname>:9092
    inter.broker.listener.name=BROKER
    

After step 4, when a broker is restarted, clients start reporting connection errors such as:

BrokerTransportFailure (Local: Broker transport failure): sasl_plaintext://khazad13:9092/167843919: Connection setup timed out in state CONNECT (after 30029ms in state CONNECT, 1 identical error(s) suppressed)
Connect to ipv4#[10.1.24.76:9094] failed: Connection refused (after 0ms in state CONNECT, 6 identical error(s) suppressed)

The issue appears to be that, after the listener change, the client keeps trying to connect to the previous listener address (in the log above, khazad13 on port 9094, i.e. ipv4#[10.1.24.76:9094]), even though that listener no longer exists after step 4. The errors persist until the client process itself is restarted, after which everything works fine.
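If it helps to confirm this, the broker addresses the cluster currently advertises can be compared against what the client is dialing by dumping cluster metadata from the same handle (provided it can still reach at least one broker). A minimal sketch, assuming rk is an existing client handle:

    #include <librdkafka/rdkafka.h>
    #include <stdio.h>

    /* Print the broker id/host/port list advertised by the cluster. */
    static void dump_advertised_brokers(rd_kafka_t *rk) {
            const struct rd_kafka_metadata *md;
            rd_kafka_resp_err_t err =
                    rd_kafka_metadata(rk, 0 /* cached topics only */, NULL,
                                      &md, 10000 /* timeout ms */);
            if (err) {
                    fprintf(stderr, "metadata request failed: %s\n",
                            rd_kafka_err2str(err));
                    return;
            }
            for (int i = 0; i < md->broker_cnt; i++)
                    printf("broker %d: %s:%d\n", (int)md->brokers[i].id,
                           md->brokers[i].host, md->brokers[i].port);
            rd_kafka_metadata_destroy(md);
    }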

Observations

  • The issue occurs with older versions of librdkafka (e.g. 2.8.0).
  • With librdkafka 2.10.0, the issue still happens during the upgrade, but after all brokers are restarted, the clients recover without requiring a restart.
  • According to the changelog, there are fixes related to broker identification and removal of unavailable brokers (#4557, #4970). Is this behavior expected? Has the issue been fully resolved in 2.10.0 or later?

Questions

  • Is this client-side connection failure expected during rolling upgrade with listener changes?
  • Is a client restart the only workaround for older librdkafka versions? (A sketch of a programmatic equivalent follows this list.)
  • Is this issue considered resolved in recent librdkafka versions, or are there recommended best practices for Kafka upgrades involving listener changes?
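
On the workaround question above: a programmatic equivalent of restarting the client process would be to detect the persistent failure from the error callback and recreate the handle, which re-resolves bootstrap.servers. A rough sketch; the flag handling and grace-period policy are illustrative, not something librdkafka prescribes:

    #include <librdkafka/rdkafka.h>
    #include <stdio.h>

    /* Set from the error callback when librdkafka reports that all
     * brokers are down; the application polls this flag. */
    static volatile int all_brokers_down = 0;

    static void error_cb(rd_kafka_t *rk, int err, const char *reason,
                         void *opaque) {
            (void)rk;
            (void)opaque;
            if (err == RD_KAFKA_RESP_ERR__ALL_BROKERS_DOWN)
                    all_brokers_down = 1;
            fprintf(stderr, "error: %s: %s\n",
                    rd_kafka_err2str((rd_kafka_resp_err_t)err), reason);
    }

    /* Register with rd_kafka_conf_set_error_cb(conf, error_cb) before
     * creating the handle. If the flag stays set past an
     * application-chosen grace period, destroy the handle with
     * rd_kafka_destroy() and build a new one from a fresh conf. */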

Thanks in advance for your help!

DDDFiish · Aug 25 '25

The same issue occurs after a Kafka pod is restarted, whether gracefully or not.

casablancaml · Nov 19 '25

@casablancaml @DDDFiish I have tried to add a fix that force-invalidates the DNS cache when the relevant event is received: https://github.com/confluentinc/librdkafka/pull/5267.

sinhashubham95 · Dec 08 '25