librdkafka
partition disconnects for ~60s after scaling down consumers
Description
When scaling the consumer replica count down from 5 to 4, for instance, there is a ~60s disconnection before the consumer starts reading messages from one of the partitions again.
a. Using Confluent Kafka SDK v2.1.1 with 2 consumers, both in the same consumer group consuming from the same topic.
b. Consumer1 owns partitions 0, 1, and 2; Consumer2 owns partitions 3 and 4.
c. Consumer1 is scaled down, which triggers a rebalance in Consumer2 within 10 ms. The rebalance completes quickly and Consumer2 now owns all partitions.
d. The partitions are then queried for events. The log indicates that events are queried for all partitions. The requests complete for partitions 1, 2, 3, and 4, the events are enqueued, and consumption immediately follows for those partitions.
e. 60 seconds after the partitions were checked for events, the request for partition 0's events times out. The request is retried, finally receives the results, and enqueues them; all queued messages are consumed in under 4 ms. The events for partition 0 are then consumed in real time.
We have tried changing the values of the following settings, among others:
session.timeout.ms
heartbeat.interval.ms
connections.max.idle.ms
max.poll.interval.ms
metadata.max.age.ms
replica.lag.time.max.ms
request.timeout.ms
AutoCommitIntervalMs
CoordinatorQueryIntervalMs
FetchMaxBytes
TopicMetadataRefreshIntervalMs
SocketKeepaliveEnable
Tweaking the above values did not resolve the behavior (a configuration sketch with these settings is shown below). We have logs available and can reproduce the issue.
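For reference, a minimal sketch of these settings on a confluent-kafka-dotnet `ConsumerConfig` (the values below are placeholders, not the ones we tested; `replica.lag.time.max.ms` is a broker-side setting and has no client property):

```csharp
using Confluent.Kafka;

// Placeholder values for illustration only.
var config = new ConsumerConfig
{
    BootstrapServers = "<namespace>.servicebus.windows.net:9093", // placeholder
    GroupId = "<consumer-group>",                                 // placeholder
    SessionTimeoutMs = 45000,                // session.timeout.ms
    HeartbeatIntervalMs = 3000,              // heartbeat.interval.ms
    ConnectionsMaxIdleMs = 540000,           // connections.max.idle.ms
    MaxPollIntervalMs = 300000,              // max.poll.interval.ms
    MetadataMaxAgeMs = 180000,               // metadata.max.age.ms
    RequestTimeoutMs = 30000,                // request.timeout.ms
    AutoCommitIntervalMs = 5000,
    CoordinatorQueryIntervalMs = 600000,
    FetchMaxBytes = 52428800,
    TopicMetadataRefreshIntervalMs = 300000,
    SocketKeepaliveEnable = true
};
```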
Checking the logs, we noticed the issue is that the Fetch request is being made on a connection which is in a bad state, so the Fetch request waits until it times out. Once the timeout occurs, the client initiates a fresh connection and the subsequent request works.
The client opens multiple connections to the Event Hubs namespace, depending on the number of partitions plus the number of virtualUriHost entries (virtual node IDs) returned by Event Hubs in the metadata.
However, as the repro maintains a 1:1 ratio between partitions and consumers, it seems only one connection is in use and the rest of them go idle.
Now, once a consumer is scaled in, its partition moves to one of the active consumers, and the Fetch request is made on a connection which is idle/in a bad state. We tried tweaking some of the configuration, such as the metadata refresh interval and socket keepalive, but none of it helped. Reducing the socket timeout does reduce the wait time (see the sketch below).
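A sketch of that socket-timeout workaround (the value below is an assumption for illustration, not a recommendation):

```csharp
var config = new ConsumerConfig
{
    // ...other settings as above...
    // socket.timeout.ms defaults to 60000 ms, which matches the ~60 s Fetch delay observed.
    // Lowering it makes the client give up on the stale connection and reconnect sooner.
    SocketTimeoutMs = 10000 // illustrative value
};
```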
The Confluent Kafka library should identify that the connection is in a bad state before making the Fetch request.
There are other threads with similar symptoms like:
- https://github.com/confluentinc/librdkafka/issues/3427
- https://github.com/confluentinc/librdkafka/discussions/4039
- https://github.com/confluentinc/librdkafka/issues/3578
- https://github.com/confluentinc/confluent-kafka-dotnet/issues/2117
- https://github.com/confluentinc/confluent-kafka-dotnet/issues/1876
How to reproduce
This was tested using Azure Event Hubs
- Create an Event Hub Premium with at least five partitions.
- Create a producer that produces a constant stream of events
- Create five consumers with debug logs enabled (a consumer sketch is shown after this list)
- Let the consumers consume for at least six minutes
- Scale down the replica count from 5 to 4.
- Wait 60 seconds
- The following can be observed in the logs of one of the consumers:
  - Within 1 second a rebalance occurs and successfully completes.
  - The consumer pod requests the events for each of the partitions it owns. Some of the requests may succeed within a second, but you may notice that some of them do not complete.
  - 60 seconds after the request was issued, Confluent Kafka times out the request.
  - Immediately after the timeout, Confluent Kafka resends the request.
  - The second request completes successfully after 1 second.
  - The consumer begins consuming the events for that partition.
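A minimal sketch of the kind of consumer used in this repro, assuming the Event Hubs Kafka endpoint; the endpoint, event hub name, group ID, and credentials are placeholders:

```csharp
using System;
using Confluent.Kafka;

var config = new ConsumerConfig
{
    BootstrapServers = "<namespace>.servicebus.windows.net:9093", // placeholder
    SecurityProtocol = SecurityProtocol.SaslSsl,
    SaslMechanism = SaslMechanism.Plain,
    SaslUsername = "$ConnectionString",
    SaslPassword = "<event-hubs-connection-string>",              // placeholder
    GroupId = "repro-consumer-group",                             // placeholder
    AutoOffsetReset = AutoOffsetReset.Earliest,
    Debug = "broker,protocol,fetch" // emit the librdkafka debug logs referenced above
};

using var consumer = new ConsumerBuilder<string, string>(config)
    .SetLogHandler((_, log) => Console.WriteLine($"{log.Level} {log.Facility}: {log.Message}"))
    .Build();

consumer.Subscribe("<event-hub-name>"); // placeholder topic name

// Consume in a loop so partition ownership and Fetch behavior can be observed.
while (true)
{
    var result = consumer.Consume(TimeSpan.FromSeconds(1));
    if (result != null)
        Console.WriteLine($"{result.TopicPartitionOffset}: {result.Message.Value}");
}
```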
Checklist
IMPORTANT: We will close issues where the checklist has not been completed.
Please provide the following information:
- [x] librdkafka version (release number or git tag): Confluent Kafka SDK v2.1.1
- [x] librdkafka client configuration: provided above
- [x] Operating system: Ubuntu
- [x] Provide logs (with `debug=..` as necessary) from librdkafka --> available but cannot share publicly
- [x] Provide broker log excerpts --> available but cannot share publicly
- [ ] Critical issue
Hello team, I would like to follow up on this issue. Our investigation shows that this behavior is controlled by the SDK and not the service. Any assistance will be really appreciated. Thanks!
@gereyhe have you been able to reproduce this issue with the v2.3.0 release?
@qrpike hello, we tried a config that avoids the Fetch delay noticed on scale-in. If we set ConnectionsMaxIdleMs lower than MetadataMaxAgeMs, the client avoids ending up with a bad connection and the delay on Fetch goes away. Here are the values we tried:
MetadataMaxAgeMs = 180000, ConnectionsMaxIdleMs = 150000,
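In `ConsumerConfig` terms this is roughly the following (other settings omitted; the properties map to `metadata.max.age.ms` and `connections.max.idle.ms`):

```csharp
var config = new ConsumerConfig
{
    // ...bootstrap, auth, and group settings as before...
    MetadataMaxAgeMs = 180000,     // metadata.max.age.ms
    ConnectionsMaxIdleMs = 150000, // connections.max.idle.ms, kept below metadata.max.age.ms
    // With the idle timeout below the metadata age, the client closes idle
    // connections itself instead of leaving them to go stale, so the Fetch
    // issued after a rebalance is not sent on a dead connection.
};
```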