
Broker connection shut down too early and follower broker is not becoming leader during broker shutdown

Open iamnaman863 opened this issue 2 years ago • 3 comments

Versions

Please specify real version numbers or git SHAs, not just "Latest" since that changes fairly regularly.

Sarama: 1.31.1
Kafka: V2_0_0_0
Go: 1.17
Configuration

What configuration values are you using for Sarama and Kafka?

	if config.Metadata.Retry.Backoff == 0 {
		config.Metadata.Retry.Backoff = 100000000 // 100ms
	}

	if config.Metadata.Retry.Max == 0 {
		config.Metadata.Retry.Max = 5
	}

	if config.Producer.Retry.Backoff == 0 {
		config.Producer.Retry.Backoff = 5000000 // 5ms
	}

	if config.Producer.Timeout == 0 {
		config.Producer.Timeout = 300000000 // 300ms
	}

	if config.Producer.RequiredAcks == 0 {
		config.Producer.RequiredAcks = 1 // sarama.WaitForLocal
	}

	config.Producer.Return.Successes = true
	config.Producer.Return.Errors = true

	if config.Consumer.Retry.Backoff == 0 {
		config.Consumer.Retry.Backoff = 5000000 // 5ms
	}

	if config.Consumer.MaxWaitTime == 0 {
		config.Consumer.MaxWaitTime = 300000000 // 300ms
	}

	if config.Consumer.Group.Heartbeat.Interval == 0 {
		config.Consumer.Group.Heartbeat.Interval = 1000000 // 1ms
	}
Logs

When filing an issue please provide logs from Sarama and Kafka if at all possible. You can set sarama.Logger to a log.Logger to capture Sarama debug output.


DEBU [13:22:15]  Initializing new client

                 (Memory: 6104712 GoRoutines: 2)

DEBU [13:22:15]  ClientID is the default of 'sarama', you should consider setting it to something application-specific.

                 (Memory: 6106448 GoRoutines: 2)

DEBU [13:22:15]  client/metadata fetching metadata for all topics from broker pkc-pgq85.us-west-2.aws.confluent.cloud:9092

                 (Memory: 6118696 GoRoutines: 3)

DEBU [13:22:16]  Successful SASL handshake. Available mechanisms:  [PLAIN OAUTHBEARER]

                 (Memory: 6235952 GoRoutines: 4)

DEBU [13:22:17]  SASL authentication successful with broker pkc-pgq85.us-west-2.aws.confluent.cloud:9092:4 - [0 0 0 0]

                 (Memory: 6238352 GoRoutines: 4)

DEBU [13:22:17]  Connected to broker at pkc-pgq85.us-west-2.aws.confluent.cloud:9092 (unregistered)

                 (Memory: 6240688 GoRoutines: 4)

DEBU [13:22:17]  client/brokers registered new broker #0 at b0-pkc-pgq85.us-west-2.aws.confluent.cloud:9092

                 (Memory: 6277576 GoRoutines: 4)

DEBU [13:22:17]  client/brokers registered new broker #5 at b5-pkc-pgq85.us-west-2.aws.confluent.cloud:9092

                 (Memory: 6279184 GoRoutines: 4)

Same log for brokers (10, 20, 15, 25, 26, 9, 11, 4, 16, 17, 3, 24, 23, 28, 13, 18, 22, 2, 8, 12, 29, 19, 14, 1, 6, 7, 21, 27)

                 (Memory: 6301048 GoRoutines: 4)                

DEBU [13:22:19]  Successfully initialized new client

                 (Memory: 6495016 GoRoutines: 11)

INFO [13:22:19]  Kafka initialized. Hosts: pkc-pgq85.us-west-2.aws.confluent.cloud:9092, Topic: cs-tool

                 (Memory: 6504560 GoRoutines: 11)

INFO [13:22:21]  Avro initialized! SchemaRegistry: https://psrc-1wydj.us-east-2.aws.confluent.cloud SchemaVersion: 1, Subject: autorenewal_update

                 (Memory: 6778208 GoRoutines: 12)

INFO [13:22:21]  Avro initialized! SchemaRegistry: https://psrc-1wydj.us-east-2.aws.confluent.cloud SchemaVersion: 1, Subject: cancel

                 (Memory: 5889800 GoRoutines: 12)

INFO [13:22:21]  Avro initialized! SchemaRegistry: https://psrc-1wydj.us-east-2.aws.confluent.cloud SchemaVersion: 1, Subject: delete_payment

                 (Memory: 5949856 GoRoutines: 12)

INFO [13:22:22]  Avro initialized! SchemaRegistry: https://psrc-1wydj.us-east-2.aws.confluent.cloud SchemaVersion: 1, Subject: download_link

                 (Memory: 6014088 GoRoutines: 12)

INFO [13:22:22]  Avro initialized! SchemaRegistry: https://psrc-1wydj.us-east-2.aws.confluent.cloud SchemaVersion: 1, Subject: extend_expiry

                 (Memory: 6070136 GoRoutines: 12)

INFO [13:22:22]  Avro initialized! SchemaRegistry: https://psrc-1wydj.us-east-2.aws.confluent.cloud SchemaVersion: 1, Subject: refund

                 (Memory: 6147776 GoRoutines: 12)

INFO [13:22:23]  Avro initialized! SchemaRegistry: https://psrc-1wydj.us-east-2.aws.confluent.cloud SchemaVersion: 1, Subject: move_subscription

                 (Memory: 6193168 GoRoutines: 12)

INFO [13:22:23]  Avro initialized! SchemaRegistry: https://psrc-1wydj.us-east-2.aws.confluent.cloud SchemaVersion: 1, Subject: purchase_link

                 (Memory: 6254184 GoRoutines: 12)

INFO [13:22:23]  Avro initialized! SchemaRegistry: https://psrc-1wydj.us-east-2.aws.confluent.cloud SchemaVersion: 1, Subject: register_user

                 (Memory: 6295648 GoRoutines: 12)

INFO [13:22:23]  Avro initialized! SchemaRegistry: https://psrc-1wydj.us-east-2.aws.confluent.cloud SchemaVersion: 1, Subject: renew_cof

                 (Memory: 6367656 GoRoutines: 12)

INFO [13:22:24]  Avro initialized! SchemaRegistry: https://psrc-1wydj.us-east-2.aws.confluent.cloud SchemaVersion: 1, Subject: reset_password

                 (Memory: 6406032 GoRoutines: 12)

INFO [13:22:24]  Avro initialized! SchemaRegistry: https://psrc-1wydj.us-east-2.aws.confluent.cloud SchemaVersion: 1, Subject: Search

                 (Memory: 6444208 GoRoutines: 12)

INFO [13:22:25]  Avro initialized! SchemaRegistry: https://psrc-1wydj.us-east-2.aws.confluent.cloud SchemaVersion: 1, Subject: update_device

                 (Memory: 6492472 GoRoutines: 12)

INFO [13:22:25]  Avro initialized! SchemaRegistry: https://psrc-1wydj.us-east-2.aws.confluent.cloud SchemaVersion: 1, Subject: update_profile

                 (Memory: 6538048 GoRoutines: 12)

INFO [13:22:25]  Starting metrics server at :2121

                 (Memory: 6680536 GoRoutines: 12)

INFO [13:22:25]  POST /api/events/publish POST /api/v0/events/publish POST /api/v1/events/publish GET /.well-known/health-check HEAD /.well-known/health-check GET /.well-known/heartbeat HEAD /.well-known/heartbeat GET /.well-known/openapi.json HEAD /.well-known/openapi.json GET /.well-known/swagger HEAD /.well-known/swagger GET /.well-known/swagger/{name} HEAD /.well-known/swagger/{name}

                 (Memory: 6687880 GoRoutines: 13)

INFO [13:22:25]  starting http server at :8000

                 (Memory: 6691664 GoRoutines: 14)

DEBU [13:23:11]  producer/broker/19 starting up

                 (Memory: 7116720 GoRoutines: 26)

DEBU [13:23:11]  producer/broker/19 state change to [open] on cs-tool/1

                 (Memory: 7118872 GoRoutines: 27)

DEBU [13:23:12]  Successful SASL handshake. Available mechanisms:  [PLAIN OAUTHBEARER]

                 (Memory: 7250336 GoRoutines: 25)

DEBU [13:23:13]  SASL authentication successful with broker b19-pkc-pgq85.us-west-2.aws.confluent.cloud:9092:4 - [0 0 0 0]

                 (Memory: 7252376 GoRoutines: 25)

DEBU [13:23:13]  Connected to broker at b19-pkc-pgq85.us-west-2.aws.confluent.cloud:9092 (registered as #19)

                 (Memory: 7254288 GoRoutines: 25)

INFO [13:23:13]  POST /api/v1/events/publish - 2298.72ms (StatusCode: 201)

  CorrelationId: f8ae242e1f9311eda62036159d146dfd (Memory: 7268856 GoRoutines: 25)

INFO [13:23:29]  POST /api/v1/events/publish - 288.26ms (StatusCode: 201)

  CorrelationId: 039b42cc1f9411eda62036159d146dfd (Memory: 7312288 GoRoutines: 25)  

DEBU [14:50:34]  producer/broker/19 state change to [closing] because write tcp 172.0.74.2:49324->44.225.47.238:9092: write: broken pipe

                 (Memory: 7117096 GoRoutines: 34)

DEBU [14:50:34]  Error while closing connection to broker b19-pkc-pgq85.us-west-2.aws.confluent.cloud:9092: tls: failed to send closeNotify alert (but connection was closed anyway): write tcp 172.0.74.2:49324->44.225.47.238:9092: write: broken pipe 

                 (Memory: 7119840 GoRoutines: 33)

DEBU [14:50:34]  producer/leader/cs-tool/1 abandoning broker 19

                 (Memory: 7125448 GoRoutines: 33)

DEBU [14:50:34]  producer/broker/19 input chan closed 

                 (Memory: 7128680 GoRoutines: 33)

DEBU [14:50:34]  producer/broker/19 shut down

                 (Memory: 7130312 GoRoutines: 32)

DEBU [14:50:34]  ClientID is the default of 'sarama', you should consider setting it to something application-specific.

                 (Memory: 7132120 GoRoutines: 31)

DEBU [14:50:34]  client/metadata fetching metadata for [cs-tool] from broker b27-pkc-pgq85.us-west-2.aws.confluent.cloud:9092

                 (Memory: 7135168 GoRoutines: 33)

DEBU [14:50:36]  producer/broker/19 starting up

                 (Memory: 7311248 GoRoutines: 37)

DEBU [14:50:36]  producer/broker/19 state change to [open] on cs-tool/1

                 (Memory: 7312872 GoRoutines: 37)

DEBU [14:50:36]  producer/leader/cs-tool/1 selected broker 19

                 (Memory: 7315264 GoRoutines: 37)

DEBU [14:50:38]  Successful SASL handshake. Available mechanisms:  [PLAIN OAUTHBEARER]

                 (Memory: 7444304 GoRoutines: 35)

DEBU [14:50:38]  SASL authentication successful with broker b19-pkc-pgq85.us-west-2.aws.confluent.cloud:9092:4 - [0 0 0 0]

                 (Memory: 7446344 GoRoutines: 35)

DEBU [14:50:38]  Connected to broker at b19-pkc-pgq85.us-west-2.aws.confluent.cloud:9092 (registered as #19)

                 (Memory: 7448256 GoRoutines: 35)

INFO [14:50:39]  POST /api/v1/events/publish - 4401.56ms (StatusCode: 201)

  CorrelationId: 2e288b1a1fa011eda62036159d146dfd (Memory: 7464328 GoRoutines: 35)
Problem Description

In our logs we can see that when there is a request to publish to a topic, a broker connection is first established, which adds latency to our service. After that there were two consecutive requests publishing to the same topic: the first had no publish latency, but after it completed the broker connection was shut down, so the second request had to re-establish that connection, which again took a long time.

Is there any configuration on the publisher side that could be optimized? We don't want to close the broker connection that early.

For each partition a leader broker is elected, and when a partition leader shuts down for any reason (e.g. a broker shuts down), one of its in-sync follower replicas should become the new leader. But in our case, whenever the leader connection shuts down the producer tries to reconnect to that same leader broker instead of a follower broker becoming the leader, which I believe increases the latency of publishing to the topic.

Any help would be appreciated.

iamnaman863 avatar Aug 23 '22 08:08 iamnaman863

@slaunay, can you help with the above-mentioned issue? Also, is there any configuration in Sarama equivalent to the JVM producer's connections.max.idle.ms?

iamnaman863 avatar Aug 25 '22 10:08 iamnaman863

@slaunay, kindly help

iamnaman863 avatar Sep 05 '22 06:09 iamnaman863

Facing similar issue with v1.36.0

dpaks avatar Oct 07 '22 06:10 dpaks

I got similar errors ("write connect timeout") when no messages were sent and the connection sat idle for ten minutes; then the connection was broken and closed by the remote Kafka broker.

yyt030 avatar May 16 '23 00:05 yyt030

Thank you for taking the time to raise this issue. However, it has not had any activity on it in the past 90 days and will be closed in 30 days if no updates occur. Please check if the main branch has already resolved the issue since it was raised. If you believe the issue is still valid and you would like input from the maintainers then please comment to ask for it to be reviewed.

github-actions[bot] avatar Aug 17 '23 04:08 github-actions[bot]

@iamnaman863 Sarama itself won't close any idle connections, but it sounds like you have an intermediate proxy or load balancer between your client(s) and the Kafka brokers which is terminating TCP connections without a clean shutdown.

You could experiment with enabling TCP keepalives using sarama.Config's option https://github.com/IBM/sarama/blob/f93b744cd2b17038f5ce8395cdb183e6683ffbea/config.go#L111-L114, which should prevent your producer's TCP connections from being closed when idle.
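For example, a minimal config fragment (Net.KeepAlive is a time.Duration on sarama.Config; the 30-second interval here is an arbitrary illustrative value, not a recommendation from this thread):

```go
config := sarama.NewConfig()
// Enable OS-level TCP keepalives on all broker connections.
// The default of 0 leaves keepalives disabled; any positive duration
// enables probes at that interval, so an intermediate proxy or load
// balancer no longer sees the connection as idle.
config.Net.KeepAlive = 30 * time.Second
```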

dnwe avatar Aug 17 '23 09:08 dnwe

Thank you for taking the time to raise this issue. However, it has not had any activity on it in the past 90 days and will be closed in 30 days if no updates occur. Please check if the main branch has already resolved the issue since it was raised. If you believe the issue is still valid and you would like input from the maintainers then please comment to ask for it to be reviewed.

github-actions[bot] avatar Nov 15 '23 10:11 github-actions[bot]