confluent-kafka-dotnet

MSK connectivity issue during AWS Security Patch Updates

Open • DevOnRun opened this issue 1 year ago • 6 comments

Description

Facing issues while consuming events from Kafka on AWS MSK during security patch updates.

How to reproduce

  1. Launch a consumer application using AWS MSK as the Kafka infrastructure.
  2. Wait for MSK to roll out updates or apply a security patch automatically, or apply one manually if possible.

Additional Details

On further observation while debugging, Error.Code is returned as Local_Transport.

Checklist

  • Program: Basic consumer application (regularly consumes events)
  • Confluent.Kafka nuget version: 2.2.0
  • Apache Kafka version: 2.8.1
  • Client configuration: EnableAutoCommit = false; EnableAutoOffsetStore = false;
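For reference, a minimal sketch of how the client configuration listed above might be expressed with Confluent.Kafka. Only the two offset settings come from this report. The class name, the `bootstrapServers` and `groupId` parameters, and the `SecurityProtocol.Ssl` choice (inferred from the `ssl://` broker URLs on port 9094 in the logs) are assumptions for illustration.

```csharp
using Confluent.Kafka;

// Hypothetical helper; not part of the original application.
public static class MskConsumerSettings
{
    public static ConsumerConfig Create(string bootstrapServers, string groupId) =>
        new ConsumerConfig
        {
            BootstrapServers = bootstrapServers,      // placeholder, supplied by the caller
            GroupId = groupId,                        // placeholder, supplied by the caller
            SecurityProtocol = SecurityProtocol.Ssl,  // assumption based on the ssl:// log lines
            EnableAutoCommit = false,                 // from the report: commits are manual
            EnableAutoOffsetStore = false             // from the report: offsets are stored manually
        };
}
```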

Info Logs:

ssl://b-1.devmsk.<unique-id-1>.c17.kafka.ap-south-1.amazonaws.com:9094/1: Connect to ipv4#<ip-address-1>:9094 failed: Connection refused (after 0ms in state CONNECT, 1 identical error(s) suppressed)
[thrd:ssl://b-1.devmsk.<unique-id-1>.c17.kafka.ap-south-1.amazonaw]: ssl://b-1.devmsk.<unique-id-1>.c17.kafka.ap-south-1.amazonaws.com:9094/1: Connect to ipv4#<ip-address-1>:9094 failed: Connection refused (after 0ms in state CONNECT, 1 identical error(s) suppressed)
2/2 brokers are down
ssl://b-1.devmsk.<unique-id-1>.c17.kafka.ap-south-1.amazonaws.com:9094/1: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after -1616061376ms in state UP)
GroupCoordinator: b-2.devmsk.<unique-id-1>.c17.kafka.ap-south-1.amazonaws.com:9094: Connect to ipv4#<ip-address-3>:9094 failed: Connection refused (after 1ms in state CONNECT, 1 identical error(s) suppressed)
[thrd:GroupCoordinator]: GroupCoordinator: b-2.devmsk.<unique-id-1>.c17.kafka.ap-south-1.amazonaws.com:9094: Connect to ipv4#<ip-address-3>:9094 failed: Connection refused (after 1ms in state CONNECT, 1 identical error(s) suppressed)
ssl://b-2.devmsk.<unique-id-1>.c17.kafka.ap-south-1.amazonaws.com:9094/2: Connect to ipv4#<ip-address-3>:9094 failed: Connection refused (after 1ms in state CONNECT, 1 identical error(s) suppressed)
[thrd:ssl://b-2.devmsk.<unique-id-1>.c17.kafka.ap-south-1.amazonaw]: ssl://b-2.devmsk.<unique-id-1>.c17.kafka.ap-south-1.amazonaws.com:9094/2: Connect to ipv4#<ip-address-3>:9094 failed: Connection refused (after 1ms in state CONNECT, 1 identical error(s) suppressed)

Please provide the following information:

  • [x] #2191
  • [x] Confluent.Kafka nuget version.
  • [x] Apache Kafka version.
  • [x] Client configuration.
  • [ ] Operating system.
  • [x] Provide logs (with "debug" : "..." as necessary in configuration).
  • [ ] Provide broker log excerpts.
  • [ ] Critical issue.

DevOnRun • Jan 03 '24 09:01

Similar unanswered Issues:

  • https://github.com/confluentinc/librdkafka/issues/3569
  • https://github.com/confluentinc/librdkafka/discussions/4188

DevOnRun • Jan 13 '24 06:01

Any updates on this? I am facing the same issue during security patching of an MSK cluster.

prashantalhat • Jan 24 '24 05:01

Was the broker reachable? Can you provide more logs?

anchitj • Feb 13 '24 13:02

After enhancing the logs by printing the remaining properties in the SetLogHandler() and SetErrorHandler() implementations, I found:

  • For SetErrorHandler(), each "Error" object has 'Code' as Local_Transport or Local_AllBrokersDown, plus a Reason (as shared above).
  • For SetLogHandler(), each "LogMessage" object has 'Name' as rdkafka#consumer-1 or rdkafka#producer-1, 'Facility' as FAIL, and a Message (as shared above).

NOTE: All of the above logs are produced at log level Info, Warning, or Error. Nothing else is produced even after enabling the Debug log level.
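A minimal sketch of handlers that print every property mentioned above, assuming a `ConsumerConfig` built elsewhere. The `DiagnosticHandlers` class, the `Describe` helpers, and the `Ignore`/`string` key and value types are hypothetical and not taken from the original application. `Error.Code`, `Error.Reason`, `Error.IsFatal`, and the `LogMessage` properties are part of Confluent.Kafka.

```csharp
using System;
using Confluent.Kafka;

// Hypothetical helper; names are illustrative only.
public static class DiagnosticHandlers
{
    // Flattens an Error into one line: code, fatal flag, and reason.
    public static string Describe(Error e) =>
        $"code={e.Code} fatal={e.IsFatal} reason={e.Reason}";

    // Flattens a LogMessage into one line: level, client name, facility, and text.
    public static string Describe(LogMessage m) =>
        $"level={m.Level} name={m.Name} facility={m.Facility} message={m.Message}";

    // Builds a consumer that writes both kinds of callback to stderr.
    public static IConsumer<Ignore, string> Build(ConsumerConfig config) =>
        new ConsumerBuilder<Ignore, string>(config)
            .SetErrorHandler((_, e) => Console.Error.WriteLine(Describe(e)))
            .SetLogHandler((_, m) => Console.Error.WriteLine(Describe(m)))
            .Build();
}
```

Printing `IsFatal` alongside the code may help distinguish errors the client is expected to recover from on its own from ones that need intervention.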

DevOnRun • Feb 20 '24 04:02

@anchitj I am getting the same error after MSK patching activity. After this, the Kafka client (consumer code) is not able to connect again, and the only option is to restart the pods (service). Kindly let me know if there is any way to configure the consumer code to re-initiate the connection.

%4|1710193417.943|FAIL|rdkafka#consumer-9886| [thrd:sasl_ssl://brokeraddress.amazonaws.com]: sasl_ssl://b-3.msk-wbrokeraddress.amazonaws.com:9096/3: Connection setup timed out in state APIVERSION_QUERY (after 29924ms in state APIVERSION_QUERY, 1 identical error(s) suppressed)
%4|1710193447.946|FAIL|rdkafka#consumer-9886| [thrd:sasl_ssl://b-2.brokeraddress.amazonaws.com]: sasl_ssl://b-2.msk-wbroker.amazonaws.com:9096/2: Connection setup timed out in state APIVERSION_QUERY (after 29912ms in state APIVERSION_QUERY, 1 identical error(s) suppressed)

neerajk22 • Mar 15 '24 03:03

Client should keep retrying on its own and this error should be transient. Please try to reproduce once again with Debug="all" and upload the logs here.
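As a rough sketch of that setup: a consume loop with Debug set to "all" that logs non-fatal errors and keeps polling, leaving reconnection to the client. The `TransientAwareConsumer` class, the `ShouldStop` helper, the `topic` parameter, and the one-second poll timeout are assumptions for illustration. Whether this fully recovers after an MSK patch is exactly what the requested debug logs would need to confirm.

```csharp
using System;
using System.Threading;
using Confluent.Kafka;

// Hypothetical wrapper; names are illustrative only.
public static class TransientAwareConsumer
{
    // Only errors flagged fatal end the loop; others (for example Local_Transport)
    // are treated as transient and left to the client's own reconnect logic.
    public static bool ShouldStop(Error e) => e.IsFatal;

    public static void Run(ConsumerConfig config, string topic, CancellationToken ct)
    {
        config.Debug = "all"; // note: modifies the caller's config to enable all debug contexts

        using var consumer = new ConsumerBuilder<Ignore, string>(config)
            .SetLogHandler((_, m) => Console.Error.WriteLine($"{m.Level}|{m.Facility}|{m.Name}| {m.Message}"))
            .SetErrorHandler((_, e) => Console.Error.WriteLine($"error {e.Code}: {e.Reason}"))
            .Build();

        consumer.Subscribe(topic);
        try
        {
            while (!ct.IsCancellationRequested)
            {
                try
                {
                    var result = consumer.Consume(TimeSpan.FromSeconds(1));
                    if (result == null) continue;   // poll timed out, nothing to process
                    Console.WriteLine($"{result.TopicPartitionOffset}: {result.Message.Value}");
                    consumer.Commit(result);        // manual commit, since EnableAutoCommit = false
                }
                catch (KafkaException ex) when (!ShouldStop(ex.Error))
                {
                    // Non-fatal: log and keep polling. Fatal errors propagate to the caller.
                    Console.Error.WriteLine($"transient {ex.Error.Code}: {ex.Error.Reason}");
                }
            }
        }
        finally
        {
            consumer.Close(); // leave the group cleanly
        }
    }
}
```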

anchitj • Apr 17 '24 11:04