amazon-kinesis-client

KCL 2.3.0 - Intermittent shard connectivity loss with recurring error "Last request was dispatched at ... Cancelling subscription, and restarting"

Open priyath opened this issue 3 years ago • 1 comment

Setup: KCL 2.3.0 (enhanced fan-out DISABLED)

We have a Kinesis stream with 24 shards and a KCL 2.3.0-based consumer, running inside a VPC, that consumes records from the stream. Recently, we have noticed an issue where the consumer loses its connection to the stream due to connection timeouts. The following errors were observed in the logs:

software.amazon.awssdk.core.exception.SdkClientException: Unable to execute HTTP request: connection timed out: kinesis.us-east-1.amazonaws.com

Usually the consumer recovers from these errors after a brief period and resumes consuming data, so that is not a major problem. But in certain cases the recovery is only partial: records are NOT processed from some shards. The only resolution when this happens is an application restart. Until the restart is performed, the following recurring error can also be observed in the logs for the failed shards:

2021-05-24 20:10:13,679 [Thread-2] ERROR s.a.k.l.ShardConsumerSubscriber - shardId-000000000050: Last request was dispatched at 2021-05-24T20:09:13.668571Z, but no response as of 2021-05-24T20:10:13.679148Z (PT1M0.010577S).  Cancelling subscription, and restarting. Last successful request details -- request id - NONE, timestamp - NONE
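
As a quick sanity check on the timing in that log line, the reported interval can be reproduced with `java.time`. This is only a minimal sketch: the class name `DispatchGap` and the method `gap` are made up for illustration and are not part of KCL, and the two timestamps are copied from the log line above.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper for reading the log line; not part of KCL.
class DispatchGap {
    // Returns the time elapsed between two ISO-8601 UTC timestamps.
    static Duration gap(String dispatchedAt, String checkedAt) {
        return Duration.between(Instant.parse(dispatchedAt), Instant.parse(checkedAt));
    }

    public static void main(String[] args) {
        // Timestamps taken verbatim from the ShardConsumerSubscriber error.
        Duration d = gap("2021-05-24T20:09:13.668571Z", "2021-05-24T20:10:13.679148Z");
        System.out.println(d);            // prints PT1M0.010577S
        System.out.println(d.toMillis()); // prints 60010
    }
}
```

The printed duration matches the value in parentheses in the log, so the subscriber had been waiting slightly over one minute for a response when it cancelled and restarted the subscription.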

Any tips on how to debug this issue? It can potentially result in data loss for certain shards due to the partial recovery.

priyath, Jun 13 '21 02:06

We are facing the same issue. Restarting a service in production does not make sense for highly critical enterprise data. @priyath, please share if you find a way out of this.

fsiddiqui-mdsol, Sep 29 '21 21:09