amazon-kinesis-client

Race condition in KCL v2 graceful shutdown

Open vtlkvl opened this issue 5 years ago • 4 comments

When graceful shutdown is requested via Scheduler.startGracefulShutdown, all active leases are often removed from Scheduler.shardInfoShardConsumerMap before the record processors have finished shutting down, that is, before GracefulShutdownContext.shutdownCompleteLatch reaches 0. This causes a problem in GracefulShutdownCallable.waitForRecordProcessors:

while (!context.shutdownCompleteLatch().await(1, TimeUnit.SECONDS)) {
    if (Thread.interrupted()) {
        throw new InterruptedException();
    }
    log.info(awaitingFinalShutdownMessage(context));
    if (workerShutdownWithRemaining(context.shutdownCompleteLatch().getCount(), context)) {
        return false;
    }
}

Under normal conditions the shutdown-complete latch should eventually count down to 0, and the future returned by Scheduler.startGracefulShutdown should yield true. Because of the race condition, the latch still holds a non-zero value when GracefulShutdownCallable.workerShutdownWithRemaining returns true: Scheduler.shardInfoShardConsumerMap is already empty at that point, even though the Scheduler has not finished its shutdown process. As a result, the future returned by Scheduler.startGracefulShutdown yields false. As a workaround, to be notified of shutdown completion you have to check Scheduler.shutdownComplete in a loop until it returns true.
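The polling workaround could look roughly like the sketch below. It is only an illustration, not code from KCL: ShutdownWaiter and awaitShutdownComplete are hypothetical names, and the flag is passed in as a plain BooleanSupplier so the helper does not depend on any particular KCL accessor. The timeout is an addition for safety, so that a stuck shutdown does not block the caller forever.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical helper (not part of KCL): polls a "shutdown complete" flag.
final class ShutdownWaiter {

    private ShutdownWaiter() {
    }

    /**
     * Polls the supplied flag until it reports true or the timeout elapses.
     *
     * @return true if the flag turned true in time, false if the timeout expired first
     */
    static boolean awaitShutdownComplete(BooleanSupplier shutdownComplete,
                                         long timeout,
                                         TimeUnit unit,
                                         long pollIntervalMillis) throws InterruptedException {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        while (!shutdownComplete.getAsBoolean()) {
            // Compare via subtraction so the check stays correct if nanoTime wraps around.
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            // Thread.sleep throws InterruptedException, so interruption is propagated to the caller.
            Thread.sleep(pollIntervalMillis);
        }
        return true;
    }
}
```

Assuming the flag is exposed as scheduler.shutdownComplete() (the accessor name is taken from the description above and may differ between versions), a call after startGracefulShutdown might be ShutdownWaiter.awaitShutdownComplete(scheduler::shutdownComplete, 60, TimeUnit.SECONDS, 500), with the result used in place of the future's value.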

vtlkvl, Apr 09 '19 11:04

Is it possible that this causes a shardRecordProcessor to not finish shutdown and instead continue executing ProcessTasks?

BobbyJohansen, Apr 11 '19 18:04

Not sure; at least we haven't observed it. Based on my own analysis of the code, it should not happen.

vtlkvl, Apr 12 '19 12:04

I think some shard consumer is stuck in a failure loop and holding up the graceful shutdown. See #616 for ideas to debug this further.

aggarwal, Sep 27 '19 05:09

I believe this is related: https://github.com/awslabs/amazon-kinesis-client/pull/1302

gabrielfmagalhaes, Apr 28 '24 06:04