amazon-kinesis-client
Race condition in KCL v2 graceful shutdown
When a graceful shutdown is requested via a Scheduler.startGracefulShutdown call, it often happens that all active leases are removed from Scheduler.shardInfoShardConsumerMap before the record processors have finished shutting down and GracefulShutdownContext.shutdownCompleteLatch has counted down to 0. This causes a problem in GracefulShutdownCallable.waitForRecordProcessors:
while (!context.shutdownCompleteLatch().await(1, TimeUnit.SECONDS)) {
    if (Thread.interrupted()) {
        throw new InterruptedException();
    }
    log.info(awaitingFinalShutdownMessage(context));
    if (workerShutdownWithRemaining(context.shutdownCompleteLatch().getCount(), context)) {
        return false;
    }
}
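For reference, GracefulShutdownCallable.workerShutdownWithRemaining treats the worker as already shut down when Scheduler.shutdownComplete is true or Scheduler.shardInfoShardConsumerMap is empty. A rough paraphrase of that check (simplified; the exact code varies between KCL versions):

private boolean workerShutdownWithRemaining(long outstanding, GracefulShutdownContext context) {
    // Simplified paraphrase: the worker "looks" shut down as soon as the consumer map is empty,
    // even if record processors still counted in the latch have not finished yet.
    boolean workerLooksShutDown = context.scheduler().shutdownComplete()
            || context.scheduler().shardInfoShardConsumerMap().isEmpty();
    return workerLooksShutDown && outstanding != 0;
}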
Under normal conditions the shutdown complete latch should eventually count down to 0 and the future returned by Scheduler.startGracefulShutdown should yield true. Because of the race condition, the latch still holds a non-zero value while Scheduler.shardInfoShardConsumerMap is already empty, so GracefulShutdownCallable.workerShutdownWithRemaining returns true even though the Scheduler has not finished its shutdown process. As a result, the future returned by Scheduler.startGracefulShutdown yields false. As a workaround, to get notified about shutdown completion it is necessary to poll Scheduler.shutdownComplete in a loop until it returns true.
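A minimal sketch of that workaround, assuming scheduler is the running software.amazon.kinesis.coordinator.Scheduler instance; the helper method name, timeout, and poll interval are illustrative, not recommendations:

import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import software.amazon.kinesis.coordinator.Scheduler;

// Hypothetical helper, not part of the library: requests a graceful shutdown and,
// if the future reports failure (possibly spuriously, because of the race above),
// polls Scheduler.shutdownComplete until the Scheduler has actually finished.
static void shutdownGracefully(Scheduler scheduler) throws InterruptedException {
    Future<Boolean> shutdownFuture = scheduler.startGracefulShutdown();
    boolean reportedSuccess = false;
    try {
        reportedSuccess = shutdownFuture.get(60, TimeUnit.SECONDS);
    } catch (ExecutionException | TimeoutException e) {
        // Fall through to polling below.
    }
    while (!reportedSuccess && !scheduler.shutdownComplete()) {
        TimeUnit.SECONDS.sleep(1);
    }
}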
Is it possible that this causes a ShardRecordProcessor not to finish shutdown but to continue executing ProcessTasks?
Not sure; at least we haven't observed it. Based on my own analysis of the code, it should not happen.
I think some shard consumer is stuck in a failure loop and holding up the graceful shutdown. See #616 for ideas to debug this further.
I believe this is related: https://github.com/awslabs/amazon-kinesis-client/pull/1302