amazon-kinesis-client
Graceful shutdown of a MultiLangDaemon worker that is assigned completed shards always times out
I found that a graceful shutdown of a MultiLangDaemon worker that is assigned completed shards always times out.
2024-01-31 00:43:51,722 [Thread-1] ERROR s.a.k.multilang.MultiLangDaemon [NONE] - Encountered an error during shutdown.
java.util.concurrent.TimeoutException: null
at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1892)
at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2027)
at software.amazon.kinesis.multilang.MultiLangDaemon.lambda$setupShutdownHook$0(MultiLangDaemon.java:183)
at java.base/java.lang.Thread.run(Thread.java:829)
I read the code and found that during the shutdown process, two CountDownLatch variables are used to wait until all consumers have finished shutting down, and both are initialized to the number of leases held by the worker. https://github.com/awslabs/amazon-kinesis-client/blob/v2.5.3/amazon-kinesis-client/src/main/java/software/amazon/kinesis/coordinator/Scheduler.java#L780-L781 https://github.com/awslabs/amazon-kinesis-client/blob/v2.5.2/amazon-kinesis-client/src/main/java/software/amazon/kinesis/coordinator/GracefulShutdownCoordinator.java#L65-L126
If the worker holds a lease for a completed shard (its checkpoint is SHARD_END), the number of leases is greater than the number of actual consumer subprocesses, so the two latches never reach 0 and the shutdown always times out.
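To illustrate the mismatch (this is a minimal standalone sketch, not the actual KCL code, and the lease/consumer counts are hypothetical): a latch sized to the number of leases, but counted down only by the consumers that actually exist, can never reach zero before the await timeout.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Illustration of the described behavior: one lease is at SHARD_END, so no
// consumer subprocess exists for it, and the shutdown latch never reaches 0.
public class LatchMismatchDemo {
    public static void main(String[] args) throws InterruptedException {
        int leasesHeld = 4;       // hypothetical: includes one SHARD_END lease
        int activeConsumers = 3;  // hypothetical: no consumer for the completed shard

        CountDownLatch shutdownCompleteLatch = new CountDownLatch(leasesHeld);

        // Each running consumer finishes its shutdown and counts down once.
        for (int i = 0; i < activeConsumers; i++) {
            final int consumerId = i;
            new Thread(() -> {
                System.out.println("consumer " + consumerId + " shut down");
                shutdownCompleteLatch.countDown();
            }).start();
        }

        // The coordinator waits on the latch, but the count stays at
        // leasesHeld - activeConsumers = 1, so the wait times out.
        boolean completed = shutdownCompleteLatch.await(5, TimeUnit.SECONDS);
        System.out.println("graceful shutdown completed: " + completed); // false
    }
}
```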
Is this expected behavior?
Environment
- Python 3.11
- amazon-kclpy 2.1.3 (Kinesis Client Library 2.5.2)
Steps to reproduce
To demonstrate this issue, I created a demo application with localstack: https://github.com/nao23/kcl-shutdown-timeout-demo
- Start the application
$ docker compose build
$ docker compose up app
- Shut down the MultiLangDaemon process
$ docker compose exec app kill 1
Tried to download your repo but was unable to use Docker to run the app. Will see what I can do to reproduce this without Docker. Feel free to submit a PR if you find a solution, as we are open to contributions.
@brendan-p-lynch Thank you for checking. I submitted a PR https://github.com/awslabs/amazon-kinesis-client/pull/1276
@nao23 thanks for providing the demo application. I was able to reproduce the issue using your code as well as with my own separate test suite.
When a worker has a lease for a completed shard, the lease is eventually cleaned up by the LeaseCleanupManager. By default, the LeaseCleanupManager checks every 5 minutes whether there are completed shard leases that can be removed. A completed shard lease is only removed if all of its children are being processed and its parent shard lease(s) have been deleted (ref). In your demo code, make sure that the checkpoint for each child shard has been updated (i.e. the checkpoint is not at TRIM_HORIZON) before starting the graceful shutdown. Also, ensure that the LeaseCleanupManager has enough time to run during the graceful shutdown. When testing, I set shutdownGraceMillis to 600000 (10 minutes).
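For reference, this is roughly the relevant line in the properties file passed to the MultiLangDaemon (a minimal sketch, assuming shutdownGraceMillis can be set directly in the standard amazon-kclpy .properties format; all other required properties such as executableName, streamName, and applicationName are omitted here):

```properties
# Give consumers and the LeaseCleanupManager time to finish during graceful shutdown.
# 600000 ms = 10 minutes, the value used while testing this issue.
shutdownGraceMillis = 600000
```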
See this doc for the default intervals at which the lease cleanup thread is run.