amazon-kinesis-client
Graceful shutdown of a MultiLangDaemon worker that is assigned completed shards always times out
I found that a graceful shutdown of a MultiLangDaemon worker that is assigned completed shards always times out.
2024-01-31 00:43:51,722 [Thread-1] ERROR s.a.k.multilang.MultiLangDaemon [NONE] - Encountered an error during shutdown.
java.util.concurrent.TimeoutException: null
at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1892)
at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2027)
at software.amazon.kinesis.multilang.MultiLangDaemon.lambda$setupShutdownHook$0(MultiLangDaemon.java:183)
at java.base/java.lang.Thread.run(Thread.java:829)
I read the code and found that during the shutdown process, two CountDownLatch variables are used to wait until all consumers have finished shutting down, and both are initialized to the number of leases held by the worker. https://github.com/awslabs/amazon-kinesis-client/blob/v2.5.3/amazon-kinesis-client/src/main/java/software/amazon/kinesis/coordinator/Scheduler.java#L780-L781 https://github.com/awslabs/amazon-kinesis-client/blob/v2.5.2/amazon-kinesis-client/src/main/java/software/amazon/kinesis/coordinator/GracefulShutdownCoordinator.java#L65-L126
If the worker holds a lease for a completed shard (its checkpoint is SHARD_END), the number of leases is greater than the number of actual consumer subprocesses, so the two latches never reach 0 and the shutdown always times out.
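To illustrate the mismatch (this is a minimal standalone sketch, not the actual KCL code, and the lease/consumer counts are hypothetical): a latch sized to the number of leases, but counted down only by the consumers that actually exist, can never reach zero before the await timeout.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Illustration of the described behavior: one lease is at SHARD_END, so no
// consumer subprocess exists for it, and the shutdown latch never reaches 0.
public class LatchMismatchDemo {
    public static void main(String[] args) throws InterruptedException {
        int leasesHeld = 4;       // hypothetical: includes one SHARD_END lease
        int activeConsumers = 3;  // hypothetical: no consumer for the completed shard

        CountDownLatch shutdownCompleteLatch = new CountDownLatch(leasesHeld);

        // Each running consumer finishes its shutdown and counts down once.
        for (int i = 0; i < activeConsumers; i++) {
            final int consumerId = i;
            new Thread(() -> {
                System.out.println("consumer " + consumerId + " shut down");
                shutdownCompleteLatch.countDown();
            }).start();
        }

        // The coordinator waits on the latch, but the count stays at
        // leasesHeld - activeConsumers = 1, so the wait times out.
        boolean completed = shutdownCompleteLatch.await(5, TimeUnit.SECONDS);
        System.out.println("graceful shutdown completed: " + completed); // false
    }
}
```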
Is this expected behavior?
Environment
- Python 3.11
- amazon-kclpy 2.1.3 (Kinesis Client Library 2.5.2)
Steps to reproduce
To demonstrate this issue, I created a demo application with localstack: https://github.com/nao23/kcl-shutdown-timeout-demo
- Start the application
$ docker compose build
$ docker compose up app
- Shut down the MultiLangDaemon process
$ docker compose exec app kill 1
Tried to download your repo but was unable to use Docker to run the app. Will see what I can do to reproduce this without Docker. Feel free to submit a PR if you find a solution, as we are open to contributions.
@brendan-p-lynch Thank you for checking. I submitted a PR https://github.com/awslabs/amazon-kinesis-client/pull/1276
@nao23 thanks for providing the demo application. I was able to reproduce the issue using your code as well as with my own separate test suite.
When a worker has a lease for a completed shard, the lease is eventually cleaned up by the LeaseCleanupManager. By default, the LeaseCleanupManager checks every 5 minutes whether there are completed shard leases that can be removed. A completed shard lease is only removed if all of its children are being processed and its parent shard lease(s) have been deleted (ref). In your demo code, make sure that the checkpoint for each child shard has been updated (i.e. the checkpoint is not at TRIM_HORIZON) before starting the graceful shutdown. Also, ensure that the LeaseCleanupManager has enough time to run during the graceful shutdown. When testing, I set shutdownGraceMillis to 600000 (10 minutes).
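For reference, this is roughly the relevant line in the properties file passed to the MultiLangDaemon (a minimal sketch, assuming shutdownGraceMillis can be set directly in the standard amazon-kclpy .properties format; all other required properties such as executableName, streamName, and applicationName are omitted here):

```properties
# Give consumers and the LeaseCleanupManager time to finish during graceful shutdown.
# 600000 ms = 10 minutes, the value used while testing this issue.
shutdownGraceMillis = 600000
```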
See this doc for the default intervals at which the lease cleanup thread is run.