amazon-kinesis-client
After shard splitting, our log is flooded with warning messages "Cannot find the shard given shardId"
ShardSyncTask runs either on worker initialization or when the worker detects that one of its assigned shards has completed. In the event of a shard split, however, if the child shard lands on a worker that was not previously processing the parent shard, that worker will not run the ShardSyncTask, because none of its previously assigned shards have completed.
Meanwhile, the lease coordinator has timer tasks that sync with the DynamoDB lease table to assign itself shards to process.
So we end up with the worker starting to process the child shard while, at the same time, repeatedly logging the warning from line 208 of KinesisProxy:
LOG.warn("Cannot find the shard given the shardId " + shardId);
As far as I understand, the shard info is needed only during de-aggregation, to discard user records that are supposed to be re-routed to other shards during resharding. So we are not experiencing dropped records or anything severe; it's just the flooding of our log, and maybe some duplicates, since we are using KPL aggregation on the producer side.
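For context, the code path that emits this warning is roughly of the following shape (a simplified sketch with illustrative class and field names, not the actual KinesisProxy source): the shard map is filled at worker startup and only refreshed when one of the worker's own shards ends, so a child shard acquired purely through the lease table never makes it into the cache.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

import com.amazonaws.services.kinesis.model.Shard;

// Simplified illustration, not the real KinesisProxy internals.
class CachedShardLookup {
    private static final Log LOG = LogFactory.getLog(CachedShardLookup.class);

    // Populated on worker initialization; refreshed only when one of this
    // worker's own shards reaches its end (i.e. after a ShardSyncTask run).
    private final Map<String, Shard> cachedShardMap = new ConcurrentHashMap<>();

    Shard getShard(String shardId) {
        Shard shard = cachedShardMap.get(shardId);
        if (shard == null) {
            // A child shard picked up via the lease table after a split is
            // not in this cache, so every lookup for it lands here.
            LOG.warn("Cannot find the shard given the shardId " + shardId);
        }
        return shard;
    }
}
```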
I have the same warning. Does anyone know how to fix it?
We are facing a similar issue. Any advice on a solution would be appreciated.
@xujiaxj just curious if you thought of anything since the bug filing?
@amanduggal we modified our logback settings to suppress the warning message:
<logger name="com.amazonaws.services.kinesis.clientlibrary.proxies.KinesisProxy" level="ERROR"/>
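If editing logback.xml isn't convenient, the same suppression can be done programmatically via logback-classic (just a sketch, and it assumes logback is actually your SLF4J backend, otherwise the cast below will fail):

```java
import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;

import org.slf4j.LoggerFactory;

public final class KinesisProxyLogSuppression {

    // Call once early during application startup, before the KCL worker runs.
    public static void suppressShardWarnings() {
        // Raise the KinesisProxy logger to ERROR so the repeated
        // "Cannot find the shard given the shardId" WARN lines are dropped.
        Logger kinesisProxyLogger = (Logger) LoggerFactory.getLogger(
                "com.amazonaws.services.kinesis.clientlibrary.proxies.KinesisProxy");
        kinesisProxyLogger.setLevel(Level.ERROR);
    }

    private KinesisProxyLogSuppression() {}
}
```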
This is especially annoying when using the KCL to read a DynamoDB stream, which splits its shards every 4 hours according to this blog post by one of the DynamoDB engineers at AWS:
Typically, shards in DynamoDB streams close for writes roughly every four hours after they are created and become completely unavailable 24 hours after they are created.
https://blogs.aws.amazon.com/bigdata/post/TxFCI3UJJJYEXJ/Process-Large-DynamoDB-Streams-Using-Multiple-Amazon-Kinesis-Client-Library-KCL
Just letting people know that we are aware of this. We're looking into fixing this, but I don't have an ETA at this time.
I ran into this using DynamoDB Streams without explicit shard splitting occurring (just the usual DynamoDB cycling of the shard as @matthewbogner described). FWIW, here is the sequence we encountered that triggered the warnings. With DynamoDB Streams this occurs pretty often--at any given point in time there's almost always at least one of our servers in this state where it's logging these warnings every 2 seconds. We've had to turn off WARN for KinesisProxy and ProcessTask.
Assume a DynamoDB stream with shard S1 and two stream workers A and B using the KCL (we aren't using the KPL):
- At the start, consumer A owns a lease on shard S1, consumer B is idle because no leases are available.
- At some point, DynamoDB closes shard S1 and creates a child shard S2 whose parent is S1.
- A reaches the end of S1.
- A syncs the shard set with the DynamoDB lease table, creating a new lease for S2.
- A obtains the lease for S2. It hasn't yet cleaned up the lease for S1.
- B wakes up, notices 2 leases in the lease table both owned by A (S1 and S2), and steals the S2 lease from A (code).
- A notices that S2 lease has been lost, becomes idle.
- B begins processing records in S2.
- B logs warnings because it did not execute a code path that would cause it to re-sync its cached list of shards to include S2 (code).
  - KinesisProxy initializes its cached shard list on startup
  - KinesisProxy's cached shard list is refreshed upon reaching the end of a shard
  - KinesisProxy's cached shard list is NOT refreshed on lease steal
- B continues to log warnings until it reaches the end of shard S2.
- ... at which point A may steal the lease for the next child shard (S3) and begin logging warnings itself.
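If I'm reading this right, the fix would be to refresh the cached shard list whenever a lookup misses (or at least on lease steal), rather than only at startup and at end-of-shard. Something along these lines, just a sketch against a hypothetical cache and not the actual KCL code:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

import com.amazonaws.services.kinesis.model.Shard;

// Hypothetical sketch (not actual KCL code): a shard cache that refreshes
// itself on a miss, so a worker that picks up a child shard via a lease
// steal doesn't warn forever against a stale cached shard list.
class RefreshingShardCache {
    private final Map<String, Shard> cache = new ConcurrentHashMap<>();
    private final Supplier<List<Shard>> listShards; // e.g. a wrapper around ListShards/DescribeStream

    RefreshingShardCache(Supplier<List<Shard>> listShards) {
        this.listShards = listShards;
    }

    Shard getShard(String shardId) {
        Shard shard = cache.get(shardId);
        if (shard == null) {
            refresh();                 // re-list shards once on a miss
            shard = cache.get(shardId);
        }
        return shard;                  // still null only if the shard really doesn't exist
    }

    private void refresh() {
        for (Shard s : listShards.get()) {
            cache.put(s.getShardId(), s);
        }
    }
}
```

A real fix would presumably also need to throttle the refresh, so a genuinely unknown shardId doesn't turn into an extra ListShards call on every lookup.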
I've been testing the stack and looking at the sharding, and I've been noticing these errors, although everything continues to appear to work.
Forgive my newness to the technology, but is this something that we should be concerned about?
Unsure why this is labelled an enhancement?
Any updates on this? If I understood @shawnsmith's analysis correctly, the solution is to refresh the cached shard list on lease steal?
From 2016:
Just letting people know that we are aware of this. We're looking into fixing this, but I don't have an ETA at this time.
Is this still the case?
Just copying this over from the linked issue. @pfifer Do you have any updates or insight here?
I think I have the same issue, although we also see non-stop ERROR logs like:
ERROR [2020-02-27 13:02:45,382] [RecordProcessor-2873] c.a.s.k.c.lib.worker.InitializeTask: Caught exception:
com.amazonaws.services.kinesis.clientlibrary.exceptions.internal.KinesisClientLibIOException: Unable to fetch checkpoint for shardId shardId-00000001582460850801-53f6f94b
        at com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibLeaseCoordinator.getCheckpointObject(KinesisClientLibLeaseCoordinator.java:286)
        at com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitializeTask.call(InitializeTask.java:82)
        at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:49)
        at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:24)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
And I found an AWS Dev forum link related to this issue here: https://forums.aws.amazon.com/thread.jspa?messageID=913872
@pfifer any ideas? any updates? any anything?
@pfifer any updates on this?
@aldilaff might want to check #185 too, in case the workerId is also bugging you.
Cannot find the shard given the shardId
chgenvulgfjlejltgvglhecbucrihrcbbclfj
@joshua-kim was that a yubikey press? :-P Otherwise, can you please elaborate on why the issue is being closed and how to solve/prevent it?
@igracia Sorry, yes that was a Yubikey press. I was referencing this issue when looking into another cached shard map issue in a fork of 1.6; I'm curious though, are you still seeing this on the latest 2.x/1.x releases? The latest releases are no longer using ListShards in most cases, so I'm curious to see if this bug is still present.
Thanks @joshua-kim! We have several consumers using the DynamoDB Streams Kinesis adapter on a single shard, and we're still getting this with the following versions:
- dynamodb-streams-kinesis-adapter 1.5.1
- amazon-kinesis-client 1.13.3
Bumping those versions makes it all stop working, so we're stuck with them for the time being. Also, as per this issue in dynamodb-streams-kinesis-adapter, we can't use v2. Any suggestions would be appreciated!
Same problem 6 years later :disappointed:
I'm using amazon-kinesis-client 1.13.3 with dynamodb-streams-kinesis-adapter 1.5.3. This is especially annoying in combination with the already spammy MultilangDaemon.
The KCL dev flow has been quite stable in the many years I've been using it.
- wire in the KCL library
- be surprised about how much boilerplate handling code is required, without much supporting documentation, particularly on how to handle errors safely
- be alarmed about sporadic, opaque but continual warnings logged in your production deployments
- spend time googling and pursuing old open github issues with unclear resolutions
- give up and set log level to ERROR and cross your fingers. Hopefully you're not dealing with a domain where data loss is a serious issue. Or switch to Lambda.