amazon-kinesis-client
After shard splitting, our log is flooded with warning messages "Cannot find the shard given shardId"
ShardSyncTask runs either on worker initialization or when the worker detects that one of its assigned shards has completed. In the event of a shard split, however, if the child shard lands on a worker that was not previously processing the parent shard, that worker will not run the ShardSyncTask, because none of its previously assigned shards have completed.
Meanwhile, the lease coordinator has timer tasks that sync with the DynamoDB lease table to assign itself shards to process.
So we end up with the worker starting to process the child shard while, at the same time, repeatedly logging the warning from line 208 of KinesisProxy:
LOG.warn("Cannot find the shard given the shardId " + shardId);
As far as I understand, the shard info is needed only during de-aggregation, to discard user records that are supposed to be re-routed to other shards during resharding. So we are not experiencing dropped records or anything severe; it's just the flooding of our log, and maybe some duplicates, since we are using KPL aggregation on the producer side.
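For context, the code path that emits this warning is roughly of the following shape (a simplified sketch with illustrative class and field names, not the actual KinesisProxy source): the shard map is filled at worker startup and only refreshed when one of the worker's own shards ends, so a child shard acquired purely through the lease table never makes it into the cache.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

import com.amazonaws.services.kinesis.model.Shard;

// Simplified illustration, not the real KinesisProxy internals.
class CachedShardLookup {
    private static final Log LOG = LogFactory.getLog(CachedShardLookup.class);

    // Populated on worker initialization; refreshed only when one of this
    // worker's own shards reaches its end (i.e. after a ShardSyncTask run).
    private final Map<String, Shard> cachedShardMap = new ConcurrentHashMap<>();

    Shard getShard(String shardId) {
        Shard shard = cachedShardMap.get(shardId);
        if (shard == null) {
            // A child shard picked up via the lease table after a split is
            // not in this cache, so every lookup for it lands here.
            LOG.warn("Cannot find the shard given the shardId " + shardId);
        }
        return shard;
    }
}
```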
I have the same warning. Does anyone know how to fix it?
We are facing a similar issue. Any advice on a solution would be appreciated.
@xujiaxj just curious if you thought of anything since the bug filing?
@amanduggal we modified our logback settings to suppress the warning message:
<logger name="com.amazonaws.services.kinesis.clientlibrary.proxies.KinesisProxy" level="ERROR"/>
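If editing logback.xml isn't convenient, the same suppression can be done programmatically via logback-classic (just a sketch, and it assumes logback is actually your SLF4J backend, otherwise the cast below will fail):

```java
import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;

import org.slf4j.LoggerFactory;

public final class KinesisProxyLogSuppression {

    // Call once early during application startup, before the KCL worker runs.
    public static void suppressShardWarnings() {
        // Raise the KinesisProxy logger to ERROR so the repeated
        // "Cannot find the shard given the shardId" WARN lines are dropped.
        Logger kinesisProxyLogger = (Logger) LoggerFactory.getLogger(
                "com.amazonaws.services.kinesis.clientlibrary.proxies.KinesisProxy");
        kinesisProxyLogger.setLevel(Level.ERROR);
    }

    private KinesisProxyLogSuppression() {}
}
```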
This is especially annoying when using the KCL to read a DynamoDB stream, which splits its shards every 4 hours according to this blog post by one of the DynamoDB engineers at AWS:
Typically, shards in DynamoDB streams close for writes roughly every four hours after they are created and become completely unavailable 24 hours after they are created.
https://blogs.aws.amazon.com/bigdata/post/TxFCI3UJJJYEXJ/Process-Large-DynamoDB-Streams-Using-Multiple-Amazon-Kinesis-Client-Library-KCL
Just letting people know that we are aware of this. We're looking into fixing this, but I don't have an ETA at this time.
I ran into this using DynamoDB Streams without explicit shard splitting occurring (just the usual DynamoDB cycling of the shard as @matthewbogner described). FWIW, here is the sequence we encountered that triggered the warnings. With DynamoDB Streams this occurs pretty often--at any given point in time there's almost always at least one of our servers in this state where it's logging these warnings every 2 seconds. We've had to turn off WARN for KinesisProxy and ProcessTask.
Assume a DynamoDB stream with shard S1 and two stream workers A and B using the KCL (we aren't using the KPL):
- At the start, consumer A owns a lease on shard S1, consumer B is idle because no leases are available.
- At some point, DynamoDB closes shard S1 and creates a child shard S2 whose parent is S1.
- A reaches the end of S1.
- A syncs the shard set with the DynamoDB lease table, creating a new lease for S2.
- A obtains the lease for S2. It hasn't yet cleaned up the lease for S1.
- B wakes up, notices 2 leases in the lease table both owned by A (S1 and S2), and steals the S2 lease from A (code).
- A notices that S2 lease has been lost, becomes idle.
- B begins processing records in S2.
- B logs warnings because it did not execute a code path that would cause it to re-sync its cached list of shards to include S2 (code).
  - KinesisProxy initializes its cached shard list on startup
  - KinesisProxy's cached shard list is refreshed upon reaching the end of a shard
  - KinesisProxy's cached shard list is NOT refreshed on lease steal
- B continues to log warnings until it reaches the end of shard S2.
- ... at which point A may steal the lease for the next child shard (S3) and begin logging warnings itself.
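If I'm reading this right, the fix would be to refresh the cached shard list whenever a lookup misses (or at least on lease steal), rather than only at startup and at end-of-shard. Something along these lines, just a sketch against a hypothetical cache and not the actual KCL code:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

import com.amazonaws.services.kinesis.model.Shard;

// Hypothetical sketch (not actual KCL code): a shard cache that refreshes
// itself on a miss, so a worker that picks up a child shard via a lease
// steal doesn't warn forever against a stale cached shard list.
class RefreshingShardCache {
    private final Map<String, Shard> cache = new ConcurrentHashMap<>();
    private final Supplier<List<Shard>> listShards; // e.g. a wrapper around ListShards/DescribeStream

    RefreshingShardCache(Supplier<List<Shard>> listShards) {
        this.listShards = listShards;
    }

    Shard getShard(String shardId) {
        Shard shard = cache.get(shardId);
        if (shard == null) {
            refresh();                 // re-list shards once on a miss
            shard = cache.get(shardId);
        }
        return shard;                  // still null only if the shard really doesn't exist
    }

    private void refresh() {
        for (Shard s : listShards.get()) {
            cache.put(s.getShardId(), s);
        }
    }
}
```

A real fix would presumably also need to throttle the refresh, so a genuinely unknown shardId doesn't turn into an extra ListShards call on every lookup.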
I've been testing the stack and looking at the sharding, and I've been noticing these errors, although everything continues to appear to work.
Forgive my newness to the technology, but is this something that we should be concerned about?
Unsure why this is labelled an enhancement?
Any updates on this? If I understood @shawnsmith's analysis correctly, the solution is to refresh the cached shard list on lease steal?
From 2016:
Just letting people know that we are aware of this. We're looking into fixing this, but I don't have an ETA at this time.
Is this still the case?
Just copying this over from the linked issue. @pfifer Do you have any updates or insight here?
I think I have the same issue, although we also see non-stop ERROR logs like:
ERROR [2020-02-27 13:02:45,382] [RecordProcessor-2873] c.a.s.k.c.lib.worker.InitializeTask: Caught exception:
com.amazonaws.services.kinesis.clientlibrary.exceptions.internal.KinesisClientLibIOException: Unable to fetch checkpoint for shardId shardId-00000001582460850801-53f6f94b
        at com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibLeaseCoordinator.getCheckpointObject(KinesisClientLibLeaseCoordinator.java:286)
        at com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitializeTask.call(InitializeTask.java:82)
        at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:49)
        at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:24)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
And I found an AWS Dev forum link related to this issue here: https://forums.aws.amazon.com/thread.jspa?messageID=913872
@pfifer any ideas? any updates? any anything?
@pfifer any updates on this?
@aldilaff might want to check #185 too, in case the workerId is also bugging you.
Cannot find the shard given the shardId
chgenvulgfjlejltgvglhecbucrihrcbbclfj
@joshua-kim was that a yubikey press? :-P Otherwise, can you please elaborate on why the issue is being closed and how to solve/prevent it?
@igracia Sorry, yes that was a Yubikey press. I was referencing this issue when looking into another cached shard map issue in a fork of 1.6; I'm curious though, are you still seeing this on the latest 2.x/1.x releases? The latest releases are no longer using ListShards in most cases, so I'm curious to see if this bug is still present.
Thanks @joshua-kim! We have several consumers using the DynamoDB Streams Kinesis adapter on a single shard, and we're still getting this with the following versions:
- dynamodb-streams-kinesis-adapter 1.5.1
- amazon-kinesis-client 1.13.3
Bumping those versions makes it all stop working, so we're stuck with them for the time being. Also, as per this issue in dynamodb-streams-kinesis-adapter, we can't use v2. Any suggestions would be appreciated!
Same problem 6 years later :disappointed:
I'm using amazon-kinesis-client 1.13.3 with dynamodb-streams-kinesis-adapter 1.5.3. This is especially annoying in combination with the already spammy MultilangDaemon.
The KCL dev flow has been quite stable in the many years I've been using it.
- wire in the KCL library
- be surprised about how much boilerplate handling code is required, without much supporting documentation, particularly on how to handle errors safely
- be alarmed about sporadic, opaque but continual warnings logged in your production deployments
- spend time googling and pursuing old open github issues with unclear resolutions
- give up and set log level to ERROR and cross your fingers. Hopefully you're not dealing with a domain where data loss is a serious issue. Or switch to Lambda.