MillisBehindLatest metric across _all_ shards

Open usrenmae opened this issue 7 years ago • 21 comments

Currently several metrics, including MillisBehindLatest, are reported to CloudWatch with the shard ID as one of the dimensions. We find it very convenient to set CloudWatch alarms on this metric so that we can react if any shard starts to lag behind. However, it is not possible to set up an alarm without specifying the exact shard ID. This is a limiting factor: when shards are added and removed constantly, the shard IDs are very dynamic, and each time they change the alarms must be changed accordingly, which is frustrating. Since one generally wants to react to any shard lagging behind, it would be very nice to have a global MillisBehindLatest that has no shard in its dimensions. This could be the maximum across all shards, for example MaxMillisBehindLatest.
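To illustrate the requested aggregate, here is a minimal Python sketch, not something KCL provides. It assumes the application already has the latest MillisBehindLatest value for each shard it processes. The function names, the `Application` dimension, and the `MaxMillisBehindLatest` metric name are hypothetical; the dictionary follows the general shape of a CloudWatch PutMetricData entry but nothing is sent to CloudWatch here.

```python
from typing import Dict, List, Optional


def max_millis_behind_latest(per_shard: Dict[str, int]) -> Optional[int]:
    """Return the largest lag across all shards, or None if no shard reported."""
    if not per_shard:
        return None
    return max(per_shard.values())


def build_metric_data(application: str, per_shard: Dict[str, int]) -> List[dict]:
    """Build one datum with no ShardId dimension (hypothetical metric name)."""
    worst = max_millis_behind_latest(per_shard)
    if worst is None:
        return []  # nothing to publish when no shard has reported yet
    return [{
        "MetricName": "MaxMillisBehindLatest",  # hypothetical, as proposed above
        "Dimensions": [{"Name": "Application", "Value": application}],  # assumed
        "Value": float(worst),
        "Unit": "Milliseconds",
    }]
```

Because the datum carries only an application-level dimension, an alarm on it would not need to change when shards are split or merged.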

usrenmae avatar Oct 19 '17 08:10 usrenmae

Kinesis does emit a stream-level metric for iterator age, called GetRecords.IteratorAgeMilliseconds. You should be able to set up an alarm on that metric, which can be found under the Kinesis namespace in CloudWatch. If you set the statistic for that metric to Maximum, it will reflect the maximum millisBehindLatest across all the shards for that given period. Please feel free to reopen the issue if you still have questions.
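As a rough sketch of the alarm described above, the following Python function builds the parameters for a CloudWatch alarm on the stream-level metric with the Maximum statistic. The function name, alarm name, period, and evaluation-period values are assumptions chosen for illustration; the dictionary is meant to be passed to something like boto3's `put_metric_alarm`, which is not called here.

```python
def iterator_age_alarm_params(stream_name: str, threshold_ms: float) -> dict:
    """Build alarm parameters for the stream-level iterator age metric."""
    return {
        "AlarmName": f"{stream_name}-iterator-age-high",  # hypothetical name
        "Namespace": "AWS/Kinesis",
        "MetricName": "GetRecords.IteratorAgeMilliseconds",
        "Dimensions": [{"Name": "StreamName", "Value": stream_name}],
        "Statistic": "Maximum",       # worst lag observed in each period
        "Period": 60,                 # seconds; assumed value
        "EvaluationPeriods": 5,       # assumed value
        "Threshold": threshold_ms,
        "ComparisonOperator": "GreaterThanThreshold",
    }


# Example: alarm when the maximum iterator age exceeds one minute.
params = iterator_age_alarm_params("my-stream", 60_000)
```

Note that this metric is per stream, so it cannot distinguish between several consumers reading the same stream.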

sahilpalvia avatar Oct 19 '17 20:10 sahilpalvia

Thanks for pointing out the GetRecords.IteratorAgeMilliseconds metric. I wasn't aware of it. After a closer look I found that it is a global, per-stream metric of the Kinesis service. What I'm interested in is a per-consumer metric. We have multiple consumers running on the same stream; some of them may keep up with the event feed perfectly, while others may lag. My idea was to have a metric that tells you which particular consumer is lagging behind. It's not possible to get this information from the GetRecords.IteratorAgeMilliseconds metric of the Kinesis stream itself, but KCL could provide such a metric in the same way it provides MillisBehindLatest, just without the shardId dimension. It is really not convenient to build automation around any shard-specific metric: shards are very dynamic and may change over time, and it is not possible to create an alarm on a metric that has dimensions without specifying the dimension value. Monitoring built on a per-consumer basis is much more useful: one can set up permanent alarms on it, and only in case of an incident trace back to the particular shard using the shard-specific metrics that already exist. Please reopen the issue, as suggested above.

usrenmae avatar Oct 20 '17 12:10 usrenmae

Thank you for the feedback. We agree with the change you have suggested, and will prioritize it accordingly against the other customer requests we receive.

sahilpalvia avatar Oct 20 '17 17:10 sahilpalvia

@sahilpalvia I also have the same use case: we want to scale up/down based on how fast the KCL application consumes. This metric would be helpful.

StevenYCChou avatar Feb 06 '18 06:02 StevenYCChou

We have a similar use case and would like this metric as well. We have two KCL consumers on the same Kinesis stream. One has a low latency threshold, while the other tolerates a much higher latency threshold.

We've set the alarm at the lower threshold on the stream, but it fires once or twice a day because of the higher-latency KCL consumer. We have to treat it as an alarm situation each time, which obviously wastes a lot of time.

We've considered using the shard-level metric; however, since we are at the limit of the maximum number of alarms allowed and have a 60-shard stream, that is not currently possible.

ghost avatar Mar 15 '18 15:03 ghost

@sahilpalvia we also have the exact same use case. Can you provide any update on this?

akumariiit avatar Oct 07 '18 18:10 akumariiit

We don't have an update at this time. This is a feature we are interested in adding, and we will prioritize it along with all other customer requests.

If you are interested, please post a reaction on the parent post. This will assist us in prioritizing customer requests.

pfifer avatar Oct 08 '18 19:10 pfifer

+1

waffleshop avatar Oct 08 '18 19:10 waffleshop

+1

vinujan59 avatar Oct 09 '18 04:10 vinujan59

+1

vik7 avatar Oct 09 '18 04:10 vik7

+1

akumariiit avatar Oct 31 '18 07:10 akumariiit

+1

rkass avatar Nov 28 '18 21:11 rkass

+1 We have more than 500 shards in Kinesis and more than 4 KCL applications using the same Kinesis stream. In the AWS CloudWatch console we cannot search all shards because the console search result limit is 500, so we do not use the KCL metrics. In addition, the number of metrics we can graph at one time in the console is limited to 100. This feature is essential for me to check the lag of each KCL application.

winty56 avatar Mar 21 '19 08:03 winty56

+1

kaisermario avatar Jun 14 '21 07:06 kaisermario

@pfifer Any update?

kaisermario avatar Jun 14 '21 07:06 kaisermario

+1

MeisterMasi avatar Jun 14 '21 08:06 MeisterMasi

+1

CCBow-501 avatar Jun 14 '21 08:06 CCBow-501

Hello,

There are service-side metrics emitted for monitoring stream-level behind-ness. For consumers using GetRecords, the "GetRecords.IteratorAgeMilliseconds" metric is emitted, and all consumer applications contribute to this metric. For consumer applications using enhanced fan-out, the "SubscribeToShardEvent.MillisBehindLatest" metric is emitted along with the consumer name, so the status of each consumer can be monitored individually.

Consider using these metrics as an alternative to client-side metrics for monitoring application health.

For more details please refer to: https://docs.aws.amazon.com/streams/latest/dev/monitoring-with-cloudwatch.html
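As a hedged illustration of per-consumer monitoring with the enhanced fan-out metric, the Python sketch below builds a query for the maximum lag of one registered consumer. It assumes the metric is published under the `AWS/Kinesis` namespace with `StreamName` and `ConsumerName` dimensions, as the linked documentation describes; the function name, the 15-minute window, and the 60-second period are illustrative choices. The dictionary is shaped for a call such as boto3's `get_metric_statistics`, which is not made here.

```python
from datetime import datetime, timedelta, timezone


def consumer_lag_query(stream_name: str, consumer_name: str,
                       end_time: datetime, minutes: int = 15) -> dict:
    """Build a query for one enhanced fan-out consumer's maximum lag."""
    return {
        "Namespace": "AWS/Kinesis",
        "MetricName": "SubscribeToShardEvent.MillisBehindLatest",
        "Dimensions": [
            {"Name": "StreamName", "Value": stream_name},
            {"Name": "ConsumerName", "Value": consumer_name},
        ],
        "StartTime": end_time - timedelta(minutes=minutes),
        "EndTime": end_time,
        "Period": 60,                 # seconds; assumed value
        "Statistics": ["Maximum"],    # worst lag in each period
    }


# Example: query the last 15 minutes for a hypothetical consumer.
query = consumer_lag_query("my-stream", "my-consumer",
                           datetime.now(timezone.utc))
```

This only applies to consumers registered for enhanced fan-out; consumers polling with GetRecords do not appear under a consumer name.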

yasemin-amzn avatar Jun 14 '21 17:06 yasemin-amzn

Hello @yasemin-amzn , "SubscribeToShardEvent.MillisBehindLatest" is a basic (stream level) metric according to: https://docs.aws.amazon.com/streams/latest/dev/monitoring-with-cloudwatch.html

Stream-level data is sent automatically every minute at no charge.

Unfortunately we can't see this metric in our account.

kaisermario avatar Jun 22 '21 07:06 kaisermario

+1

leifbladt avatar Jun 24 '21 07:06 leifbladt

+1

QwertV2 avatar Nov 30 '21 06:11 QwertV2