cdk-monitoring-constructs

Better estimate for SQS time to drain metrics

Open r0b0ji opened this issue 2 years ago • 2 comments

Version

v5.2.3

Steps and/or minimal code example to reproduce

This is not actually a bug, but a better and simpler computation exists. Currently, the time-to-drain metric for SQS is calculated as shown in [1], which is indirect. A better estimate can be calculated using the RATE function [2].

  1. https://github.com/cdklabs/cdk-monitoring-constructs/blob/81f0c6ba0211bca586c9b994ec7aa037b2cd6e3c/lib/monitoring/aws-sqs/SqsQueueMetricFactory.ts#L82-L92
  2. https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html

Expected behavior

The consumption rate should be obtained directly. The current computation instead estimates it from other metrics, which is less accurate.

Actual behavior

The time to drain is estimated indirectly, although a better and more direct method can be used.

Other details

Sample dashboard widget source for this:

{
    "metrics": [
        [ { "expression": "m1/ABS(RATE(m1))", "label": "TimeToDrain (sec)", "id": "e1", "region": "us-east-1" } ],
        [ "AWS/SQS", "ApproximateNumberOfMessagesVisible", "QueueName", "some-test-queue", { "id": "m1", "visible": false, "region": "us-east-1" } ]
    ],
    "view": "timeSeries",
    "stacked": false,
    "region": "us-east-1",
    "stat": "Average",
    "period": 300
}
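To make the arithmetic behind `m1/ABS(RATE(m1))` concrete, here is a minimal Python sketch that mimics the expression on a list of queue-depth samples. It assumes RATE is the difference between consecutive datapoints divided by the seconds between them, as described in the metric math documentation [2]. The function names, the sample values, and the choice to return `None` when the rate is zero or undefined are illustrative assumptions, not part of CloudWatch or of this library.

```python
from typing import List, Optional


def rate_per_second(values: List[float], period_seconds: int) -> List[Optional[float]]:
    """Per-second rate of change between consecutive datapoints.

    The first datapoint has no predecessor, so its rate is None.
    """
    rates: List[Optional[float]] = [None]
    for prev, curr in zip(values, values[1:]):
        rates.append((curr - prev) / period_seconds)
    return rates


def time_to_drain_seconds(values: List[float], period_seconds: int) -> List[Optional[float]]:
    """Mimic m1 / ABS(RATE(m1)) for each datapoint.

    Assumption: a zero or undefined rate yields None (no estimate)
    rather than a division error.
    """
    result: List[Optional[float]] = []
    for depth, rate in zip(values, rate_per_second(values, period_seconds)):
        if rate is None or rate == 0:
            result.append(None)
        else:
            result.append(depth / abs(rate))
    return result


# Hypothetical ApproximateNumberOfMessagesVisible samples, one per 300 s period.
depths = [3000, 2400, 1800, 1800]
print(time_to_drain_seconds(depths, 300))
```

With these sample values the queue shrinks by 2 messages per second, so the estimates are 1200 and 900 seconds for the second and third datapoints. Because of ABS, a growing queue would also yield a positive number, so the value is only a meaningful drain time while the queue is shrinking.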

r0b0ji, Jul 04 '23 00:07

Also, in the original formula, the absolute value of the difference needs to be taken so that negative rates do not distort the average and other statistics of the time-to-drain metric. Time to drain can't be negative: if there are no messages, it is 0. The current formula, however, emits negative datapoints (the displayed range is capped at a minimum of 0, but the datapoints themselves are still negative), which lowers the average.
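The effect on the average can be illustrated with a small Python sketch. The datapoint values below are made up for illustration and do not come from a real queue. The sketch compares the average of signed datapoints with the average after taking absolute values.

```python
from statistics import mean
from typing import List


def average_signed(datapoints: List[float]) -> float:
    """Average of the datapoints as emitted, negatives included."""
    return mean(datapoints)


def average_absolute(datapoints: List[float]) -> float:
    """Average after taking the absolute value of every datapoint."""
    return mean(abs(x) for x in datapoints)


# Hypothetical time-to-drain datapoints in seconds; one came out negative.
samples = [120.0, -60.0, 90.0]
print(average_signed(samples))    # 50.0
print(average_absolute(samples))  # 90.0
```

A graph with its minimum capped at 0 hides the negative point, but the statistic still includes it, which is why the reported average ends up lower than what is visible on the graph.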

r0b0ji, Jul 04 '23 00:07