aws-quota-checker

PromQL expression QOL

kedoodle opened this issue 2 years ago • 8 comments

We're updating our alerts to make use of the metrics exposed by the prometheus-exporter feature of aws-quota-checker. We have a generic expression which aims to alert whenever we've breached 70% of any limit.

The expression is quite unwieldy:

round( 100 *
    label_replace({__name__=~"^awsquota_([3a-z_]+(_count|_instances))$",account=~".+"}, "resource", "$1", "__name__", "^awsquota_([3a-z_]+(_count|_instances))$")
    / on (resource)
    label_replace({__name__=~"^awsquota_([3a-z_]+(_limit))$",account=~".+"}, "resource", "$1", "__name__", "^awsquota_([3a-z_]+)(_limit)$")
) > 70

A couple of suggestions that would aid in crafting PromQL expressions:

  • It would be great if the metrics had an additional label e.g.
    awsquota_rds_instances{resource="rds_instances"}
    awsquota_rds_instances_limit{resource="rds_instances"}
    
  • A bigger change, but what if all quotas were exposed through a single shared pair of metrics, distinguished only by the additional label as above (see the sketch just below)? e.g.
    awsquota_usage{resource="rds_instances"}
    awsquota_limit{resource="rds_instances"}
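
For illustration, under that single-pair scheme the generic 70% alert might collapse to something like the following sketch (it assumes the new metrics would keep the account label; the exact matching labels are an assumption):

round( 100 *
    awsquota_usage{account=~".+"}
    / on (account, resource)
    awsquota_limit{account=~".+"}
) > 70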
    

Feel free to disregard this if it's too niche or opinionated in a direction you'd rather not take. A workaround for anyone facing similar grievances could be recording rules.
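
As a rough illustration of that workaround, a recording rule could precompute the percentage once so alert expressions stay simple. This is only a sketch; the group and record names (awsquota:usage_percent) are made up:

groups:
  - name: aws-quota-checker-recording  # illustrative group name
    rules:
      - record: awsquota:usage_percent  # illustrative record name
        expr: |
          round( 100 *
              label_replace({__name__=~"^awsquota_([3a-z_]+(_count|_instances))$",account=~".+"}, "resource", "$1", "__name__", "^awsquota_([3a-z_]+(_count|_instances))$")
              / on (resource)
              label_replace({__name__=~"^awsquota_([3a-z_]+(_limit))$",account=~".+"}, "resource", "$1", "__name__", "^awsquota_([3a-z_]+)(_limit)$")
          )

An alerting rule would then only need awsquota:usage_percent > 70.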

kedoodle avatar Jan 04 '22 04:01 kedoodle

Hey @kedoodle, thanks for opening this issue. I understand the benefit of switching to the awsquota_usage{resource="rds_instances"} scheme. But what would be the advantage of adding a resource label to the existing metrics?

brennerm avatar Jan 04 '22 13:01 brennerm

Hey @brennerm, appreciate the response!

I'm thinking of a scenario with "generic" expressions where we want to alert on any and all AWS limits reaching a certain threshold (as opposed to a single resource).

TL;DR: it saves a label_replace or two.

Existing metrics:

awsquota_s3_bucket_count{account="123456789012"}
awsquota_s3_bucket_count_limit{account="123456789012"}

Existing expression (same as original issue comment):

round( 100 *
    label_replace({__name__=~"^awsquota_([3a-z_]+(_count|_instances))$",account=~".+"}, "resource", "$1", "__name__", "^awsquota_([3a-z_]+(_count|_instances))$")
    / on (resource)
    label_replace({__name__=~"^awsquota_([3a-z_]+(_limit))$",account=~".+"}, "resource", "$1", "__name__", "^awsquota_([3a-z_]+)(_limit)$")
) > 70

Existing metric names with additional resource label:

awsquota_s3_bucket_count{account="123456789012",resource="s3_bucket_count"}
awsquota_s3_bucket_count_limit{account="123456789012",resource="s3_bucket_count"}

New expression, using the existing metric names with the additional resource label:

round( 100 *
    {__name__=~"^awsquota_([3a-z_]+(_count|_instances))$",account=~".+"}
    / on (resource)
    {__name__=~"^awsquota_([3a-z_]+(_limit))$",account=~".+"}
) > 70

It could also be nice for specific alerts where you want to use the resource as part of the alert details, e.g. the alert could have a description (built from metric labels) saying we've reached 70% of the s3_bucket_count limit in 123456789012. I understand that the resource can be derived from the metric name - it just requires an extra label_replace for a seemingly common use case.

round( 100 *
    {__name__="awsquota_s3_bucket_count"}
    / on (resource)
    {__name__="awsquota_s3_bucket_count_limit"}
) > 70
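
For example, with the resource label present, an alerting rule could surface the resource and account directly in its description. This is just a sketch; the alert name and the annotation wording are made up:

groups:
  - name: aws-quota-checker-alerts  # illustrative group name
    rules:
      - alert: AwsS3BucketCountNearLimit  # illustrative alert name
        expr: |
          round( 100 *
              awsquota_s3_bucket_count
              / on (account, resource)
              awsquota_s3_bucket_count_limit
          ) > 70
        annotations:
          description: "Reached {{ $value }}% of the {{ $labels.resource }} limit in account {{ $labels.account }}"

Matching on (account, resource) keeps both labels in the result, so they remain available to the annotation template.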

kedoodle avatar Jan 04 '22 23:01 kedoodle

@kedoodle I agree with your point of view. I added a new label called quota in https://github.com/brennerm/aws-quota-checker/commit/585f1b62f5b6b3061e56fca5561f3c5e0d7f6ea7 that contains the quota name. Could you provide feedback on that change? If it works for you I'll create a new release.

I'll probably also switch to the proposed awsquota_usage and awsquota_limit scheme at some point, but that'll be part of a new major release as it's a breaking change.

brennerm avatar Jan 05 '22 19:01 brennerm

Hey @brennerm, I've built and deployed from 585f1b6. The new label looks great!


I understand that awsquota_usage and awsquota_limit would be a breaking change. Would love to see it in a future release.

kedoodle avatar Jan 05 '22 22:01 kedoodle

That's great to hear. The change has been released with version 1.10.0.

I'll leave the ticket open until I switch to the breaking change scheme.

brennerm avatar Jan 06 '22 11:01 brennerm

Thanks @brennerm!

I'm in the process of deploying 1.10.0 into a few different k8s clusters. Probably unrelated to #31, but I'm seeing some high spikes in memory usage (~800 MiB) while refreshing current values. I've increased the memory limits and will let you know next week (in another issue?) whether the spikes persisted over the weekend.

Container logs, after which the pod is OOMKilled:

AWS profile: default | AWS region: ap-southeast-2 | Active checks: cf_stack_count,ebs_snapshot_count,rds_instances,s3_bucket_count
07-Jan-22 04:46:33 [INFO] aws_quota.prometheus - starting /metrics endpoint on port 8080
07-Jan-22 04:46:33 [INFO] aws_quota.prometheus - collecting checks
07-Jan-22 04:46:33 [INFO] aws_quota.prometheus - collected 4 checks
07-Jan-22 04:46:33 [INFO] aws_quota.prometheus - refreshing limits
07-Jan-22 04:46:34 [INFO] aws_quota.prometheus - limits refreshed
07-Jan-22 04:46:34 [INFO] aws_quota.prometheus - refreshing current values

EDIT: Given enough memory, we can see it takes 3 minutes 30 seconds to refresh current values:

07-Jan-22 05:04:06 [INFO] aws_quota.prometheus - refreshing current values
07-Jan-22 05:07:36 [INFO] aws_quota.prometheus - current values refreshed

This particular AWS account has ~35k EBS snapshots. I suspect pagination may be needed to reduce memory usage during any one check, e.g. https://github.com/brennerm/aws-quota-checker/blob/1.10.0/aws_quota/check/ebs.py#L13 in my scenario.
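
For what it's worth, here's a minimal sketch of what paginating that call could look like using boto3's built-in paginator (illustrative only, and not necessarily how this ends up being fixed):

import boto3

def count_ebs_snapshots() -> int:
    # Page through the snapshots owned by this account instead of loading
    # the full list into memory at once, keeping only a running count.
    ec2 = boto3.client("ec2")
    paginator = ec2.get_paginator("describe_snapshots")
    total = 0
    for page in paginator.paginate(OwnerIds=["self"]):
        total += len(page["Snapshots"])
    return total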

EDIT 2: I did some troubleshooting, given that most people probably don't have an AWS account with 35k EBS snapshots handy, and opened PR #32.

kedoodle avatar Jan 07 '22 04:01 kedoodle

Hello @kedoodle,

Thanks for your work. Your expression doesn't work with this metric:

awsquota_elb_listeners_per_clb

We are trying to find a new alert rule and will get back to you!

Thanks!

tpoindessous avatar May 19 '22 09:05 tpoindessous

The usage half of my expression only matches metric names ending in _count or _instances, which is why awsquota_elb_listeners_per_clb isn't picked up. Hopefully you can adapt the expression to something that works for your use case in lieu of the proposed awsquota_usage and awsquota_limit breaking change being implemented.

kedoodle avatar May 27 '22 04:05 kedoodle