PromQL expression QOL
We're updating our alerts to make use of the metrics exposed by the prometheus-exporter feature of aws-quota-checker. We have a generic expression which aims to alert whenever we've breached 70% of any limit.
The expression is quite unwieldy:
round( 100 *
label_replace({__name__=~"^awsquota_([3a-z_]+(_count|_instances))$",account=~".+"}, "resource", "$1", "__name__", "^awsquota_([3a-z_]+(_count|_instances))$")
/ on (resource)
label_replace({__name__=~"^awsquota_([3a-z_]+(_limit))$",account=~".+"}, "resource", "$1", "__name__", "^awsquota_([3a-z_]+)(_limit)$")
) > 70
A couple suggestions which would aid in crafting PromQL expressions:
- It would be great if the metrics had an additional label, e.g.
awsquota_rds_instances{resource="rds_instances"}
awsquota_rds_instances_limit{resource="rds_instances"}
- A bigger change, but what if all quotas were exposed through a single shared pair of metrics, with the additional label as above? e.g.
awsquota_usage{resource="rds_instances"}
awsquota_limit{resource="rds_instances"}
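With that scheme, the generic expression above could collapse to something like this (untested sketch, assuming both metrics carry matching account and resource labels):
round( 100 *
awsquota_usage{account=~".+"}
/ on (account, resource)
awsquota_limit{account=~".+"}
) > 70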
Feel free to disregard if this is too niche or opinionated in a direction you'd rather not take. For those facing similar grievances, a workaround could be to use recording rules.
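For example, a recording rule along these lines (rough, untested sketch; the recorded metric names are illustrative) could normalize the existing metrics into a usage/limit pair without any change to the exporter:
groups:
  - name: awsquota
    rules:
      - record: awsquota:usage
        expr: 'label_replace({__name__=~"^awsquota_([3a-z_]+(_count|_instances))$",account=~".+"}, "resource", "$1", "__name__", "^awsquota_([3a-z_]+(_count|_instances))$")'
      - record: awsquota:limit
        expr: 'label_replace({__name__=~"^awsquota_([3a-z_]+(_limit))$",account=~".+"}, "resource", "$1", "__name__", "^awsquota_([3a-z_]+)(_limit)$")'
The alert expression would then reduce to round(100 * awsquota:usage / on (account, resource) awsquota:limit) > 70.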
Hey @kedoodle, thanks for opening this issue. I understand the benefit of switching to the awsquota_usage{resource="rds_instances"} scheme. But what would be the advantage of adding a resource label to the existing metrics?
Hey @brennerm, appreciate the response!
I'm thinking of a scenario for "generic" expressions where we want to alert on any and all AWS limits reaching a certain threshold (as opposed to a singular resource).
TL;DR: it saves a label_replace or two.
Existing metrics:
awsquota_s3_bucket_count{account="123456789012"}
awsquota_s3_bucket_count_limit{account="123456789012"}
Existing expression (same as original issue comment):
round( 100 *
label_replace({__name__=~"^awsquota_([3a-z_]+(_count|_instances))$",account=~".+"}, "resource", "$1", "__name__", "^awsquota_([3a-z_]+(_count|_instances))$")
/
label_replace({__name__=~"^awsquota_([3a-z_]+(_limit))$",account=~".+"}, "resource", "$1", "__name__", "^awsquota_([3a-z_]+)(_limit)$")
) > 70
Existing metric names with an additional resource label:
awsquota_s3_bucket_count{account="123456789012",resource="s3_bucket_count"}
awsquota_s3_bucket_count_limit{account="123456789012",resource="s3_bucket_count"}
New expression with the existing metric names and the additional resource label:
round( 100 *
{__name__=~"^awsquota_([3a-z_]+(_count|_instances))$",account=~".+"}
/ on (resource)
{__name__=~"^awsquota_([3a-z_]+(_limit))$",account=~".+"}
) > 70
It could also be nice for specific alerts where you want to use the resource as part of the alert details, e.g. a description (built from metric labels) saying we have reached 70% of the limit on s3_bucket_count in 123456789012. I understand that you can get the resource from the metric name - it just requires an extra label_replace for a seemingly common use case.
round( 100 *
{__name__="awsquota_s3_bucket_count"}
/ on (resource)
{__name__="awsquota_s3_bucket_count_limit"}
) > 70
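For instance, an alert rule could then template those labels straight into its annotations (rough sketch; the alert name, threshold, and wording are illustrative, and it joins on account as well so that both labels survive into the result):
- alert: AwsQuotaUsageHigh
  expr: 'round(100 * {__name__="awsquota_s3_bucket_count"} / on (account, resource) {__name__="awsquota_s3_bucket_count_limit"}) > 70'
  for: 15m
  annotations:
    description: 'Reached {{ $value }}% of the {{ $labels.resource }} limit in {{ $labels.account }}.'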
@kedoodle I agree with your point of view. I added a new label called quota in https://github.com/brennerm/aws-quota-checker/commit/585f1b62f5b6b3061e56fca5561f3c5e0d7f6ea7 that contains the quota name.
Could you provide feedback on that change? If it works for you I'll create a new release.
I'll probably also switch to the proposed awsquota_usage and awsquota_limit scheme at some point in time, but that'll be part of a new major release as it's a breaking change.
Hey @brennerm, I've built and deployed from 585f1b6. The new label looks great!
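With the quota label, the generic expression can drop the label_replace calls entirely. Roughly (untested sketch, assuming the label carries the same value on the current-value and _limit series):
round( 100 *
{__name__=~"^awsquota_([3a-z_]+(_count|_instances))$",account=~".+"}
/ on (account, quota)
{__name__=~"^awsquota_([3a-z_]+(_limit))$",account=~".+"}
) > 70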
I understand that awsquota_usage and awsquota_limit would be a breaking change. Would love to see it in a future release.
That's great to hear. The change has been released with version 1.10.0.
I'll leave the ticket open until I switch to the breaking change scheme.
Thanks @brennerm!
I'm in the process of deploying 1.10.0 into a few different k8s clusters. Probably unrelated to #31, but I'm seeing some high spikes in memory usage (~800 MiB) during refreshing current values. I've increased memory limits and will let you know (in another issue?) next week if the spikes persist over the weekend.
Container logs, after which the pod is OOMKilled:
AWS profile: default | AWS region: ap-southeast-2 | Active checks: cf_stack_count,ebs_snapshot_count,rds_instances,s3_bucket_count
07-Jan-22 04:46:33 [INFO] aws_quota.prometheus - starting /metrics endpoint on port 8080
07-Jan-22 04:46:33 [INFO] aws_quota.prometheus - collecting checks
07-Jan-22 04:46:33 [INFO] aws_quota.prometheus - collected 4 checks
07-Jan-22 04:46:33 [INFO] aws_quota.prometheus - refreshing limits
07-Jan-22 04:46:34 [INFO] aws_quota.prometheus - limits refreshed
07-Jan-22 04:46:34 [INFO] aws_quota.prometheus - refreshing current values
EDIT: Given enough memory, we can see it takes 3 minutes 30 seconds to refresh current values:
07-Jan-22 05:04:06 [INFO] aws_quota.prometheus - refreshing current values
07-Jan-22 05:07:36 [INFO] aws_quota.prometheus - current values refreshed
This particular AWS account has ~35k EBS snapshots. I suspect pagination may be needed to reduce memory usage during any one particular check, e.g. https://github.com/brennerm/aws-quota-checker/blob/1.10.0/aws_quota/check/ebs.py#L13 in my scenario.
EDIT 2: Did some troubleshooting, given that most people probably don't have an AWS account with 35k EBS snapshots handy. Opened PR #32.
Hello @kedoodle, thanks for your work. Your expression doesn't work with this metric: awsquota_elb_listeners_per_clb
We are trying to find a new alert rule, we will get back to you!
Thanks!
Hopefully you can adapt the expression to something that works for your use case, in lieu of the proposed awsquota_usage and awsquota_limit breaking change being implemented.