containers-roadmap
[EKS] Managed NodeGroups: Enable Group Metrics Collection for created ASG
Request: Add an option to managed node groups to enable Group Metrics Collection for the created ASG.
Which service(s) is this request for? EKS
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? I'm trying to collect more metrics to get a good overview of the instances in service and keep track of recreated nodes.
Are you currently working around this issue? We enable metrics collection for the groups manually after they are created.
Description: Each managed node group creates an ASG that is fully managed by the node group and has Group Metrics Collection disabled by default. This request is to enable more enhanced monitoring.
I came across this proposal when I tried to enable metrics for node_groups when using terraform-aws-modules/eks/aws. I really think this is a feature we very much need.
Any update on this issue?
[Update] We filed a support ticket with AWS on this. They suggested that we add our voice to this thread and turn on node group metrics manually as a workaround :/
Same suggestion from AWS, adding a +1 here to try influencing that roadmap.
The ASG backing a managed node group is meant to be more of an implementation detail. I realize there is no charge for enabling this, so it is something we could do, but I'd also like to hear more details about what problems you are trying to solve that enabling ASG metrics would help with and that can't currently be solved by more Kubernetes-native metrics options like Container Insights or Prometheus.
Hey, @mikestef9, thanks for the quick follow-up. We just started using EKS so maybe there is a better way of doing this.
Currently, we create a NodeGroup with a ScalingConfig that has a MinSize and a MaxSize. We ran into an issue not too long ago where the number of healthy nodes went below the MinSize for a few minutes. If this happens again, we want to be alerted. We use Datadog, and we could create an alert that fires if healthy nodes drop below, let's say, 10. We wanted to make the alert more dynamic and read the actual MinSize of the ASG, so that if we change it in the future we don't have to change the alert.
Do you think there is a better way of achieving this alert? Maybe this type of alert is not very useful when we are talking about EKS?
Edit: To allow Datadog to collect ASG metrics, we have to enable MetricsCollection in the ASGs we want to monitor.
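The dynamic alert described above could be sketched roughly as follows. This is only an illustration: it assumes the ASG group metrics GroupInServiceInstances and GroupMinSize have already been fetched as aligned per-minute samples (for example via Datadog or CloudWatch), and the function name `below_min_size` and the `periods` parameter are hypothetical.

```python
from typing import Sequence


def below_min_size(
    in_service: Sequence[float],
    min_size: Sequence[float],
    periods: int = 3,
) -> bool:
    """Return True if each of the last `periods` samples shows fewer
    in-service instances than the group's minimum size at that time.

    Both sequences are assumed to be aligned, oldest sample first.
    """
    if len(in_service) != len(min_size):
        raise ValueError("metric series must have the same length")
    if periods <= 0 or len(in_service) < periods:
        # Not enough data to make a decision.
        return False
    recent = zip(in_service[-periods:], min_size[-periods:])
    return all(actual < minimum for actual, minimum in recent)


if __name__ == "__main__":
    # Healthy nodes dropped below a MinSize of 10 for the last three samples.
    print(below_min_size([10, 9, 8, 8], [10, 10, 10, 10], periods=3))
```

Because the threshold comes from the GroupMinSize series rather than a hard-coded number, changing MinSize on the node group would not require changing the alert.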
Similar to what @javs-perez has mentioned, we are using Datadog and wish to alert on capacity e.g. % of running nodes out of the max size.
We had a problem where our cluster autoscaler had scaled to max capacity set for the managed node group, so we had pending pods due to insufficient resources. We can remedy this via pending pods potentially, but having these metrics would certainly be beneficial.
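A capacity alert like the one described here boils down to a small calculation. The sketch below assumes the in-service count and max size are read from the ASG group metrics (GroupInServiceInstances and GroupMaxSize); the function names and the 90% default threshold are illustrative choices, not anything prescribed by AWS or Datadog.

```python
def capacity_used_percent(in_service: int, max_size: int) -> float:
    """Percentage of the node group's maximum size currently in service."""
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    return 100.0 * in_service / max_size


def near_max_capacity(
    in_service: int, max_size: int, threshold_percent: float = 90.0
) -> bool:
    """True when usage is at or above the (hypothetical) alert threshold."""
    return capacity_used_percent(in_service, max_size) >= threshold_percent


if __name__ == "__main__":
    print(capacity_used_percent(9, 10))
    print(near_max_capacity(9, 10))
```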
I have created a script to automate this in my CI/CD pipeline. It only uses awscli and jq, so someone might benefit: https://gist.github.com/cdalar/f5749040ccb7487203738a134767e3fc
Note: change it according to your needs, e.g. --region.
# Gets the FIRST cluster returned by list-clusters. Assumes you only have one EKS cluster.
EKS_CLUSTER_NAME=$(aws eks list-clusters --region=eu-central-1 | jq -r '.clusters[0]')
echo "$EKS_CLUSTER_NAME"
# First node group from the list.
NG=$(aws eks list-nodegroups --cluster-name "$EKS_CLUSTER_NAME" | jq -r '.nodegroups[0]')
echo "$NG"
# First Auto Scaling group name
ASG_NAME=$(aws eks describe-nodegroup --cluster-name "$EKS_CLUSTER_NAME" --nodegroup-name "$NG" | jq -r '.nodegroup.resources.autoScalingGroups[0].name')
# Enable Auto Scaling group metrics
aws autoscaling enable-metrics-collection --auto-scaling-group-name "$ASG_NAME" --granularity "1Minute"
# --------- Extra ----------
# Get SNS topic ARN for alarms (first topic in the list).
SNS_ARN=$(aws sns list-topics | jq -r '.Topics[0].TopicArn')
# EKS Auto Scaling capacity alarm
EKS_ASG_MAX_SIZE=$(aws cloudformation describe-stacks | jq -r --arg EKS_CLUSTER_NAME "$EKS_CLUSTER_NAME" '.Stacks[] | select( .StackName == $EKS_CLUSTER_NAME+"-eks-nodegroup")' | jq -r '.Parameters[] | select(.ParameterKey == "EksAsgMaxSize") | .ParameterValue')
aws cloudwatch put-metric-alarm --alarm-name "${EKS_CLUSTER_NAME}-EKS NodeGroup EksAsgCapacityAlarm" --evaluation-periods 1 --comparison-operator GreaterThanOrEqualToThreshold --metric-name GroupTotalInstances --period 600 --namespace AWS/AutoScaling --statistic Maximum --threshold "$EKS_ASG_MAX_SIZE" --dimensions "Name=AutoScalingGroupName,Value=$ASG_NAME" --ok-actions "$SNS_ARN" --alarm-actions "$SNS_ARN"
Any update on this issue? I'm hoping that AWS Managed Services will provide a seamless integration. I'm sick of manually manipulating it.
Hi @mikestef9,
I will describe my use case. We use Datadog to monitor Kubernetes/EKS etc... In most cases, yes, you can use other Kubernetes metrics without depending on the ASG.
But there's a case where it's really useful. Imagine you scale your ASG to zero, or delete the node group. In that case the Datadog DaemonSet (or CloudWatch Container Insights) no longer runs anywhere, because there are no more nodes available. You stop receiving metrics from Kubernetes and no longer know whether you have nodes running or not.
With ASG metrics available, we can catch this case by monitoring the ASG metrics for running instances etc... Those won't stop as they come from the AWS integration of Datadog.
Also, if the ASG metrics are free, why not enable them by default? Will it cause any issue to anyone? I guess not. So maybe there's no need to provide an option to enable/disable.
Just enable it by default! :)
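The scale-to-zero case above can be checked from outside the cluster once group metrics are enabled. The following is a minimal sketch, assuming the standard AWS/AutoScaling metric GroupInServiceInstances and a boto3 CloudWatch client passed in by the caller; the helper name `node_status` and its three return values are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def node_status(cloudwatch, asg_name: str, lookback_minutes: int = 10) -> str:
    """Classify an ASG as 'running', 'empty', or 'no-data' based on the
    most recent GroupInServiceInstances datapoint in the lookback window."""
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/AutoScaling",
        MetricName="GroupInServiceInstances",
        Dimensions=[{"Name": "AutoScalingGroupName", "Value": asg_name}],
        StartTime=end - timedelta(minutes=lookback_minutes),
        EndTime=end,
        Period=60,
        Statistics=["Minimum"],
    )
    # Datapoints are not guaranteed to be ordered, so sort by timestamp.
    datapoints = sorted(response["Datapoints"], key=lambda d: d["Timestamp"])
    if not datapoints:
        # Metrics collection is disabled, or the group no longer exists.
        return "no-data"
    return "empty" if datapoints[-1]["Minimum"] == 0 else "running"


# Usage (requires AWS credentials; the ASG name is a placeholder):
#   import boto3
#   print(node_status(boto3.client("cloudwatch"), "my-asg-name"))
```

Treating "no-data" separately from "empty" matters here, since a deleted node group stops emitting ASG metrics altogether.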
@mikestef9 in my use case we also use DataDog for monitoring and we have alerts for when the ASG is at or near max capacity. I'm not sure of a way to track that directly against the EKS Managed Node Group resources and as far as I know CloudWatch doesn't have any metrics for EKS?
Moved to in progress. We are going to enable this flag for newly created managed node groups. Follow this issue for further updates.
import boto3
eks = boto3.client('eks')
autoscaling = boto3.client('autoscaling')
clusters = eks.list_clusters()['clusters']
for cluster in clusters:
    print(f'cluster: {cluster}')
    nodegroups = eks.list_nodegroups(clusterName=cluster)["nodegroups"]
    for nodegroup in nodegroups:
        print(f'*nodegroup: {nodegroup}')
        autoScalingGroups = eks.describe_nodegroup(clusterName=cluster, nodegroupName=nodegroup)["nodegroup"]["resources"]["autoScalingGroups"]
        for autoScalingGroup in autoScalingGroups:
            print(f'##autoScalingGroup: {autoScalingGroup["name"]}')
            metricsResult = autoscaling.enable_metrics_collection(AutoScalingGroupName=autoScalingGroup["name"], Granularity="1Minute")
            print(f'@@@metricsResult: {metricsResult["ResponseMetadata"]["HTTPStatusCode"]}')
Python script to activate metrics on all ASGs.
Hey, any news on this issue?
Any updates?
Any Updates on this ?
Any Updates on this ?
Is there an update on this case?
Is there any update on this? Our organization also needs this feature.
We also need this. Any update?
+1 for this, it would be great to be able to configure this for managed node groups!
hey @mikestef9, do we have any updates on the progress of the implementation of this ?
There is a new blog published today to enable this functionality using EventBridge and Lambda: https://aws.amazon.com/blogs/containers/automatically-enable-group-metrics-collection-for-amazon-eks-managed-node-groups/
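For readers who want a feel for the EventBridge and Lambda approach, here is a rough sketch. It is not the code from the blog post. It assumes an EventBridge rule that forwards the CloudTrail record for the EKS CreateNodegroup call, and it assumes the cluster and node group names appear under `detail.requestParameters` as `name` and `nodegroupName`; verify both against a real event before relying on them. The helper name `handle_nodegroup_created` is hypothetical.

```python
def handle_nodegroup_created(event, eks, autoscaling):
    """Enable group metrics on every ASG behind a newly created node group.

    Returns the list of ASG names that were updated.
    """
    detail = event.get("detail", {})
    if detail.get("eventName") != "CreateNodegroup":
        return []
    params = detail.get("requestParameters") or {}
    cluster = params.get("name")  # ASSUMPTION: cluster name field
    nodegroup = params.get("nodegroupName")  # ASSUMPTION: node group field
    if not cluster or not nodegroup:
        return []

    described = eks.describe_nodegroup(clusterName=cluster, nodegroupName=nodegroup)
    groups = described["nodegroup"].get("resources", {}).get("autoScalingGroups", [])
    enabled = []
    for group in groups:
        autoscaling.enable_metrics_collection(
            AutoScalingGroupName=group["name"], Granularity="1Minute"
        )
        enabled.append(group["name"])
    return enabled


def lambda_handler(event, context):
    # boto3 is imported lazily so the helper above can be tested without AWS.
    import boto3

    return handle_nodegroup_created(
        event, boto3.client("eks"), boto3.client("autoscaling")
    )
```

One caveat with this sketch: the ASG may not exist yet at the moment the create call is logged, in which case the helper returns an empty list, so a real deployment would likely need a retry or a trigger on a later event.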
In all seriousness, this is not the solution people are asking for. This is a temporary workaround that offloads the feature implementation to the customer, while everyone here is expecting a managed (as in Managed Node Group) solution from AWS.
I agree. Ideally, enabling auto scaling group metrics would be exposed as a field in the MNG API.
Yeah, I would have liked this to be exposed in the AWS Terraform provider when creating an aws_eks_node_group.
It feels like it should just be another attribute that takes true or false.
If exposing an attribute takes too long to implement, why not change the behaviour to default to true, since these metrics are free? That way, later on you can add an attribute for people who want to disable it. I would imagine this should be a very quick implementation.
> If exposing an attribute takes too long to implement, why not change the behaviour to default to true, since these metrics are free? That way, later on you can add an attribute for people who want to disable it. I would imagine this should be a very quick implementation.
I don't think it's free. The metrics are ingested into CloudWatch and you still pay for that. I think that is why they are not on by default in the AWS Console: it does cost money. I could be wrong though.
From the docs: "When group metrics are enabled, Amazon EC2 Auto Scaling sends the following metrics to CloudWatch. The metrics are available at one-minute granularity at no additional charge, but you must enable them."
https://docs.aws.amazon.com/autoscaling/ec2/userguide/ec2-auto-scaling-cloudwatch-monitoring.html
I had the impression AWS was going to do this as seen on this comment:
![Screenshot 2022-11-14 at 11 20 50](https://user-images.githubusercontent.com/4479787/201683708-f7e0edc1-525d-4605-a605-f2306d8a7b3f.png)