containers-roadmap [EKS] Managed NodeGroups: Enable Group Metrics Collection for created ASG


Open YaraMohammed opened this issue 4 years ago • 45 comments

Request Add an option in the managed node groups to enable Group Metrics Collection for the created ASG

Which service(s) is this request for? EKS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? I'm trying to collect more metrics to get a good overview of the instances in service and to keep track of recreated nodes.

Are you currently working around this issue? We enable the metrics collection for the groups manually after they are created

Description Managed node groups create an ASG that is fully managed by the node group and has Group Metrics Collection disabled by default. This request is to allow enabling it for enhanced monitoring.

Feb 20 '20 13:02 YaraMohammed

I came across this proposal while trying to enable metrics for node_groups when using terraform-aws-modules/eks/aws. I really think this is a feature we very much need.

Aug 10 '20 18:08 amazingandyyy

Any update on this issue?

Aug 14 '20 15:08 YaodanZhang

[updates] We filed a support ticket with AWS on this, and they suggested that we add our voice to this thread and turn on node group metrics manually as a workaround :/

Aug 17 '20 18:08 amazingandyyy

Same suggestion from AWS, adding a +1 here to try influencing that roadmap.

Jan 29 '21 17:01 javs-perez

The ASG backing a managed node group is meant to be more of an implementation detail. I realize there is no charge for enabling this, so it is something we could do, but I'd like also to hear more details about what problems you are trying to solve that enabling ASG metrics would help with, that can't be currently solved by more Kubernetes native metrics options like Container Insights or Prometheus.

Jan 29 '21 18:01 mikestef9

Hey, @mikestef9, thanks for the quick follow-up. We just started using EKS so maybe there is a better way of doing this.

Currently, we create a NodeGroup with ScalingConfig that has MinSize and a MaxSize. We ran into an issue not too long ago, where the number of healthy nodes went below the MinSize for a few mins. If this happened in the future we wanted to alert on it. We use Datadog, and we could create an alert where if healthy nodes are less than let's say 10, alert us. We wanted to make the alert more dynamic, and get the actual MinSize of the ASG. In case we change it in the future we don't have to change the alert.

Do you think there is a better way of achieving this alert? Or maybe this type of alert is not very useful when we are talking about EKS?

edit: To allow DataDog to collect ASG metrics, we have to enable MetricsCollection in the ASGs we want to monitor.
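One way to express the alert described above without hard-coding the threshold is to compare two of the group metrics that EC2 Auto Scaling publishes once collection is enabled, GroupInServiceInstances and GroupMinSize. The sketch below is only illustrative: `below_min_size` and its `consecutive` parameter are hypothetical names, and the inputs are assumed to be per-minute samples that were already fetched from the monitoring backend and aligned by timestamp.

```python
from typing import Sequence


def below_min_size(
    in_service: Sequence[float],
    min_size: Sequence[float],
    consecutive: int = 3,
) -> bool:
    """Return True if the most recent `consecutive` samples are all below MinSize.

    `in_service` holds GroupInServiceInstances samples and `min_size` holds
    GroupMinSize samples (assumption: same timestamps, oldest first).
    """
    if consecutive <= 0:
        raise ValueError("consecutive must be positive")
    pairs = list(zip(in_service, min_size))
    if len(pairs) < consecutive:
        # Not enough data yet to decide; do not alert.
        return False
    return all(running < minimum for running, minimum in pairs[-consecutive:])


if __name__ == "__main__":
    # MinSize stays at 10 while the healthy count dips for three minutes.
    print(below_min_size([10, 9, 8, 8], [10, 10, 10, 10]))
```

Because the threshold comes from the GroupMinSize series itself, changing MinSize on the node group later does not require touching the alert.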

Jan 29 '21 20:01 javs-perez

Similar to what @javs-perez has mentioned, we are using Datadog and wish to alert on capacity e.g. % of running nodes out of the max size.

We had a problem where our cluster autoscaler had scaled to max capacity set for the managed node group, so we had pending pods due to insufficient resources. We can remedy this via pending pods potentially, but having these metrics would certainly be beneficial.
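The "% of running nodes out of the max size" figure can be derived from the ASG group metrics with CloudWatch metric math. The following is a rough sketch, not a tested integration: `capacity_queries` is a hypothetical helper, and the query IDs and label are arbitrary. It only builds the `MetricDataQueries` structure accepted by CloudWatch `GetMetricData`.

```python
from typing import Any, Dict, List


def _asg_metric(query_id: str, metric_name: str, asg_name: str, period: int) -> Dict[str, Any]:
    """Build one raw-metric query for an Auto Scaling group metric."""
    return {
        "Id": query_id,
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/AutoScaling",
                "MetricName": metric_name,
                "Dimensions": [{"Name": "AutoScalingGroupName", "Value": asg_name}],
            },
            "Period": period,
            "Stat": "Average",
        },
        # Only the computed percentage is returned to the caller.
        "ReturnData": False,
    }


def capacity_queries(asg_name: str, period: int = 60) -> List[Dict[str, Any]]:
    """Return queries that compute in-service instances as a percentage of MaxSize."""
    return [
        _asg_metric("in_service", "GroupInServiceInstances", asg_name, period),
        _asg_metric("max_size", "GroupMaxSize", asg_name, period),
        {
            "Id": "pct_of_max",
            "Expression": "100 * in_service / max_size",
            "Label": "InServicePercentOfMax",
            "ReturnData": True,
        },
    ]
```

The result could be passed as `MetricDataQueries` to `boto3.client("cloudwatch").get_metric_data(...)` together with `StartTime` and `EndTime`. This only works after group metrics collection has been enabled on the ASG.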

Mar 26 '21 11:03 HenryCook

I have created a script to automate this in my CI/CD pipeline. It only uses awscli and jq, so someone might benefit: https://gist.github.com/cdalar/f5749040ccb7487203738a134767e3fc

Note: adjust it to your needs, e.g. the region.

# Use one region for every command below.
export AWS_DEFAULT_REGION=eu-central-1
# Gets the FIRST cluster from list-clusters. Assumes you have only one EKS cluster.
EKS_CLUSTER_NAME=$(aws eks list-clusters | jq -r '.clusters[0]')
echo "$EKS_CLUSTER_NAME"
# First node group from the list.
NG=$(aws eks list-nodegroups --cluster-name "$EKS_CLUSTER_NAME" | jq -r '.nodegroups[0]')
echo "$NG"
# First Auto Scaling group name.
ASG_NAME=$(aws eks describe-nodegroup --cluster-name "$EKS_CLUSTER_NAME" --nodegroup-name "$NG" | jq -r '.nodegroup.resources.autoScalingGroups[0].name')

# Enable Auto Scaling group metrics.
aws autoscaling enable-metrics-collection --auto-scaling-group-name "$ASG_NAME" --granularity "1Minute"


# --------- Extra ----------
# Get the SNS topic ARN for alarms (first topic in the account).
SNS_ARN=$(aws sns list-topics | jq -r '.Topics[0].TopicArn')
# EKS Auto Scaling capacity alarm: read the max size from the node group's CloudFormation stack.
EKS_ASG_MAX_SIZE=$(aws cloudformation describe-stacks | jq -r --arg EKS_CLUSTER_NAME "$EKS_CLUSTER_NAME" '.Stacks[] | select( .StackName == $EKS_CLUSTER_NAME+"-eks-nodegroup")' | jq -r '.Parameters[] | select(.ParameterKey == "EksAsgMaxSize") | .ParameterValue')
aws cloudwatch put-metric-alarm --alarm-name "${EKS_CLUSTER_NAME}-EKS NodeGroup EksAsgCapacityAlarm" --evaluation-periods 1 --comparison-operator GreaterThanOrEqualToThreshold --metric-name GroupTotalInstances --period 600 --namespace AWS/AutoScaling --statistic Maximum --threshold "$EKS_ASG_MAX_SIZE" --dimensions Name=AutoScalingGroupName,Value="$ASG_NAME" --ok-actions "$SNS_ARN" --alarm-actions "$SNS_ARN"

Mar 31 '21 17:03 cdalar

Any update on this issue? I'm hoping that AWS Managed Services will provide a seamless integration. I'm sick of manually manipulating it.

Jun 16 '21 03:06 nanasi880

Hi @mikestef9,

I will describe my use case. We use Datadog to monitor Kubernetes/EKS etc... In most cases, yes, you can use other Kubernetes metrics without depending on the ASG.

But there's a case where it's really useful. Imagine you scale your ASG to zero, or delete the node group. In that case the Datadog DaemonSet (or CloudWatch Container Insights) stops running because there are no more nodes available. You then stop receiving metrics from K8s and no longer know whether you have nodes running.

With ASG metrics available, we can catch this case by monitoring the ASG metrics for running instances etc... Those won't stop as they come from the AWS integration of Datadog.
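As a rough illustration of catching this case on the CloudWatch side, the sketch below builds the arguments for a `PutMetricAlarm` call that fires when GroupInServiceInstances drops below 1, and that also treats missing data as breaching so that a deleted group is not silently ignored. `no_nodes_alarm_kwargs`, the alarm name format, and the topic ARN are hypothetical; a Datadog monitor on the same metric would be configured differently.

```python
from typing import Any, Dict


def no_nodes_alarm_kwargs(asg_name: str, sns_topic_arn: str) -> Dict[str, Any]:
    """Build keyword arguments for cloudwatch.put_metric_alarm().

    The alarm triggers when the ASG reports fewer than one in-service
    instance, or when the metric stops arriving altogether.
    """
    return {
        "AlarmName": f"{asg_name}-no-in-service-instances",  # hypothetical naming scheme
        "Namespace": "AWS/AutoScaling",
        "MetricName": "GroupInServiceInstances",
        "Dimensions": [{"Name": "AutoScalingGroupName", "Value": asg_name}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 5,
        "Threshold": 1.0,
        "ComparisonOperator": "LessThanThreshold",
        # Missing datapoints count as breaching, which covers a deleted group.
        "TreatMissingData": "breaching",
        "AlarmActions": [sns_topic_arn],
    }
```

The dictionary could be applied with `boto3.client("cloudwatch").put_metric_alarm(**no_nodes_alarm_kwargs(name, arn))`. The five-minute evaluation window is an arbitrary choice to avoid alerting on a single missed sample.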

Also, if the ASG metrics are free, why not enable them by default? Will it cause any issue to anyone? I guess not. So maybe there's no need to provide an option to enable/disable.

Just enable it by default! :)

Oct 27 '21 14:10 michelzanini

@mikestef9 in my use case we also use DataDog for monitoring and we have alerts for when the ASG is at or near max capacity. I'm not sure of a way to track that directly against the EKS Managed Node Group resources and as far as I know CloudWatch doesn't have any metrics for EKS?

Oct 27 '21 22:10 orirawlings

Moved to in progress. We are going to enable this flag for newly created managed node groups. Follow this issue for further updates.

Nov 01 '21 18:11 mikestef9

import boto3

eks = boto3.client('eks')
autoscaling = boto3.client('autoscaling')

# Walk every cluster, node group, and backing ASG in the current region.
clusters = eks.list_clusters()['clusters']
for cluster in clusters:
    print(f'cluster: {cluster}')
    nodegroups = eks.list_nodegroups(clusterName=cluster)["nodegroups"]
    for nodegroup in nodegroups:
        print(f'*nodegroup: {nodegroup}')
        autoScalingGroups = eks.describe_nodegroup(clusterName=cluster, nodegroupName=nodegroup)["nodegroup"]["resources"]["autoScalingGroups"]
        for autoScalingGroup in autoScalingGroups:
            print(f'##autoScalingGroup: {autoScalingGroup["name"]}')
            # Omitting Metrics enables all group metrics at 1-minute granularity.
            metricsResult = autoscaling.enable_metrics_collection(AutoScalingGroupName=autoScalingGroup["name"], Granularity="1Minute")
            print(f'@@@metricsResult: {metricsResult["ResponseMetadata"]["HTTPStatusCode"]}')

Python script to enable metrics on all ASGs.

Jan 18 '22 08:01 nahum-litvin-hs

Hey, any news on this issue?

May 03 '22 12:05 woernfl

Any updates?

May 12 '22 07:05 samuelbaena

Any updates on this?

Jul 20 '22 08:07 rpsadarangani

Any updates on this?

Sep 28 '22 14:09 yasinlachiny

Is there any update on this case?

Oct 04 '22 06:10 sebastian-bugajny

Is there any update on this? Our organization also needs this feature.

Oct 16 '22 17:10 akash123-eng

We also need this. Any update?

Oct 17 '22 06:10 vishnu-anil

+1 for this, it would be great to be able to configure this for managed node groups!

Oct 19 '22 19:10 sebas-w

Hey @mikestef9, do we have any updates on the progress of the implementation of this?

Nov 08 '22 14:11 gauravkohli

There is a new blog published today to enable this functionality using EventBridge and Lambda: https://aws.amazon.com/blogs/containers/automatically-enable-group-metrics-collection-for-amazon-eks-managed-node-groups/
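For readers who want a feel for the EventBridge-plus-Lambda approach, here is a minimal sketch of a handler. It is not the code from the blog post. It assumes the rule forwards CloudTrail `CreateAutoScalingGroup` events with the group name at `detail.requestParameters.autoScalingGroupName`, and it assumes that ASGs created by managed node groups have names starting with `eks-`. Both assumptions should be checked against real events. The third `autoscaling` parameter is a hypothetical hook for testing without AWS access.

```python
from typing import Any, Dict, Optional

# Assumption: ASGs created for EKS managed node groups use this name prefix.
MANAGED_ASG_PREFIX = "eks-"


def handler(event: Dict[str, Any], context: Any = None, autoscaling: Any = None) -> Optional[str]:
    """Enable group metrics on a newly created ASG; return its name, or None if skipped."""
    detail = event.get("detail") or {}
    if detail.get("eventName") != "CreateAutoScalingGroup":
        return None

    params = detail.get("requestParameters") or {}
    asg_name = params.get("autoScalingGroupName")
    if not asg_name or not asg_name.startswith(MANAGED_ASG_PREFIX):
        return None

    if autoscaling is None:
        # Imported lazily so the module can be exercised without boto3 installed.
        import boto3

        autoscaling = boto3.client("autoscaling")

    # Omitting Metrics enables all group metrics.
    autoscaling.enable_metrics_collection(
        AutoScalingGroupName=asg_name,
        Granularity="1Minute",
    )
    return asg_name
```

The function's execution role would need permission for `autoscaling:EnableMetricsCollection`. Existing node groups are not covered by an event-driven approach like this and would still need a one-off script.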

Nov 12 '22 04:11 aaroniscode

In all seriousness, this is not the solution people are asking for. It is a temporary workaround that offloads feature implementation to the customer, while everyone here is expecting a managed (as in Managed Node Group) solution from AWS.

Nov 12 '22 15:11 z0rc

I agree. Ideally, enabling auto scaling group metrics would be exposed as a field in the MNG API.

Nov 12 '22 21:11 orirawlings

Yeah, I would have liked this to be exposed in the AWS Terraform provider when creating an aws_eks_node_group. It feels like it should just be another attribute to pass in (true/false, etc.).

Nov 14 '22 13:11 lorelei-rupp-imprivata

If exposing an attribute takes too long to implement, why not change the behaviour to default to true, since these metrics are free? That way, later on you can add an attribute for people who want to disable it. I would imagine this should be a very quick implementation to do.

Nov 14 '22 13:11 michelzanini

> If exposing an attribute takes too long to implement, why not change the behaviour to default to true, since these metrics are free? That way, later on you can add an attribute for people who want to disable it. I would imagine this should be a very quick implementation to do.

I don't think it's free. They are ingested into CloudWatch and you still pay for that. I think that is why they are not on by default in the AWS Console: it does cost money. I could be wrong though.

Nov 14 '22 14:11 lorelei-rupp-imprivata

From the docs: "When group metrics are enabled, Amazon EC2 Auto Scaling sends the following metrics to CloudWatch. The metrics are available at one-minute granularity at no additional charge, but you must enable them."

https://docs.aws.amazon.com/autoscaling/ec2/userguide/ec2-auto-scaling-cloudwatch-monitoring.html

Nov 14 '22 14:11 michelzanini

I had the impression AWS was going to do this as seen on this comment:

Nov 14 '22 14:11 michelzanini
