
[EKS] [request]: EKS Control Plane Metrics Available In CloudWatch

Open crhuber opened this issue 4 years ago • 13 comments

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Tell us about your request

In some scenarios it is useful for Kubernetes operators to know the health of the EKS control plane. Some applications or pods may overload the control plane, and it is helpful to know when that happens. Having control plane metrics in CloudWatch such as:

  • apiserverRequestCount
  • apiserverRequestErrCount
  • apiserverLatencyBucket
  • kubeNodes
  • kubePods

can help customers diagnose slowness or unresponsiveness of the control plane.

Which service(s) is this request for? EKS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? Sometimes, when the control plane is slow, we would like to know whether there has been a spike in requests to the API, a spike in the number of errors, or a spike in newly created pods.

Are you currently working around this issue? Scraping the /metrics endpoint on the Kubernetes service
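
For context, a minimal sketch of that workaround, assuming it runs in a pod whose service account is bound to a ClusterRole granting `get` on the non-resource URL `/metrics` (the paths and environment variables below are the standard in-cluster defaults):

```python
# Minimal sketch of the workaround: scrape the API server's /metrics endpoint
# from inside a pod. Assumes the pod's service account is allowed "get" on the
# nonResourceURL "/metrics".
import os
import requests

TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"
CA_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"

def scrape_apiserver_metrics():
    host = os.environ["KUBERNETES_SERVICE_HOST"]
    port = os.environ.get("KUBERNETES_SERVICE_PORT", "443")
    with open(TOKEN_PATH) as f:
        token = f.read().strip()
    resp = requests.get(
        f"https://{host}:{port}/metrics",
        headers={"Authorization": f"Bearer {token}"},
        verify=CA_PATH,
        timeout=10,
    )
    resp.raise_for_status()
    # Prometheus text exposition format, e.g. apiserver_request_total{...} 123
    return resp.text

if __name__ == "__main__":
    for line in scrape_apiserver_metrics().splitlines():
        if line.startswith("apiserver_request_total"):
            print(line)
```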

crhuber avatar Mar 17 '20 09:03 crhuber

Hey everyone, I’m a Product Manager for CloudWatch. We are looking for people to join our beta program to provide feedback and test Prometheus metric monitoring in CloudWatch. The beta program will allow you to test the collection of the EKS Control Plane Metrics exposed as Prometheus metrics. Email us if interested, [email protected].

mchene avatar Mar 18 '20 00:03 mchene

Can we include the cluster component statuses in CloudWatch as well, for example:

  • kube controller manager
  • scheduler (http://localhost:8001/api/v1/componentstatuses/scheduler)

These can be used to set up a CloudWatch alarm for when a custom webhook breaks a component, for example, a newly installed ValidatingWebhook that breaks the scheduler's lease-renewal calls.
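
A rough sketch of that idea, assuming the kubernetes and boto3 Python clients and a made-up CloudWatch namespace (note that the ComponentStatus API is deprecated as of Kubernetes 1.19, so treat this as illustrative only):

```python
# Hypothetical sketch: poll componentstatuses and publish a 0/1 health metric
# to CloudWatch so an alarm can fire when e.g. the scheduler stops responding.
# The namespace "Custom/EKSControlPlane" is arbitrary, not an AWS-defined one.
import boto3
from kubernetes import client, config

def publish_component_health():
    config.load_incluster_config()   # or config.load_kube_config() when run locally
    v1 = client.CoreV1Api()
    cloudwatch = boto3.client("cloudwatch")

    for status in v1.list_component_status().items:
        healthy = any(
            c.type == "Healthy" and c.status == "True"
            for c in (status.conditions or [])
        )
        cloudwatch.put_metric_data(
            Namespace="Custom/EKSControlPlane",
            MetricData=[{
                "MetricName": "ComponentHealthy",
                "Dimensions": [{"Name": "Component", "Value": status.metadata.name}],
                "Value": 1.0 if healthy else 0.0,
            }],
        )
```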

starchx avatar Oct 19 '20 05:10 starchx

@starchx - https://aws.amazon.com/about-aws/whats-new/2020/09/amazon-cloudwatch-monitors-prometheus-metrics-container-environments/. You can use the CloudWatch agent with Prometheus support for the use case above. In the first phase (already available), we encourage you to configure the agent to consume control plane metrics for EKS and set up CloudWatch alarms on them. In the second phase, we will also build an automated, out-of-the-box dashboard for the EKS control plane. Check out this workshop to learn more: https://observability.workshop.aws/en/containerinsights/eks/_prometheusmonitoring.html
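
For illustration only, a hedged sketch of the alarm half of that setup using boto3; the namespace, metric, and dimension names below are assumptions that depend on how the agent's Prometheus scrape configuration maps metrics into CloudWatch:

```python
# Rough sketch, not a drop-in config: once the CloudWatch agent is scraping the
# control plane's Prometheus endpoint, an alarm can be defined on the resulting
# CloudWatch metric. Namespace/metric/dimension names here are assumptions.
import boto3

cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_metric_alarm(
    AlarmName="eks-apiserver-error-spike",
    Namespace="ContainerInsights/Prometheus",   # assumed namespace
    MetricName="apiserver_request_total",       # assumed mapped metric name
    Dimensions=[
        {"Name": "ClusterName", "Value": "my-cluster"},
        {"Name": "code", "Value": "500"},
    ],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=3,
    Threshold=100,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)
```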

tpsk-hub avatar Nov 04 '20 17:11 tpsk-hub

I don't mind scraping the endpoints myself since I use Datadog for monitoring, but not having access to the scheduler's or controller manager's metrics endpoints is tough. For example, without access to kube-scheduler, my team and I are unable to track "time to schedule a pod", which is a key service level indicator for us.

https://github.com/DataDog/integrations-core/blob/master/kube_scheduler/datadog_checks/kube_scheduler/kube_scheduler.py#L41
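
In the meantime, one rough approximation (not the scheduler's own histogram) is to derive per-pod scheduling latency from the PodScheduled condition; a hypothetical sketch with the kubernetes Python client:

```python
# Hypothetical workaround sketch: approximate "time to schedule a pod" from the
# PodScheduled condition's transition time, since kube-scheduler's own
# histograms are not reachable on EKS. This measures per-pod wall-clock time
# between creation and scheduling, not the scheduler's internal attempt latency.
from kubernetes import client, config

def scheduling_latencies(namespace="default"):
    config.load_kube_config()
    v1 = client.CoreV1Api()
    latencies = {}
    for pod in v1.list_namespaced_pod(namespace).items:
        scheduled = next(
            (c for c in (pod.status.conditions or [])
             if c.type == "PodScheduled" and c.status == "True"),
            None,
        )
        if scheduled and scheduled.last_transition_time:
            delta = scheduled.last_transition_time - pod.metadata.creation_timestamp
            latencies[pod.metadata.name] = delta.total_seconds()
    return latencies
```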

kespinola avatar Nov 20 '20 23:11 kespinola

Since it's been 8 months since @kespinola asked about kube-scheduler metrics, I'm checking back with AWS on the same question. Are there any plans to expose kube-scheduler metrics?

It looks like Container Insights Metrics and the Control Plane Metrics for EKS do not yet expose metrics from kube-scheduler.

rohitkothari avatar Jul 27 '21 20:07 rohitkothari

One of the most important metrics of them all, e2e_scheduling_duration_seconds, is not available. Can we please somehow get access to the scheduler metrics?

frimik avatar Aug 13 '21 19:08 frimik

My team is also trying to fetch and analyze the metrics reported by kube-scheduler. Can you folks please share the proposed timeline for this feature? At the very least, please add this component to the feature request, since a lot of folks need it, as is evident from the comments.

PrayagS avatar Feb 23 '22 05:02 PrayagS

+1 for exposing important control plane metrics, especially from kube-scheduler. It helps us understand the overall scheduling latency, which is especially useful when we have node groups mixing On-Demand and Spot Instances. The metric scheduler_pod_scheduling_duration_seconds would be useful in these use cases.

sumanthkumarc avatar Feb 23 '22 05:02 sumanthkumarc

We would love to have kube scheduler metrics available so we can scrape via Prometheus.

yuvraj9 avatar Feb 23 '22 05:02 yuvraj9

We are looking into this. Are there any other metrics of interest besides the ones mentioned already?

  • scheduler_pod_scheduling_duration_seconds
  • e2e_scheduling_duration_seconds

mikestef9 avatar Feb 23 '22 17:02 mikestef9

I'm not sure if this is in scope for the EKS control plane metrics, but we currently get all the kube_apiserver.* metrics from EKS into Datadog via a custom Helm chart post-install hook (the hook runs a kubectl patch svc/kubernetes command to add Datadog annotations, which allows our datadog-clusterchecks deployment to scrape the metrics from the API server). It would be nice if we could get them natively from CloudWatch instead.
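
For anyone doing something similar, a sketch of that patch with the kubernetes Python client instead of a Helm hook; the Datadog annotation keys and check configuration shown are illustrative assumptions rather than an exact recipe:

```python
# Sketch of the same idea with the Kubernetes Python client instead of a helm
# post-install hook running kubectl patch. The Datadog annotation keys/values
# are illustrative; check Datadog's cluster check docs for the exact form.
from kubernetes import client, config

def annotate_kubernetes_service():
    config.load_kube_config()
    v1 = client.CoreV1Api()
    patch = {
        "metadata": {
            "annotations": {
                "ad.datadoghq.com/service.check_names": '["kube_apiserver_metrics"]',
                "ad.datadoghq.com/service.init_configs": "[{}]",
                "ad.datadoghq.com/service.instances": (
                    '[{"prometheus_url": "https://%%host%%:%%port%%/metrics"}]'
                ),
            }
        }
    }
    # Patch the default "kubernetes" service so cluster checks discover it.
    v1.patch_namespaced_service("kubernetes", "default", patch)
```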

kr3cj avatar Feb 23 '22 18:02 kr3cj

Perhaps also:

  • scheduling_attempt_duration_seconds
  • pod_scheduling_attempts
  • pending_pods
  • scheduling_algorithm_duration_seconds

It may be worth noting that e2e_scheduling_duration_seconds has been replaced by scheduling_attempt_duration_seconds. The former is marked as Alpha status while the latter is considered Stable.


Taking a slightly different approach than listing individual metric names: might I suggest that all Stable metrics be made available?

I'm not sure whether Alpha-level metrics should get the same treatment. All of the requested metrics except scheduling_algorithm_duration_seconds (which only I have mentioned above) are Stable metrics.

Here is the current list of all Stable metrics:

  • framework_extension_point_duration_seconds
  • pending_pods
  • pod_scheduling_attempts
  • pod_scheduling_duration_seconds
  • preemption_attempts_total
  • preemption_victims
  • queue_incoming_pods_total
  • schedule_attempts_total
  • scheduling_attempt_duration_seconds

And all Alpha metrics:

  • e2e_scheduling_duration_seconds
  • permit_wait_duration_seconds
  • plugin_execution_duration_seconds
  • scheduler_cache_size
  • scheduler_goroutines
  • scheduling_algorithm_duration_seconds

elementalvoid avatar Feb 23 '22 18:02 elementalvoid

@vipin-mohan or others, any progress on this? We specifically need kube-scheduler metrics, in particular kube_pod_resource_request.

PS: the Kubernetes docs mention that the scheduler's resource metrics are available at the API endpoint /metrics/resources, while EKS only talks about /metrics. Could EKS expose the scheduler metrics through something similar, via a raw Kubernetes API?
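
To illustrate what "via a raw K8s API" could look like, here is a hedged sketch that fetches the API server's own /metrics through the Python client's generic call_api; kube-scheduler's /metrics and /metrics/resources are not reachable this way on EKS today, which is the gap being asked about:

```python
# Sketch only: fetch a raw, non-resource URL from the API server using the
# Kubernetes Python client. The scheduler's endpoints would need EKS to proxy
# or expose them similarly for this to cover kube_pod_resource_request.
from kubernetes import client, config

config.load_kube_config()
api_client = client.ApiClient()
data = api_client.call_api(
    "/metrics", "GET",
    auth_settings=["BearerToken"],
    response_type="str",
    _return_http_data_only=True,
    _preload_content=True,
)
print(data[:500])
```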

djmcgreal-cc avatar Apr 28 '23 11:04 djmcgreal-cc