
A metric to monitor the difference between actual and desired node count

shapirus opened this issue 8 months ago

Which component are you using?:

/area cluster-autoscaler

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:

We use a lot of AWS EC2 Spot instances in our instance groups.

Sometimes they fail to relaunch after termination when AWS spot capacity is insufficient, and this state can last for significant periods of time:

Could not launch Spot Instances. UnfulfillableCapacity - Unable to fulfill capacity due to your request configuration. Please adjust your request and try again. Launching EC2 instance failed.

We also use cluster-autoscaler.

It would be nice if it exported a new metric, e.g. cluster_autoscaler_unfulfilled_node_count{instancegroup="igname"} <count>, reflecting the difference between the desired number of instances and the number of Kubernetes nodes currently running in that instance group.
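For illustration, a minimal sketch of how such a gauge could be exposed, using just the plain Prometheus Go client (the names and the registration path here are my assumptions, not how CA's metrics package is actually wired up):

```go
// Hypothetical gauge; cluster-autoscaler's own metrics package may register this differently.
package metrics

import "github.com/prometheus/client_golang/prometheus"

var unfulfilledNodeCount = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Namespace: "cluster_autoscaler",
		Name:      "unfulfilled_node_count",
		Help:      "Desired node count minus the number of nodes currently running, per instance group.",
	},
	[]string{"instancegroup"},
)

func init() {
	prometheus.MustRegister(unfulfilledNodeCount)
}

// UpdateUnfulfilledNodeCount would be called from the main loop with the
// target size and the current node count for each node group.
func UpdateUnfulfilledNodeCount(instanceGroup string, desired, running int) {
	unfulfilledNodeCount.WithLabelValues(instanceGroup).Set(float64(desired - running))
}
```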

This would make it possible to visualize periods of low spot capacity in specific AWS availability zones for specific instance-group spot request configurations, and to optimize cluster configuration accordingly. Non-zero values on such a graph would indicate underprovisioned instance groups.

It would also make it possible to create alerts based on this metric.
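For example, a hypothetical alert rule could fire when cluster_autoscaler_unfulfilled_node_count{instancegroup="igname"} > 0 holds for, say, 15 minutes.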

Describe any alternative solutions you've considered.:

Of course, it's possible to build a DIY tool that produces such a metric by querying the AWS and Kubernetes APIs, but it would be nice to have it in CA: as far as I understand, CA already has all the numbers, and they would only need to be formatted and added to the metrics endpoint output.
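For reference, a rough sketch of what that DIY exporter might look like (assuming aws-sdk-go v1 and client-go; the metric name, poll interval, and the node label used to identify the instance group are illustrative and depend on how the nodes are provisioned):

```go
// Hypothetical standalone exporter: compares an ASG's desired capacity (AWS API)
// with the number of Ready nodes carrying the group's label (Kubernetes API).
package main

import (
	"context"
	"log"
	"net/http"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

var unfulfilled = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "diy_unfulfilled_node_count", // placeholder name
		Help: "Desired ASG capacity minus Ready nodes in the instance group.",
	},
	[]string{"instancegroup"},
)

func main() {
	prometheus.MustRegister(unfulfilled)

	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	asgClient := autoscaling.New(session.Must(session.NewSession()))

	igName := "igname" // placeholder instance group / ASG name

	go func() {
		for range time.Tick(30 * time.Second) {
			// Desired capacity from the AWS Auto Scaling group.
			out, err := asgClient.DescribeAutoScalingGroups(&autoscaling.DescribeAutoScalingGroupsInput{
				AutoScalingGroupNames: []*string{aws.String(igName)},
			})
			if err != nil || len(out.AutoScalingGroups) == 0 {
				log.Printf("describe ASG: %v", err)
				continue
			}
			desired := aws.Int64Value(out.AutoScalingGroups[0].DesiredCapacity)

			// Ready nodes labelled with the instance group name
			// (label key depends on how the nodes are provisioned).
			nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{
				LabelSelector: "kops.k8s.io/instancegroup=" + igName,
			})
			if err != nil {
				log.Printf("list nodes: %v", err)
				continue
			}
			ready := 0
			for _, n := range nodes.Items {
				for _, c := range n.Status.Conditions {
					if c.Type == corev1.NodeReady && c.Status == corev1.ConditionTrue {
						ready++
					}
				}
			}
			unfulfilled.WithLabelValues(igName).Set(float64(desired - int64(ready)))
		}
	}()

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

But having CA export this directly would avoid the extra AWS API calls and the extra deployment.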

shapirus · Mar 19 '25 18:03

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Jun 17 '25 18:06

/remove-lifecycle stale

shapirus · Jun 17 '25 19:06

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Sep 15 '25 19:09

/remove-lifecycle stale

shapirus · Sep 16 '25 06:09