A metric to monitor the difference between actual and desired node count
Which component are you using?:
/area cluster-autoscaler
Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:
We use a lot of AWS EC2 Spot instances in our instance groups.
Sometimes they fail to be re-launched after termination when spot capacity at AWS is insufficient, and this state can last for significant periods of time:
Could not launch Spot Instances. UnfulfillableCapacity - Unable to fulfill capacity due to your request configuration. Please adjust your request and try again. Launching EC2 instance failed.
We also use cluster-autoscaler.
It would be nice if it exported one new metric, e.g., cluster_autoscaler_unfulfilled_node_count{instancegroup="igname"} <count>, reflecting the difference between the currently desired number of instances and the number of k8s nodes actually running in that instance group.
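For illustration only, a minimal sketch of how such a gauge could be exposed, assuming the plain prometheus/client_golang API (cluster-autoscaler's own metrics package may be organized differently); the metric name and the instancegroup label come from the proposal above, while the package layout and the UpdateUnfulfilledNodeCount helper are hypothetical:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// unfulfilledNodeCount is a sketch of the proposed gauge; the name and the
// "instancegroup" label follow this request, it is not part of
// cluster-autoscaler today.
var unfulfilledNodeCount = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Namespace: "cluster_autoscaler",
		Name:      "unfulfilled_node_count",
		Help:      "Difference between the desired and the actually running node count per instance group.",
	},
	[]string{"instancegroup"},
)

func init() {
	prometheus.MustRegister(unfulfilledNodeCount)
}

// UpdateUnfulfilledNodeCount would be called from the autoscaler's main loop,
// which already knows both the group's target size and its registered nodes.
func UpdateUnfulfilledNodeCount(instanceGroup string, desired, running int) {
	unfulfilledNodeCount.WithLabelValues(instanceGroup).Set(float64(desired - running))
}
```

A gauge (rather than a counter) fits here because the value should drop back to zero once Spot capacity recovers.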
Such a metric would make it possible to visualize periods of low Spot capacity in specific AWS availability zones for specific instance group Spot request configurations, and to tune the cluster configuration accordingly. Non-zero values on such a graph would indicate underprovisioned instance groups.
It would also make it possible to create alerts based on this metric.
Describe any alternative solutions you've considered.:
Of course it's possible to build a DIY tool that produces such a metric by querying the AWS and Kubernetes APIs, but it would be nice to have it in CA itself: as far as I understand, CA already has all the numbers, and they only need to be formatted and added to the metrics endpoint output.
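For completeness, a rough sketch of what such a DIY comparison could look like, assuming aws-sdk-go (v1) and client-go, a placeholder ASG/instance group name "my-ig", and the kops node label kops.k8s.io/instancegroup for mapping nodes to instance groups (the label to use depends on how the cluster is provisioned):

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	ctx := context.Background()

	// Desired capacity of the ASG backing the instance group ("my-ig" is a placeholder).
	sess := session.Must(session.NewSession())
	asgClient := autoscaling.New(sess)
	out, err := asgClient.DescribeAutoScalingGroups(&autoscaling.DescribeAutoScalingGroupsInput{
		AutoScalingGroupNames: []*string{aws.String("my-ig")},
	})
	if err != nil || len(out.AutoScalingGroups) == 0 {
		log.Fatalf("describe ASG: %v", err)
	}
	desired := int(aws.Int64Value(out.AutoScalingGroups[0].DesiredCapacity))

	// Nodes currently registered for that group, selected by a provisioner-specific label.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatalf("kube config: %v", err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)
	nodes, err := clientset.CoreV1().Nodes().List(ctx, metav1.ListOptions{
		LabelSelector: "kops.k8s.io/instancegroup=my-ig",
	})
	if err != nil {
		log.Fatalf("list nodes: %v", err)
	}

	fmt.Printf("unfulfilled node count for my-ig: %d\n", desired-len(nodes.Items))
}
```

This duplicates work CA already does internally, which is why exposing the number directly from CA seems preferable.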
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale