
[cluster-autoscaler] More quickly mark spot ASG in AWS as unavailable if InsufficientInstanceCapacity

Open cep21 opened this issue 5 years ago • 33 comments

I have two ASGs: a spot ASG and an on-demand ASG. They are GPU nodes, so frequently spot instances aren't available. AWS tells us very quickly that a spot instance is unavailable: we can see "Could not launch Spot Instances. InsufficientInstanceCapacity - There is no Spot capacity available that matches your request. Launching EC2 instance failed" in the ASG activity logs.

The current behavior is that the autoscaler tries to use the spot ASG for 15 minutes (my current timeout) before it gives up and tries a non-spot ASG. Ideally, it would notice that the reason the ASG did not scale up, InsufficientInstanceCapacity, is unlikely to go away within the next 15 minutes, mark that group as unable to scale up, and fall back to the on-demand ASG.

cep21 avatar Jun 24 '20 20:06 cep21

Having the same issue here.

https://github.com/kubernetes/autoscaler/blob/852ea800914cae101824687a71236f7688ee653d/cluster-autoscaler/cloudprovider/aws/auto_scaling_groups.go#L220

SetDesiredCapacity will not return any error related to InsufficientInstanceCapacity according to its docs. We might need to check the scaling activities by calling DescribeScalingActivities instead, e.g.:

{
    "Activities": [
        {
            "ActivityId": "ee05cf07-241b-2f28-2be4-3b60f77a76e9",
            "AutoScalingGroupName": "nodes-gpu-spot-cn-north-1a.aws-cn-north-1.prod-1.k8s.local",
            "Description": "Launching a new EC2 instance.  Status Reason: There is no Spot capacity available that matches your request. Launching EC2 instance failed.",
            "Cause": "At 2020-08-06T03:20:39Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 0 to 1.",
            "StartTime": "2020-08-06T03:20:43.979Z",
            "EndTime": "2020-08-06T03:20:43Z",
            "StatusCode": "Failed",
            "StatusMessage": "There is no Spot capacity available that matches your request. Launching EC2 instance failed.",
            "Progress": 100,
            "Details": "{\"Subnet ID\":\"subnet-5d6fb339\",\"Availability Zone\":\"cn-north-1a\"}"
        },
        ...
    ]
}
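
For illustration, a rough sketch (not existing cluster-autoscaler code) of how such a check could look with aws-sdk-go; the function name, the "since" window, and the string match on the status message are all made up:

package main

import (
	"log"
	"strings"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// hasRecentCapacityFailure reports whether the ASG has a recent failed scaling
// activity whose status message mentions missing Spot capacity.
func hasRecentCapacityFailure(svc *autoscaling.AutoScaling, asgName string, since time.Time) (bool, error) {
	out, err := svc.DescribeScalingActivities(&autoscaling.DescribeScalingActivitiesInput{
		AutoScalingGroupName: aws.String(asgName),
		MaxRecords:           aws.Int64(10),
	})
	if err != nil {
		return false, err
	}
	for _, a := range out.Activities {
		if a.StartTime != nil && a.StartTime.Before(since) {
			continue // ignore activities older than our scale-up request
		}
		if aws.StringValue(a.StatusCode) == "Failed" &&
			strings.Contains(aws.StringValue(a.StatusMessage), "no Spot capacity available") {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	sess := session.Must(session.NewSession())
	// "nodes-gpu-spot" is a placeholder ASG name.
	failed, err := hasRecentCapacityFailure(autoscaling.New(sess), "nodes-gpu-spot", time.Now().Add(-5*time.Minute))
	if err != nil {
		log.Fatal(err)
	}
	log.Println("recent capacity failure:", failed)
}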

qqshfox avatar Aug 06 '20 03:08 qqshfox

I think the title of this issue should be amended to include other holding states. For example, I'm running into a similar issue with price-too-low. If the maximum spot price for my ASGs is below the current spot prices, cluster-autoscaler waits quite a while before it attempts to use a non-spot ASG.

JacobHenner avatar Sep 18 '20 18:09 JacobHenner

It's not just spot. Another example: you can hit your account limit on the number of instances of a specific instance type. That is also not likely to change in the next 15 minutes, and it's best to try another ASG.

A general understanding of failure states that are unlikely to change could be very helpful.

cep21 avatar Sep 18 '20 18:09 cep21

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Dec 17 '20 19:12 fejta-bot

Super important! /remove-lifecycle stale

cep21 avatar Dec 18 '20 20:12 cep21

Looking at the AWS API, it seems there is no reliable way to find out that the scale-out for a particular SetDesiredCapacity call has failed. If SetDesiredCapacity returned an ActivityId for the scaling activity, that would work. Otherwise, personally I can't come up with anything better than parsing the autoscaling activities that are "younger" than my SetDesiredCapacity API call. I don't feel this approach is production-ready. Any better ideas?

klebediev avatar Dec 21 '20 14:12 klebediev

I wouldn't expect anything that ties back to a single SetDesiredCapacity call, since it's async and there could be multiple calls.

parsing the autoscaling activities "younger" than my SetDesiredCapacity API call

Maybe look at just the last activity (rather than all of them): if it's recent (for some definition of recent), assume the capacity isn't able to change right now and quickly fail over any scaling operation.
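
A minimal sketch of that recency check (the window and helper name are made up, and it assumes the activities slice from DescribeScalingActivities is ordered newest first):

package asgcheck

import (
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// asgLooksStuck reports whether the most recent scaling activity is a failure
// that happened within the given window; if so, the caller could treat the ASG
// as unable to scale right now and fail over to another group.
func asgLooksStuck(activities []*autoscaling.Activity, window time.Duration) bool {
	if len(activities) == 0 {
		return false
	}
	last := activities[0] // assumption: activities are ordered newest first
	if aws.StringValue(last.StatusCode) != "Failed" {
		return false
	}
	return last.StartTime != nil && time.Since(*last.StartTime) < window
}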

cep21 avatar Dec 22 '20 16:12 cep21

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

fejta-bot avatar Mar 22 '21 17:03 fejta-bot

Super important! /remove-lifecycle stale

cep21 avatar Mar 22 '21 18:03 cep21

This is important for us too, same use case as OP.

itssimon avatar May 03 '21 14:05 itssimon

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

k8s-triage-robot avatar Aug 01 '21 14:08 k8s-triage-robot

/remove-lifecycle stale

azhurbilo avatar Aug 01 '21 18:08 azhurbilo

Any updates regarding this? It's super important for us, and I'm sure for many others. Also, where is this magic number "15 min" set? Is it configurable?

orsher avatar Aug 04 '21 05:08 orsher

I think the 15-minute magic number is set by "--max-node-provision-time" (which defaults to 15m). It would definitely be better, and a nice feature, to scan the scaling activities and immediately mark the ASG as dead for the next x minutes.
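
As a partial workaround (not a fix for the underlying issue), that timeout can be lowered when starting cluster-autoscaler, at the cost of giving slow-to-provision nodes less time to register, for example:

 $ cluster-autoscaler --cloud-provider=aws --max-node-provision-time=5m <other flags>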

atze234 avatar Nov 01 '21 10:11 atze234

What if we improve detection of the "ASG can't be scaled up" case by sending "failed to launch" notifications to an SNS topic, like:

 $ aws autoscaling put-notification-configuration --auto-scaling-group-name <value> --topic-arn <value> --notification-types "autoscaling:EC2_INSTANCE_LAUNCH_ERROR"

Then we can subscribe an SQS queue to this topic, and cluster-autoscaler can start polling that SQS queue after initiating a "scale up" activity.

As this approach requires some configuration effort, it should be disabled by default. But for use cases where fast detection of launch failures is useful, like with spot ASGs, users could configure the corresponding infrastructure (SNS topic, SQS queue, ASG notifications) and enable this "fail fast" detection method.
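
A rough sketch of the polling side (not part of cluster-autoscaler; the queue URL is a placeholder, and the notification fields parsed here are the ones I'd expect in the standard SNS-to-SQS envelope):

package main

import (
	"encoding/json"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sqs"
)

// snsEnvelope is the SNS-to-SQS wrapper; Message holds the ASG notification JSON.
type snsEnvelope struct {
	Message string `json:"Message"`
}

// asgNotification holds only the fields we care about from the notification.
type asgNotification struct {
	Event                string `json:"Event"`
	AutoScalingGroupName string `json:"AutoScalingGroupName"`
	StatusMessage        string `json:"StatusMessage"`
}

func main() {
	sess := session.Must(session.NewSession())
	svc := sqs.New(sess)
	queueURL := "https://sqs.us-east-1.amazonaws.com/123456789012/asg-launch-errors" // placeholder

	out, err := svc.ReceiveMessage(&sqs.ReceiveMessageInput{
		QueueUrl:            aws.String(queueURL),
		MaxNumberOfMessages: aws.Int64(10),
		WaitTimeSeconds:     aws.Int64(10), // long polling
	})
	if err != nil {
		log.Fatal(err)
	}
	for _, m := range out.Messages {
		var env snsEnvelope
		var note asgNotification
		if err := json.Unmarshal([]byte(aws.StringValue(m.Body)), &env); err != nil {
			continue
		}
		if err := json.Unmarshal([]byte(env.Message), &note); err != nil {
			continue
		}
		if note.Event == "autoscaling:EC2_INSTANCE_LAUNCH_ERROR" {
			// A real implementation would mark this node group as temporarily
			// unavailable so scale-up falls back to another ASG.
			log.Printf("launch error on %s: %s", note.AutoScalingGroupName, note.StatusMessage)
		}
	}
}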

klebediev avatar Nov 13 '21 09:11 klebediev

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Feb 11 '22 09:02 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Mar 13 '22 10:03 k8s-triage-robot

/remove-lifecycle rotten

itssimon avatar Mar 13 '22 11:03 itssimon

With the recent changes making use of eks:DescribeNodegroup (https://github.com/kubernetes/autoscaler/commit/b4cadfb4e25b6660c41dbe2b73e66e9a2f3a2cc9), can we use the health information from the node group (https://docs.aws.amazon.com/cli/latest/reference/eks/describe-nodegroup.html)? An unhealthy node group should be excluded from the calculation.
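
For illustration, a rough sketch of what such a health check could look like with aws-sdk-go (the cluster and node group names are placeholders, and where this would hook into cluster-autoscaler is an open question):

package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/eks"
)

// nodegroupHealthy returns false if DescribeNodegroup reports any health issues,
// which could be used to skip the group during scale-up planning.
func nodegroupHealthy(svc *eks.EKS, cluster, nodegroup string) (bool, error) {
	out, err := svc.DescribeNodegroup(&eks.DescribeNodegroupInput{
		ClusterName:   aws.String(cluster),
		NodegroupName: aws.String(nodegroup),
	})
	if err != nil {
		return false, err
	}
	health := out.Nodegroup.Health
	if health == nil || len(health.Issues) == 0 {
		return true, nil
	}
	for _, issue := range health.Issues {
		fmt.Printf("issue %s: %s\n", aws.StringValue(issue.Code), aws.StringValue(issue.Message))
	}
	return false, nil
}

func main() {
	sess := session.Must(session.NewSession())
	ok, err := nodegroupHealthy(eks.New(sess), "my-cluster", "nodes-gpu-spot") // placeholder names
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("healthy:", ok)
}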

piotrwielgolaski-tomtom avatar May 31 '22 08:05 piotrwielgolaski-tomtom

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Aug 29 '22 09:08 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Sep 28 '22 09:09 k8s-triage-robot

/remove-lifecycle rotten

miadabrin avatar Sep 29 '22 20:09 miadabrin

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Dec 28 '22 20:12 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Jan 27 '23 21:01 k8s-triage-robot

/remove-lifecycle rotten

theintz avatar Feb 02 '23 10:02 theintz

We are using the "priority" expander in our autoscaler config, which doesn't solve this case. With two AZs, if there is a rebalance recommendation on the ASG, sometimes Spot is unavailable in one AZ but it doesn't fall back to the on-demand node group. Is there a way we can make the fallback to on-demand happen?

decipher27 avatar Mar 06 '23 06:03 decipher27

Any updates on the fix for this case?

decipher27 avatar Mar 21 '23 13:03 decipher27

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jun 19 '23 13:06 k8s-triage-robot

/remove-lifecycle rotten

RamazanKara avatar Jun 27 '23 16:06 RamazanKara

Or at least a workaround? I can also verify it's not just spot. We're getting the same issue with a k8s cluster running on regular EC2 instances. We currently have 3 autoscaling groups, in us-east-2a, us-east-2b, and us-east-2c, that are stuck bouncing back and forth between max and max-1 because a zone rebalancing failed due to capacity in that zone.

ntkach avatar Jun 28 '23 15:06 ntkach