
Extended resources provided by ASG via tags are not working

Open · tombokombo opened this issue 2 years ago • 2 comments

Which component are you using?: autoscaler

What version of the component are you using?: 1.25.0-alpha.0 AND 1.23.1

What k8s version are you using (kubectl version)?:

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.9", GitCommit:"6df4433e288edc9c40c2e344eb336f63fad45cd2", GitTreeState:"clean", BuildDate:"2022-04-13T19:57:43Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.9", GitCommit:"c1de2d70269039fe55efb98e737d9a29f9155246", GitTreeState:"clean", BuildDate:"2022-07-13T14:19:57Z", GoVersion:"go1.17.11", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?: sandbox (AWS)

What did you expect to happen?: I tried to use an extended resource defined as a tag on the ASG. According to the AWS documentation, the tag should be k8s.io/cluster-autoscaler/node-template/resources/<resource-name>: https://github.com/kubernetes/autoscaler/blob/cluster-autoscaler-1.23.1/cluster-autoscaler/cloudprovider/aws/README.md#auto-discovery-setup . This should work at least in node-group-auto-discovery mode. Is anybody using it successfully?

What happened instead?: This tag is never read, or is ignored. CA keeps complaining about insufficient resources and does not scale up.

How to reproduce it (as minimally and precisely as possible):

  • Add an extended resource as described in https://kubernetes.io/docs/tasks/configure-pod-container/extended-resource/
  • Add a k8s.io/cluster-autoscaler/node-template/resources/<resource-name> tag with the same, reasonable value to the ASG (see the sketch after this list)
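For concreteness, here is a minimal sketch of the tagging step using aws-sdk-go. The ASG name my-asg is a placeholder, and the resource name example.com/dongle is borrowed from the linked Kubernetes task page:

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	svc := autoscaling.New(session.Must(session.NewSession()))

	// Tag the ASG so the autoscaler's template node should advertise the
	// extended resource. ResourceId, Key, and Value are placeholders.
	_, err := svc.CreateOrUpdateTags(&autoscaling.CreateOrUpdateTagsInput{
		Tags: []*autoscaling.Tag{{
			ResourceId:   aws.String("my-asg"),
			ResourceType: aws.String("auto-scaling-group"),
			Key: aws.String(
				"k8s.io/cluster-autoscaler/node-template/resources/example.com/dongle"),
			Value:             aws.String("4"),
			PropagateAtLaunch: aws.Bool(false),
		}},
	})
	if err != nil {
		log.Fatal(err)
	}
}
```

A pod then requests the same resource under spec.containers[].resources.requests (e.g. example.com/dongle: 1), which is what should trigger the scale-up.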

Anything else we need to know?: I did some tests; it looks like this line is never executed: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/aws_manager.go#L412
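To make the expectation explicit: the linked code is where ASG tags with the node-template resources prefix should be turned into extended resources on the template node. A simplified sketch of that mapping (function and package names here are illustrative, not the actual aws_manager.go code):

```go
package sketch

import (
	"strings"

	"k8s.io/apimachinery/pkg/api/resource"
)

const resourcesTagPrefix = "k8s.io/cluster-autoscaler/node-template/resources/"

// extractExtendedResources is an illustrative stand-in for the extraction
// the linked line performs: ASG tag key/value -> extended resource quantity.
func extractExtendedResources(tags map[string]string) map[string]resource.Quantity {
	out := map[string]resource.Quantity{}
	for key, val := range tags {
		if !strings.HasPrefix(key, resourcesTagPrefix) {
			continue
		}
		name := strings.TrimPrefix(key, resourcesTagPrefix)
		q, err := resource.ParseQuantity(val)
		if err != nil {
			continue // skip malformed quantities
		}
		out[name] = q
	}
	return out
}
```

If this path is never reached, the template node never advertises the resource, which matches the behaviour above.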

tombokombo · Sep 05 '22 10:09

We're seeing the same result on our end, but for labels: k8s.io/cluster-autoscaler/node-template/labels/<label-name>. It showed up during an upgrade from 1.22 to 1.23 on EKS; Cluster Autoscaler 1.23 shows the same behaviour, but on 1.22 we didn't have this problem.

The only thing I can see that changed here is maybe PR #4238, which plays around with labels? I had never checked this code in depth before, but extractAutoscalingOptionsFromTags takes a different approach from the usual one. Maybe that breaks something? Just posting in case it helps whoever takes this.

ZimmSebas · Sep 06 '22 20:09

@ZimmSebas sounds similar, but it's a somewhat different bug; yours could be related to https://github.com/kubernetes/autoscaler/pull/4238

The bug I'm describing is more complex. If you put e.g. custom-resource: 2 into a pod's requests/limits, the scale-up ends up here: https://github.com/kubernetes/autoscaler/blob/c38cc7460426b80ad60e63b7647f2c973a4e3878/cluster-autoscaler/core/scale_up.go#L463, because inside computeExpansionOption() the predicates fail (https://github.com/kubernetes/autoscaler/blob/c38cc7460426b80ad60e63b7647f2c973a4e3878/cluster-autoscaler/core/scale_up.go#L446) with predicate checking error: Insufficient custom-resource. It therefore never reaches https://github.com/kubernetes/autoscaler/blob/c38cc7460426b80ad60e63b7647f2c973a4e3878/cluster-autoscaler/core/scale_up.go#L509, the func that in the end tries to extract the k8s.io/cluster-autoscaler/node-template/resources tags.
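In other words, the ordering looks roughly like this (a conceptual sketch of the flow with simplified stand-in types and names, not the real scale_up.go identifiers):

```go
package sketch

type Pod struct{ Name string }
type NodeGroup struct{ ID string }

// Option collects the pods that fit a group's template node.
type Option struct {
	Group NodeGroup
	Pods  []Pod
}

// computeExpansionOption stands in for the real function: it runs the
// scheduler predicates against the group's template node. If the template
// does not advertise custom-resource, every pod fails with
// "Insufficient custom-resource" and the option comes back empty.
func computeExpansionOption(g NodeGroup, pods []Pod) Option {
	return Option{Group: g} // assume predicates rejected every pod
}

func scaleUp(groups []NodeGroup, pods []Pod) []Option {
	var options []Option
	for _, g := range groups {
		option := computeExpansionOption(g, pods)
		if len(option.Pods) == 0 {
			continue // option discarded here for every group ...
		}
		options = append(options, option)
	}
	// ... so execution never reaches the later step that reads the
	// node-template/resources tags, and the extended resource advertised
	// on the ASG never takes effect.
	return options
}
```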

tombokombo · Sep 07 '22 21:09

I spent a little while trying to track this down and couldn't figure out how to repro. I know we are using k8s.io/cluster-autoscaler/node-template/resources/ephemeral-storage on our clusters and it is working as expected. We're also on a slightly older version of cluster autoscaler. Is this a regression in behaviour, or has this always been broken? I'm not sure.

drmorr0 · Oct 17 '22 16:10

@drmorr0 which version and which cloudprovider?

tombokombo · Oct 26 '22 20:10

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Jan 24 '23 21:01