autoscaler icon indicating copy to clipboard operation
autoscaler copied to clipboard

Cluster autoscaler should remove unused "node-autoprovisioning-enabled" flag and the related metrics

Open elmiko opened this issue 2 years ago • 12 comments

Which component are you using?:

cluster-autoscaler

What version of the component are you using?:

Component version: 1.28.0

What k8s version are you using (kubectl version)?:

1.28.2

What environment is this in?:

this affects all environments

What did you expect to happen?:

I expect the node-autoprovisioning-enabled flag to change the behavior of the cluster-autoscaler when it is set or unset.

What happened instead?:

Changing this flag has no effect as it is essentially a left over from an earlier version of the autoscaler.

How to reproduce it (as minimally and precisely as possible):

Run the autoscaler with --node-autoprovisioning-enabled and without, there is no change in the behavior of the autoscaler.

Anything else we need to know?:

This issue was discussed during the SIG meeting from 2023-10-23, and agree that removal is safe and the function is no longer in use.

The node autoprovisioning metrics should also be removed as part of this effort, see https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/metrics/metrics.go#L352-L375

/good-first-issue /help

elmiko avatar Oct 26 '23 18:10 elmiko

@elmiko: This request has been marked as suitable for new contributors.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-good-first-issue command.

In response to this:

Which component are you using?:

cluster-autoscaler

What version of the component are you using?:

Component version: 1.28.0

What k8s version are you using (kubectl version)?:

1.28.2

What environment is this in?:

this affects all environments

What did you expect to happen?:

I expect the node-autoprovisioning-enabled flag to change the behavior of the cluster-autoscaler when it is set or unset.

What happened instead?:

Changing this flag has no effect as it is essentially a left over from an earlier version of the autoscaler.

How to reproduce it (as minimally and precisely as possible):

Run the autoscaler with --node-autoprovisioning-enabled and without, there is no change in the behavior of the autoscaler.

Anything else we need to know?:

This issue was discussed during the SIG meeting from 2023-10-23, and agree that removal is safe and the function is no longer in use.

The node autoprovisioning metrics should also be removed as part of this effort, see https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/metrics/metrics.go#L352-L375

/good-first-issue /help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Oct 26 '23 18:10 k8s-ci-robot

if any new contributors are interested in picking this issue up, please feel free to reach out on slack (@elmiko), i'm happy to give guidance.

elmiko avatar Oct 26 '23 21:10 elmiko

I am a beginner. I want to do this issue

prashantrewar avatar Oct 27 '23 05:10 prashantrewar

great @prashantrewar , happy to help.

in my initial analysis, we currently have 2 command line flags which appear to no longer be in use: --node-autoprovisioning-enabled, and --max-autoprovisioned-node-group-count.

the flags are created here: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/main.go#L193-L194

there are a few task which need to be done:

  1. investigate these flags to ensure they are not being used in the production code (not tests)
  2. investigate the related metics to ensure that they are not used.
  3. if you can confirm they are not being used, we should remove nearly all the code for the flags and metrics.
  4. we need to keep the command line flags available, but they should print a warning that they are deprecated.

in the SIG meeting we agreed that we should probably follow the deprecation guidelines for these flags. this means that we need to continue to accept the flags for 12 months or 2 releases. this is why we want to print a warning when the flags are used, but otherwise remove the code behind them as it appears to do nothing.

when this is resolved we need to make sure that the release notes contain a message about the deprecations, that will come towards the end when you are merging patches.

please let me know if you would like more details on any of those points, and thanks again for volunteering!

elmiko avatar Oct 27 '23 13:10 elmiko

/assign @prashantrewar

elmiko avatar Oct 27 '23 13:10 elmiko

/triage accepted

Shubham82 avatar Nov 28 '23 11:11 Shubham82

This issue can be closed. Fixed by #6246

Vandit1604 avatar Jan 07 '24 01:01 Vandit1604

Closing this issue as it is resolved by PR #6246.

Shubham82 avatar Jan 08 '24 05:01 Shubham82

/close

Shubham82 avatar Jan 08 '24 05:01 Shubham82

@Shubham82: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Jan 08 '24 05:01 k8s-ci-robot

I don't think #6246 fixed the issue, as it only deprecated the flags and did nothing about the metrics.

I fully agree with removing the flags and the nap_enabled metric. Having them there but not changing behavior doesn't make much sense.

I'm less convinced about removing the created_node_groups_total and deleted_node_groups_total metrics. We do have the NodeGroupManager interface which allows downstream CA implementations to create and remove node groups during the CA loop. GKE Cluster Autoscaler uses NodeGroupManager heavily, not sure if we have other downstream implementations. In any case, the concept of "creating a node group" and "removing a node group" is present and needed in core CA, even if none of the cloud providers in the core repo take advantage of it. So if CA provides a way to create and remove node groups, it makes sense to me that it also provides some auxiliary things like metrics, that the implementations can reuse.

On the other hand, are there any downsides to leaving the 2 metrics in place? Maybe we could stop registering them by default if possible confusion is the problem?

@elmiko WDYT?

towca avatar Mar 13 '24 17:03 towca

really good points @towca , if there is a possibility for users to supplement the autoscaler and use those metrics, then i think we should keep them and document this usage.

On the other hand, are there any downsides to leaving the 2 metrics in place? Maybe we could stop registering them by default if possible confusion is the problem?

probably not, but we should still document about what is happening in that case.

elmiko avatar Mar 14 '24 16:03 elmiko