autoscaler icon indicating copy to clipboard operation
autoscaler copied to clipboard

Extract criteria for removing unneded nodes to a separate package

Open x13n opened this issue 2 years ago • 4 comments

Which component this PR applies to?

cluster-autoscaler

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

One more refactor of existing scale down code in order to make parallel scale down (https://github.com/kubernetes/autoscaler/issues/5079) possible.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

This is a follow-up to https://github.com/kubernetes/autoscaler/pull/5133. Only last 2 commits should be reviewed as a part of this PR. It will be on hold until the previous one is merged.

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


x13n avatar Aug 30 '22 09:08 x13n

/hold /assign @MaciekPytel /assign @yaroslava-serdiuk

x13n avatar Aug 30 '22 09:08 x13n

/hold cancel

x13n avatar Sep 30 '22 14:09 x13n

I think this will work and I'm going to block it anymore, but I don't really like this refactor for the reasons I listed in comments above. I'm leaving hold so you can have a chance to read my comments and decide if you want to address them or keep the current version. Please feel free to remove the hold.

/lgtm /approve /hold

MaciekPytel avatar Oct 14 '22 18:10 MaciekPytel

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MaciekPytel, x13n

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Oct 14 '22 18:10 k8s-ci-robot

I understand where the concern is coming from, but I disagree caching information is wrong in this case. Update / RemovableAt semantics are well defined: Update updates Nodes object with the last known state of the world, while RemovableAt returns the list of nodes that are removable at a certain timestamp, based on past Update calls. This is going to hold in any implementation that understands such semantics. I don't think Nodes object should be aware of autoscaler loops, pods or nodes being filtered out or anything that happens beyond this object. It's the other way around: if someone is trying to use Nodes anywhere in the code, they need to make sure they are using it properly. Omitting an Update call is not a problem of the object: it is a problem of the caller: if they didn't provide fresh information, the list of removable nodes may be inaccurate. Note that it will carries the risk of being inaccurate regardless of when the call happened: CA is making in-memory decisions based on snapshotted state of a distributed system. For its decisions to be accurate, we would have to freeze all changes in the system during the decision making, which is infeasible, for obvious reasons. That being said, I agree the code could be improved (it always can), but I'd like to merge it as is: it is a strict improvement over the previous state, it already enables further changes to implement parallel drain and it has been hanging for over 1.5 month now.

/hold cancel

x13n avatar Oct 17 '22 18:10 x13n