kops
Add Ability to Configure Custom PriorityClasses to Be Used in Cluster Validation
/kind feature
Currently kops validate cluster checks pods with the priority classes system-cluster-critical and system-node-critical. It would be great for users to be able to specify additional priority classes for kops to validate the associated pods.
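A rough sketch of the idea, just to make it concrete (this is not the actual kops code; the podsToValidate helper and its extraPriorityClasses parameter are made up for illustration):

```go
package validation

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// podsToValidate returns the pods whose priority class is in the set that
// validation should check: the two hard-coded system classes plus any
// user-supplied ones.
func podsToValidate(ctx context.Context, client kubernetes.Interface, extraPriorityClasses []string) ([]corev1.Pod, error) {
	classes := map[string]bool{
		"system-cluster-critical": true,
		"system-node-critical":    true,
	}
	for _, pc := range extraPriorityClasses {
		classes[pc] = true
	}

	podList, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, fmt.Errorf("listing pods: %w", err)
	}

	var matched []corev1.Pod
	for _, pod := range podList.Items {
		if classes[pod.Spec.PriorityClassName] {
			matched = append(matched, pod)
		}
	}
	return matched, nil
}
```

The point is only that the set of priority classes becomes configurable instead of hard-coded.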
Sounds like a nice feature. Would be happy to review a PR for this.
I'll raise you one: also filter for certain namespaces, because right now all namespaces are used, if I interpret the code correctly:
https://github.com/kubernetes/kops/blob/ea7df00719357fc9f3d2db515016b2a70f24815b/pkg/validation/validate_cluster.go#L204
This can be quite annoying when other teams deploy their broken "system-critical" pods to the cluster and then the rollout does not work. :|
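Roughly what I mean, as a sketch only (again, not the current kops code; the validateNamespaces parameter is made up):

```go
package validation

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// podsInNamespaces lists pods only in the given namespaces instead of across
// the whole cluster, so pods from unrelated teams don't affect validation.
func podsInNamespaces(ctx context.Context, client kubernetes.Interface, validateNamespaces []string) ([]corev1.Pod, error) {
	// Fall back to the current behaviour (all namespaces) when nothing is configured.
	if len(validateNamespaces) == 0 {
		validateNamespaces = []string{metav1.NamespaceAll}
	}

	var pods []corev1.Pod
	for _, ns := range validateNamespaces {
		podList, err := client.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{})
		if err != nil {
			return nil, fmt.Errorf("listing pods in namespace %q: %w", ns, err)
		}
		pods = append(pods, podList.Items...)
	}
	return pods, nil
}
```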
I will see if I can whip something up, unless @tarat44 or someone else is already working on something like this at the moment.
Mind you, if you allow teams to use those priority classes where they are not appropriate, you will also run into issues with those pods evicting pods that may be more important.
Well, they are not using the priorities for the wrong things per se (cluster monitoring, ELK), but it may be a good idea to introduce other priority classes for that (and it's just our test clusters that are affected, but it's nevertheless annoying when rollouts don't work just because of faulty pods :D).
Introducing a separate priority class might be an interim solution for us, though. Don't know why I didn't think of that at first. Thx
@ederst I have not had time to start work on this, so feel free to take it on if the feature is appealing to you too.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
Still want to do this, but it has low priority; let's see if I can get to it:
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
Did something: #15165
/remove-lifecycle stale
Could we get more information on what you are trying to achieve with this feature? What is the use case? Why are PodDisruptionBudgets not sufficient?
The use case would be to have, for example, a CLI option to provide additional, custom priority classes which kOps takes into consideration when validating the cluster.
This means that kOps would also look at the state of pods that have the provided priority classes, not only those with the currently hard-coded system-cluster-critical and system-node-critical priorities.
Additionally, I want to expand this idea to also limit the namespaces, and maybe filter by labels on the pods kOps should look at when validating.
This might be beneficial for users with bigger clusters as well, as it could limit the number of queried pods.
At least that's what I'm trying here: #15165
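To make that a bit more concrete, here is the kind of flag I have in mind, sketched with cobra (which kops uses for its CLI); all flag and field names here are hypothetical and not necessarily what #15165 implements:

```go
package main

import (
	"fmt"

	"github.com/spf13/cobra"
)

// Hypothetical options; real names (if this lands) would be decided in review.
type validateOptions struct {
	PriorityClasses []string // additional priority classes whose pods are checked
	Namespaces      []string // restrict pod validation to these namespaces
	PodSelector     string   // label selector limiting which pods are validated
}

func main() {
	opts := &validateOptions{}
	cmd := &cobra.Command{
		Use:   "validate cluster",
		Short: "Validate a cluster (sketch)",
		RunE: func(cmd *cobra.Command, args []string) error {
			fmt.Printf("would also validate pods with priority classes %v in namespaces %v (selector %q)\n",
				opts.PriorityClasses, opts.Namespaces, opts.PodSelector)
			return nil
		},
	}
	cmd.Flags().StringSliceVar(&opts.PriorityClasses, "additional-priority-class", nil, "extra priority classes to include in validation")
	cmd.Flags().StringSliceVar(&opts.Namespaces, "validate-namespace", nil, "only validate pods in these namespaces")
	cmd.Flags().StringVar(&opts.PodSelector, "pod-selector", "", "label selector limiting validated pods")
	if err := cmd.Execute(); err != nil {
		fmt.Println(err)
	}
}
```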
You are describing solutions there, not use cases. The question is: what does this address that cannot be addressed by PDBs?
OK, I am confused. Point taken that I described a solution, but I do not know how to connect PodDisruptionBudgets to the ability of kOps to say a cluster is "valid".
To my understanding, PDBs are there to make sure that when pods are evicted from nodes, there is always a specified number of pods running, so that the service the pods provide stays available.
Whereas kops validate, and the validation during rolling updates, makes sure that the current state of the cluster is "OK" according to the parameters kOps checks.
So at least in my mind it makes more sense to argue "Why not improve your monitoring accordingly, rather than rely on the validate functionality of kOps?", but maybe I am missing something here, or not aware of something?
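For reference, a minimal PDB of the kind I mean, sketched with client-go types (the namespace, labels, and kubeconfig handling are made up for the example):

```go
package main

import (
	"context"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the default kubeconfig location.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Keep at least one "app=elk" pod available during voluntary disruptions,
	// such as the node drains a rolling update performs.
	minAvailable := intstr.FromInt(1)
	pdb := &policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: "elk", Namespace: "monitoring"},
		Spec: policyv1.PodDisruptionBudgetSpec{
			MinAvailable: &minAvailable,
			Selector:     &metav1.LabelSelector{MatchLabels: map[string]string{"app": "elk"}},
		},
	}

	if _, err := client.PolicyV1().PodDisruptionBudgets(pdb.Namespace).Create(context.TODO(), pdb, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```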
If we look only at the validation step evaluating pod state, the primary use case is about control plane pods. These don't have PDBs and aren't "evicted" as such, but go down with the node.
We are also using validation on other system pods, but one may say this is because we didn't use PDBs at the time. Today we do, so we could technically drop the priority class validation.
There are three existing primary use cases for cluster validation. One, primarily upon cluster creation, is to wait for a cluster to be able to run workloads. The other two are during rolling update: to know when the control plane is healthy enough to disrupt a control plane node and to know when new nodes in an instance group are able to run workloads so that other nodes in the instance group can be taken out of capacity.
OK, I think I understand now: validation should care about the general cluster health and should not care about the other workloads running on it.
This means the role PDBs play in this validation process is a passive one: they safeguard the workloads they are assigned to from eviction during a rolling update so that those workloads stay up and running. Thus, kOps should not care about those workloads in the validation, other than not being able to evict a node if a PDB's budget is not met.
Edit (from Slack):
The role of the PDBs is now clearer to me: if there is a problem with new nodes and they cannot run workloads but are considered healthy by the other validations in place, the workloads with PDBs will block the eviction of further nodes down the line.
In light of this, I have to think about whether adding a user-defined way of adapting the validation even makes sense or whether there is a use case, but to be honest I am more a fan of just getting rid of the priority class validation, like @olemarkus mentioned.
So for me, I am in favor of getting rid of the "cluster prio" validation and relying on the PDBs.
However, a possible downside could be a workload with, say, 2 pods running on the first and the last node to be evicted. In that case a possible error (the other pod not being able to start) is only caught when the last node is updated, which could be tedious in a cluster with many nodes. On the other hand, testing the upgrade on a dev/test/int cluster should have caught the error beforehand anyway.
So unless @tarat44 can come up with something, I am not pursuing this further.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.