kops
Add Ability to Configure Custom PriorityClasses to Be Used in Cluster Validation
/kind feature
Currently kops validate cluster checks pods with the priority classes system-cluster-critical and system-node-critical. It would be great for users to be able to specify additional priority classes for kops to validate the associated pods.
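A rough sketch of the idea, just to make it concrete (this is not the actual kops code; the podsToValidate helper and its extraPriorityClasses parameter are made up for illustration):

```go
package validation

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// podsToValidate returns the pods whose priority class is in the set that
// validation should check: the two hard-coded system classes plus any
// user-supplied ones.
func podsToValidate(ctx context.Context, client kubernetes.Interface, extraPriorityClasses []string) ([]corev1.Pod, error) {
	classes := map[string]bool{
		"system-cluster-critical": true,
		"system-node-critical":    true,
	}
	for _, pc := range extraPriorityClasses {
		classes[pc] = true
	}

	podList, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, fmt.Errorf("listing pods: %w", err)
	}

	var matched []corev1.Pod
	for _, pod := range podList.Items {
		if classes[pod.Spec.PriorityClassName] {
			matched = append(matched, pod)
		}
	}
	return matched, nil
}
```

The point is only that the set of priority classes becomes configurable instead of hard-coded.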
Sounds like a nice feature. Would be happy to review a PR for this.
I'll raise you one: also filter for certain namespaces, because right now all namespaces are used, if I interpret the code correctly:
https://github.com/kubernetes/kops/blob/ea7df00719357fc9f3d2db515016b2a70f24815b/pkg/validation/validate_cluster.go#L204
This can be quite annoying when other teams deploy their broken "system-critical" pods to the cluster and then the rollout does not work. :|
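Roughly what I mean, as a sketch only (again, not the current kops code; the validateNamespaces parameter is made up):

```go
package validation

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// podsInNamespaces lists pods only in the given namespaces instead of across
// the whole cluster, so pods from unrelated teams don't affect validation.
func podsInNamespaces(ctx context.Context, client kubernetes.Interface, validateNamespaces []string) ([]corev1.Pod, error) {
	// Fall back to the current behaviour (all namespaces) when nothing is configured.
	if len(validateNamespaces) == 0 {
		validateNamespaces = []string{metav1.NamespaceAll}
	}

	var pods []corev1.Pod
	for _, ns := range validateNamespaces {
		podList, err := client.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{})
		if err != nil {
			return nil, fmt.Errorf("listing pods in namespace %q: %w", ns, err)
		}
		pods = append(pods, podList.Items...)
	}
	return pods, nil
}
```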
I will see if I can whip something up, unless @tarat44 or someone else is already working on something like this at the moment.
Mind you, if you allow teams to use those priority classes where they are not appropriate, you will also run into issues with those pods evicting pods that may be more important.
Well, they are not using the priorities for the wrong things per se (cluster monitoring, ELK), but it may be a good idea to introduce other priority classes for that (and it's just our test clusters that are affected, but it's nevertheless annoying when rollouts don't work just because of faulty pods :D).
Introducing a separate priority class might be an interim solution for us, though. Don't know why I didn't think of that at first. Thx
@ederst I have not had time to start work on this, so feel free to take it on if the feature is appealing to you too.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
Still want to do this, but it has low priority; let's see if I can get to it:
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
Did something: #15165
/remove-lifecycle stale
Could we get more information on what you are trying to achieve with this feature? What is the use case? Why are PodDisruptionBudgets not sufficient?
The use case would be to have, for example, a CLI option to provide additional, custom priority classes which kOps takes into consideration when validating the cluster.
This means that kOps would also look at the state of pods that have the provided priority classes, not only those with the currently hard-coded system-cluster-critical and system-node-critical priorities.
Additionally, I want to expand this idea to also limit the namespaces, and maybe filter by labels on the pods kOps should look at when validating.
This might be beneficial for users with bigger clusters as well, as it could limit the number of queried pods.
At least that's what I'm trying here: #15165
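To make that a bit more concrete, here is the kind of flag I have in mind, sketched with cobra (which kops uses for its CLI); all flag and field names here are hypothetical and not necessarily what #15165 implements:

```go
package main

import (
	"fmt"

	"github.com/spf13/cobra"
)

// Hypothetical options; real names (if this lands) would be decided in review.
type validateOptions struct {
	PriorityClasses []string // additional priority classes whose pods are checked
	Namespaces      []string // restrict pod validation to these namespaces
	PodSelector     string   // label selector limiting which pods are validated
}

func main() {
	opts := &validateOptions{}
	cmd := &cobra.Command{
		Use:   "validate cluster",
		Short: "Validate a cluster (sketch)",
		RunE: func(cmd *cobra.Command, args []string) error {
			fmt.Printf("would also validate pods with priority classes %v in namespaces %v (selector %q)\n",
				opts.PriorityClasses, opts.Namespaces, opts.PodSelector)
			return nil
		},
	}
	cmd.Flags().StringSliceVar(&opts.PriorityClasses, "additional-priority-class", nil, "extra priority classes to include in validation")
	cmd.Flags().StringSliceVar(&opts.Namespaces, "validate-namespace", nil, "only validate pods in these namespaces")
	cmd.Flags().StringVar(&opts.PodSelector, "pod-selector", "", "label selector limiting validated pods")
	if err := cmd.Execute(); err != nil {
		fmt.Println(err)
	}
}
```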
You are describing solutions there, not use cases. The question is: what does this address that cannot be addressed by PDBs?
OK, I am confused. Point taken that I described a solution, but I do not know how to connect PodDisruptionBudgets to the ability of kOps to say a cluster is "valid".
To my understanding, PDBs are there to make sure that when pods are evicted from nodes, there is always a specified number of pods running, so that the service the pods provide stays available.
Whereas kops validate, and the validation during rolling updates, makes sure that the current state of the cluster is "OK" according to the parameters kOps checks.
So at least in my mind it makes more sense to argue "Why not improve your monitoring accordingly, rather than rely on the validate functionality of kOps?", but maybe I am missing something here, or not aware of something?
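For reference, a minimal PDB of the kind I mean, sketched with client-go types (the namespace, labels, and kubeconfig handling are made up for the example):

```go
package main

import (
	"context"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the default kubeconfig location.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Keep at least one "app=elk" pod available during voluntary disruptions,
	// such as the node drains a rolling update performs.
	minAvailable := intstr.FromInt(1)
	pdb := &policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: "elk", Namespace: "monitoring"},
		Spec: policyv1.PodDisruptionBudgetSpec{
			MinAvailable: &minAvailable,
			Selector:     &metav1.LabelSelector{MatchLabels: map[string]string{"app": "elk"}},
		},
	}

	if _, err := client.PolicyV1().PodDisruptionBudgets(pdb.Namespace).Create(context.TODO(), pdb, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```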
If we look only at the validation step evaluating pod state, the primary use case is about control plane pods. These don't have PDBs and aren't "evicted" as such, but go down with the node.
We are also using validation on other system pods, but one may say this is because we didn't use PDBs at the time. Today we do, so we could technically drop the priority class validation.
There are three existing primary use cases for cluster validation. One, primarily upon cluster creation, is to wait for a cluster to be able to run workloads. The other two are during rolling update: to know when the control plane is healthy enough to disrupt a control plane node and to know when new nodes in an instance group are able to run workloads so that other nodes in the instance group can be taken out of capacity.
OK, I think I understand now: validation should care about the general cluster health and should not care about the other workloads running on it.
This means the role PDBs play in this validation process is a passive one: they safeguard the workloads they are assigned to from eviction during a rolling update so that those workloads stay up and running. Thus, kOps should not care about those workloads in the validation, other than not being able to evict a node if a PDB's budget is not met.
Edit (from Slack):
The role of the PDBs is now clearer to me: if there is a problem with new nodes and they cannot run workloads but are considered healthy by the other validations in place, the workloads with PDBs will block the eviction of further nodes down the line.
In light of this, I have to think about whether adding a user-defined way of adapting the validation even makes sense or whether there is a use case, but to be honest I am more a fan of just getting rid of the priority class validation, like @olemarkus mentioned.
So for me, I am in favor of getting rid of the "cluster prio" validation and relying on the PDBs.
However, a possible downside could be a workload with, say, 2 pods running on the first and the last node to be evicted. In that case a possible error (the other pod not being able to start) is only caught when the last node is updated, which could be tedious in a cluster with many nodes. On the other hand, testing the upgrade on a dev/test/int cluster should have caught the error beforehand anyway.
So unless @tarat44 can come up with something, I am not pursuing this further.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.