Implicit tolerations
Enhancement Description
Administrators often taint nodes with high-value resources like GPUs, to avoid them being consumed by workloads that do not need them. To simplify the user experience, some platforms (e.g., GKE) run a webhook to automatically tolerate those taints, if the pods have extended resource requests for those resources. This ensures that pods still run even if the user forgets to add the toleration, but only for those pods that actually need it.
With the advent of DRA, the exact needs of the workload are no longer determinable simply by looking at the PodSpec during API admission. Instead, the resource claims and device classes must also be examined. Additionally, the optionality available in DRA resource claim APIs may mean that several different types of nodes/resources (and therefore several different types of tolerations) are needed. A webhook does not have access to all the information it would need to add the tolerations at API admission time.
We discussed adding a "high value resource" aspect to node capabilities, but after further discussion it's not clear that's the right way to solve this problem. This enhancement request provides an alternative approach.
In this approach, we create a new scheduler plugin (or update the existing taints & tolerations plugin), which can be configured to examine the PodSpec and all associated Resource Claims and DeviceClasses at scheduling time and, based on the needs of the workload, implicitly tolerate taints. Essentially, we move the behavior of the webhook from API server admission time to Pod scheduling time. This makes all of the necessary information available.
The specific way to calculate the tolerations, and the taints which they will tolerate will likely need to be part of the configuration of the scheduler plugin, since it is not known upstream what those taints are and when/how they should be tolerated.
This approach requires no new user-facing APIs, and Pods that must run on tainted nodes but do not actually need the specialized device (such as management pods) can still be configured with the appropriate tolerations explicitly.
/cc @pohly @klueska @pravk03 @dom4ha @dchen1107
/sig scheduling
/wg device-management
- One-line enhancement description (can be used as a release note): Enable configuration of the scheduler to implicitly tolerate taints based on data found in the PodSpec, Resource Claims, and Device Classes
- Kubernetes Enhancement Proposal: TBD
- Discussion Link:
- https://github.com/kubernetes-sigs/scheduler-plugins/pull/812#issuecomment-2689201925
- https://github.com/kubernetes/community/pull/8396#discussion_r2058850619
- Primary contact (assignee): @johnbelamaric
- Responsible SIGs: Scheduling
- Enhancement target (which target equals to which milestone):
- Alpha release target (x.y): 1.34
- Beta release target (x.y):
- Stable release target (x.y):
- [ ] Alpha
  - [ ] KEP (k/enhancements) update PR(s): https://github.com/kubernetes/enhancements/pull/5389
  - [ ] Code (k/k) update PR(s):
  - [ ] Docs (k/website) update PR(s):
Please keep this description up to date. This will help the Enhancement Team to track the evolution of the enhancement efficiently.
/cc
I see the use case for devices or non-core resources. I'm thinking especially about exclusive CPUs, but memory/hugepages can have the same problem. IIUC core resources (cpu, memory) are out of scope of this specific KEP and they are expected to be handled as part of the node capabilities KEP, right?
The idea here is just to build on top of taints/tolerations for the "repel" use case. The "attract"/"constrain" use case - the automatic equivalent of label selection, basically - is not really covered here and would be part of the capabilities concept.
This KEP doesn't go into specifics; I think it should provide a framework that cluster architects can use to configure the scheduler to do what they want. I imagine a list of rules that could be defined for the scheduler, each running a CEL expression against the PodSpec, Resource Claims, and Device Classes. If the result is "true", the pod is scheduled as if the user had put a specific toleration on it. For example, the plugin could accept a config with a list of data structures like:
type ImplicitTolerationRule struct {
	// CEL expression evaluated against the PodSpec and its associated
	// ResourceClaims and DeviceClasses.
	Expression string
	// Toleration applied (for scheduling purposes) when the expression
	// evaluates to true.
	Toleration corev1.Toleration
}
It would evaluate each expression against the whole "package" of scheduling constraints: the PodSpec, all associated ResourceClaims, and the DeviceClasses those claims reference. Any expression that returned "true" would result in that Toleration being applied during scheduling.
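To make that concrete, here is a minimal, purely illustrative sketch of one such rule, assuming the ImplicitTolerationRule struct above; the CEL expression, extended resource name, and taint key are made up and not part of any existing API:

package example

import (
	corev1 "k8s.io/api/core/v1"
)

// Example rule (sketch): if any container requests the hypothetical extended
// resource "example.com/gpu", schedule the pod as if it tolerated the
// "has-gpu" NoSchedule taint. Assumes the ImplicitTolerationRule struct
// sketched above is in scope.
var gpuRule = ImplicitTolerationRule{
	Expression: `pod.spec.containers.exists(c, "example.com/gpu" in c.resources.requests)`,
	Toleration: corev1.Toleration{
		Key:      "has-gpu",
		Operator: corev1.TolerationOpExists,
		Effect:   corev1.TaintEffectNoSchedule,
	},
}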
The nice thing here is that because it relies on the existing "taints" functionality, users can still manually add a toleration if they need to run a Pod that does not use the high-value resources on a node that has them.
/cc
I see the use case for devices or non-core resources. I'm thinking especially about exclusive CPUs, but memory/hugepages can have the same problem.
@ffromani Considering that we have information about core resources available in the Node object (node.status.capacity, node.status.allocatable), are you considering more detailed information being published through NodeCapabilities? I am interested in understanding how you envision node capabilities being useful with core resources. I would like to capture such requirements in the Node Capabilities proposal that I am working on.
This approach requires no new user-facing APIs, and enables Pods that must run on tainted nodes, but do not actually need the specialized device (like management pods) to be configured with the appropriate tolerations, explicitly.
Does the third-party DaemonSet need to set tolerations for all well-known DRA "plugins"? Or will there be universal tolerations for all DRA-implied taints?
This approach requires no new user-facing APIs, and enables Pods that must run on tainted nodes, but do not actually need the specialized device (like management pods) to be configured with the appropriate tolerations, explicitly.
Does the third-party DaemonSet need to set tolerations for all well-known DRA "plugins"? Or will there be universal tolerations for all DRA-implied taints?
DRA would not implement taints. Adding the taints is up to the cluster provider, just as it is today. Adding the scheduler config to tell the scheduler when to add implicit tolerations would also be up to the cluster provider.
So - no, the third-party DaemonSet would not need to know anything about this. That's kind of the point - let the cluster provider, who knows what taints they set and for what reason, manage the way to add the tolerations.
We may not do them all through CEL. For example, it would be easier to implement some things directly in Go and have a flag or policy field to control them. We'll have to sort that out in the KEP.
That's kind of the point - let the cluster provider, who knows what taints they set and for what reason, manage the way to add the tolerations.
Makes sense. So third-party DaemonSets need to know about all vendor-defined taints, as they do today.
PodSpec and all associated Resource Claims and DeviceClasses at scheduling time
So Devices will be associated with the Pod independently from taints? How does the scheduler work today? Will it check taints first and then allocate devices? In that case, you will need to ignore all taints first, try to allocate devices, then re-check which of the ignored taints were fine to ignore. Or will we list all possible devices that can be allocated, along with the associated taints, and then see which combination results in all taints being ignored by this CEL rule? So will this be a filter inside the DRA scheduler?
Makes sense. So third-party DaemonSets need to know about all vendor-defined taints, as they do today.
I am not sure what you mean? If a DaemonSet needs a GPU, for example, it won't need to know about the taint. But if there are just random taints stuck on a node, then, yes, the cluster admin will need to tolerate that taint if they want the DaemonSet to run on that node.
So Devices will be associated with the Pod independently from taints? How does the scheduler work today? Will it check taints first and then allocate devices? In that case, you will need to ignore all taints first, try to allocate devices, then re-check which of the ignored taints were fine to ignore. Or will we list all possible devices that can be allocated, along with the associated taints, and then see which combination results in all taints being ignored by this CEL rule? So will this be a filter inside the DRA scheduler?
So, here's an example, maybe that will help. Consider a platform where all nodes containing GPUs are tainted with a "has-gpu, NoSched" taint. The platform admin would configure the scheduler plugin with the following extra rules:
- If the extended resources contain a GPU request, implicitly tolerate "has-gpu, NoSched"
- If the Pod references a ResourceClaim with a DeviceClass that contains a GPU, implicitly tolerate "has-gpu, NoSched".
So, rule 2) is not something that would be easy (or perhaps even possible) to express in CEL. Instead, I think we need some Go-based rule/policy that the admin could leverage. We need to sort that out in the KEP. One thing I could imagine is a rule that says "if this example device is part of any referenced DeviceClass, implicitly tolerate the taint".
In other words, I imagine the API to be a bit more than what I showed before. Maybe more like:
type ImplicitTolerationRule struct {
	// Selector determines whether this rule applies to a given pod.
	Selector RuleSelector
	// Toleration applied when the selector matches.
	Toleration corev1.Toleration
}

type RuleSelector struct {
	Type string

	// for Type == 'ExtendedResource': extended resource names to match
	ResourceNames []string

	// for Type == 'Device': example devices to match against referenced DeviceClasses
	DevicePrototypes []resourcev1.Device

	// for Type == 'CEL': CEL expression evaluated against the PodSpec,
	// ResourceClaims, and DeviceClasses
	Expression *string
}
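For illustration, here is what the two GPU rules from the example above might look like as instances of this sketch (assuming corev1 is k8s.io/api/core/v1 and resourcev1 is the resource.k8s.io API group; all names and values are hypothetical):

// Rule 1): tolerate "has-gpu" when the pod requests the extended resource.
var extendedResourceRule = ImplicitTolerationRule{
	Selector: RuleSelector{
		Type:          "ExtendedResource",
		ResourceNames: []string{"example.com/gpu"},
	},
	Toleration: corev1.Toleration{
		Key:      "has-gpu",
		Operator: corev1.TolerationOpExists,
		Effect:   corev1.TaintEffectNoSchedule,
	},
}

// Rule 2): tolerate "has-gpu" when a device like this prototype is part of
// any DeviceClass referenced by the pod's ResourceClaims.
var deviceRule = ImplicitTolerationRule{
	Selector: RuleSelector{
		Type: "Device",
		DevicePrototypes: []resourcev1.Device{
			{Name: "example-gpu"},
		},
	},
	Toleration: corev1.Toleration{
		Key:      "has-gpu",
		Operator: corev1.TolerationOpExists,
		Effect:   corev1.TaintEffectNoSchedule,
	},
}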
/cc
/assign
/assign
cc @ajaysundark
First of all, I'm not a big fan of doing things implicitly in the Kubernetes world. I'm worried that it runs counter to declarative resource management, Kubernetes's core philosophy. How would users or other components (e.g., the cluster autoscaler, Karpenter, the descheduler, or some custom solutions) easily tell that the taints are ignored for those pods? Do they all have to know the current scheduler configuration and compute which pods would ignore which taints implicitly?
A webhook does not have access to all the information it would need to add the tolerations at API admission time.
If you cannot determine which tolerations you need to add to the pods at the webhook, you can just gate the pods at the webhook, and then create another controller that reconciles pods, puts the necessary tolerations on them, and then un-gates the pods. In that way, we don't need to bring an implicit behavior to the scheduler, the scheduling cycles won't be bothered by the additional evaluation cost of CEL, and other components and humans can understand tolerations as they do today.
If you cannot determine which tolerations you need to add to the pods at the webhook, you can just gate the pods at the webhook, and then create another controller that reconciles pods, puts the necessary tolerations on them, and then un-gates the pods. In that way, we don't need to bring an implicit behavior to the scheduler, the scheduling cycles won't be bothered by the additional evaluation cost of CEL, and other components and humans can understand tolerations as they do today.
Yes, this is definitely our fallback position on how to handle this. It does increase latency and would require us to basically gate every pod that uses a ResourceClaim, so that's pretty heavy.
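For reference, a minimal sketch of that fallback, with a hypothetical gate name: a mutating webhook would add a scheduling gate to every Pod that references a ResourceClaim, and a controller would later add the derived tolerations and remove the gate.

package example

import (
	corev1 "k8s.io/api/core/v1"
)

// Hypothetical gate name used by the webhook/controller pair.
const implicitTolerationsGate = "example.com/pending-implicit-tolerations"

// addGate is what the admission webhook would do for every Pod that
// references a ResourceClaim: park it behind a scheduling gate.
func addGate(pod *corev1.Pod) {
	pod.Spec.SchedulingGates = append(pod.Spec.SchedulingGates,
		corev1.PodSchedulingGate{Name: implicitTolerationsGate})
}

// ungateWithTolerations is what the controller would do once it has resolved
// the Pod's ResourceClaims/DeviceClasses: add the derived tolerations and
// remove the gate so the scheduler can pick the Pod up.
func ungateWithTolerations(pod *corev1.Pod, derived []corev1.Toleration) {
	pod.Spec.Tolerations = append(pod.Spec.Tolerations, derived...)
	kept := pod.Spec.SchedulingGates[:0]
	for _, g := range pod.Spec.SchedulingGates {
		if g.Name != implicitTolerationsGate {
			kept = append(kept, g)
		}
	}
	pod.Spec.SchedulingGates = kept
}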
First of all, I'm not a big fan of doing things implicitly in the Kubernetes world. I'm worried that it runs counter to declarative resource management, Kubernetes's core philosophy. How would users or other components (e.g., the cluster autoscaler, Karpenter, the descheduler, or some custom solutions) easily tell that the taints are ignored for those pods? Do they all have to know the current scheduler configuration and compute which pods would ignore which taints implicitly?
Yes, that's a good point. But it's also true for any existing webhook or controller-based (gate) solution. This is one reason why it's a cluster admin configuration option, not something we do as part of default upstream behavior.
But it's also true for any existing webhook or controller-based (gate) solution.
Not true, right? Because existing webhooks or controllers would just put tolerations on the pods. Other components can work as they do today, just by looking at those tolerations. OTOH, with this proposal the scheduler would handle taints invisibly, as if the pods had tolerations. The cluster autoscaler etc. cannot know the taints will be ignored when those pods are scheduled unless we give them the scheduler configuration and change them to understand that implicitness.
Not true, right? Because existing webhooks or controllers would just put tolerations on the pods. Other components can work as they do today, just by looking at those tolerations.
Hmm, yes I suppose so. I was thinking from the point of view of "examining the manifest". But as the toleration would get surfaced in the API, other things would see it at that time.
OTOH, with this proposal the scheduler would handle taints invisibly, as if the pods had tolerations. The cluster autoscaler etc. cannot know the taints will be ignored when those pods are scheduled unless we give them the scheduler configuration and change them to understand that implicitness.
Good point. For the current way that cluster autoscaler works (pending pods), this is accurate. I am hoping we can move towards more of a proactive model than the current reactive model (cc @wojtek-t who is thinking about these things). But that's totally unrelated to the implicit tolerations idea.
It's a pretty big drawback to not surface those tolerations, I agree, for the reasons outlined.
So, my first preference is still the controller idea. (Whether somehow implementing it at the upstream k/k controller manager or just telling users to do it on their own is another story we can discuss.) That would be the simplest solution here.
But the alternative idea, which works if we really mind the e2e latency as you mentioned, is having a new extension point for when new pods enter the scheduler cache.
It would work like:
- Users put TolerationRules in the scheduler configuration (same as your proposal).
- Instead of evaluating them during scheduling cycles and handling them implicitly, we put the tolerations based on the rules when the scheduler notices the pod creation and inserts it into the scheduler cache. "Put" here means we actually write the tolerations through kube-apiserver. To avoid a negative perf impact, we can just make that API call asynchronously. The scheduler keeps pod data in the scheduler cache, so the scheduling cycles can still schedule with the tolerations even before the async API call has completed.
- The scheduling cycles and other components like CA work as they do today, evaluating tolerations on the pods. There will be a tiny window between when the pod is created and when the scheduler updates it with the toleration, though; I guess that won't be an issue?
So, compared to your initial proposal, essentially, it moves the rule evaluation from the scheduling cycles to when new pods enter the scheduler cache, and, in addition, it actually updates the pods with tolerations once the rules are evaluated, so that external components can see them just as they do today.
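A rough sketch of what that asynchronous update could look like, outside of the actual scheduler integration (client wiring, retries, and conflict handling are omitted; none of this is an existing scheduler API):

package example

import (
	"context"
	"encoding/json"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// applyTolerationsAsync updates the in-memory Pod immediately (so the next
// scheduling cycle already sees the tolerations) and reflects the change to
// kube-apiserver in the background via a strategic-merge patch.
func applyTolerationsAsync(ctx context.Context, client kubernetes.Interface, pod *corev1.Pod, derived []corev1.Toleration) error {
	pod.Spec.Tolerations = append(pod.Spec.Tolerations, derived...)

	// The patch sends the full (merged) toleration list for this pod.
	patch, err := json.Marshal(map[string]interface{}{
		"spec": map[string]interface{}{"tolerations": pod.Spec.Tolerations},
	})
	if err != nil {
		return err
	}
	go func() {
		// Retry/requeue handling is omitted in this sketch; errors are ignored here.
		_, _ = client.CoreV1().Pods(pod.Namespace).Patch(ctx, pod.Name,
			types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	}()
	return nil
}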
So, those two are my proposal.
/cc @kubernetes/sig-scheduling-leads
So, compared to your initial proposal, essentially, it moves the rule evaluation from the scheduling cycles to when new pods enter the scheduler cache, and, in addition, it actually updates the pods with tolerations once the rules are evaluated, so that external components can see them just as they do today.
Ok, that makes sense. It's sort of "automatic tolerations" rather than "implicit tolerations", and they are made explicit.
Here's a question. At the time it's added to the scheduler cache, is it guaranteed that the resource claim and device classes would exist? When does the scheduler decide "I have all the information I need to do the scheduling"? It's at that time that we would need to create the tolerations.
At the time it's added to the scheduler cache, is it guaranteed that the resource claim and device classes would exist?
Good point, it's not guaranteed, because "when added to the scheduling cache" just equals "when pods are created". No further checks against other resources happen there.
When does the scheduler decide "I have all the information I need to do the scheduling"? It's at that time that we would need to create the tolerations.
"when added to the scheduling cache/queue" and "when the scheduling cycles start to handle the pod" are different, because of PreEnqueue extension point. i.e., until all PreEnqueue plugins say OK, the scheduling cycles never attempt to schedule the pods.
So, we can park the pods in the queue by returning Unschedulable at the PreEnqueue extension point. Maybe, instead of having a new extension point at cache-insertion time, we can just use the PreEnqueue extension point. A rough idea is:
- A new PreEnqueue plugin (or a new PreEnqueue() func added to the taint_toleration plugin) is inserted as the last PreEnqueue plugin. Before this new PreEnqueue(), the DRA PreEnqueue() makes sure that all ResourceClaims, DeviceClasses, etc. necessary for the pod exist. i.e., the new PreEnqueue() is triggered only when the DRA PreEnqueue() has made sure all those things are ready.
- A new PreEnqueue() evaluates TolerationRules and puts the tolerations on the pod. The update to the pod is asynchronously reflected to kube-apiserver. (Also, again, given that the scheduling cycles refer to the pod data from the scheduling queue/cache, they can perform scheduling with those new tolerations even if the async API call hasn't completed yet.)
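A very rough sketch of such a PreEnqueue(), assuming the in-tree scheduler framework interface shape and a hypothetical local tolerationRule type and helpers; whether the queued Pod may be mutated in place is exactly the kind of detail the KEP would have to settle:

package example

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// tolerationRule mirrors the (hypothetical) TolerationRules from the scheduler config.
type tolerationRule struct {
	Expression string
	Toleration v1.Toleration
}

// implicitTolerations is a hypothetical PreEnqueue plugin.
type implicitTolerations struct {
	rules      []tolerationRule
	applyAsync func(pod *v1.Pod, derived []v1.Toleration) // persists tolerations to kube-apiserver in the background
}

func (pl *implicitTolerations) Name() string { return "ImplicitTolerations" }

// PreEnqueue is assumed to run after the DRA PreEnqueue has confirmed that all
// referenced ResourceClaims/DeviceClasses exist. It evaluates the rules, adds
// the resulting tolerations to the pod, and lets the pod enter the active queue.
func (pl *implicitTolerations) PreEnqueue(ctx context.Context, pod *v1.Pod) *framework.Status {
	derived := evaluateRules(pl.rules, pod)
	if len(derived) > 0 {
		pod.Spec.Tolerations = append(pod.Spec.Tolerations, derived...)
		pl.applyAsync(pod, derived)
	}
	return nil // a nil status means the pod may be enqueued
}

// evaluateRules is a placeholder for evaluating the rules against the pod and
// its associated ResourceClaims/DeviceClasses; the CEL/Go evaluation is elided.
func evaluateRules(rules []tolerationRule, pod *v1.Pod) []v1.Toleration {
	_ = rules
	_ = pod
	return nil
}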
That's getting a bit complex... but I believe it should work, at least technically. Again, from the scheduler's perspective, the controller idea is the simplest (and I still prefer that one)... We should compare those two options based on pros vs cons, and discuss which we should go with, or explore other options from there.
Also, one concern I just came up with is the burden on kube-apiserver. Depending on the rules that users give to the scheduler config, the scheduler could make up to one additional API call per new Pod (i.e., when the rules match all pods). We'd need to discuss with sig-scalability whether those additional API calls from the scheduler are OK or not.
Also, JFYI though, https://github.com/kubernetes/enhancements/pull/5249 is related to this idea. We're actually discussing how we can make such asynchronous API calls from the scheduler. cc @macsko
The tolerations could change again later as ResourceClaims and/or DeviceClasses get updated. It can't be a one-time operation, but rather has to be a proper controller.
If it's intended to support mutable fields as well, then yes. Both the separate controller idea and the PreEnqueue idea should be able to handle the updates, though the number of necessary API calls could grow to more than one API call per new Pod.
ResourceClaim and DeviceClass don't even need to be mutable. They could get replaced.
From what I see we have a few different attempts to bypass the problem of api-server latency overhead and lack of atomicity. Similar ones are KEP-5055: DRA: admin-controlled attributes and Node Readiness Gates (see my comment).
We also try to bypass this problem in scheduler with KEP-5229: Asynchronous API calls during scheduling. We definitely should avoid solving this problem paying the cost of introducing inconsistencies.
However, I don't see baking in implicit logic as something that is completely wrong, as long as it is deterministic and all components can understand it, so that we can achieve eventual consistency.
Implicit logic is indeed fragile to getting out of sync on updates (version skew) or when some components are not compatible with it. I think we should rather put that logic at least into some library (to ensure consistency), rather than into the scheduler only.
Edited:
Previously I skimmed the proposal too fast, but I still think we could allow defining the implicit rules similarly to how it's proposed in KEP-5055: DRA: admin-controlled attributes, and define ImplicitTolerationRule as a top-level object (instead of being part of the scheduler config).
However, shouldn't we put the implicit toleration on a ResourceClaim instead of a pod? Similarly, shouldn't GPU taints be on ResourceSlices (devices) instead of nodes? ~I know that using extended resources and DRA resources at the same time complicates things, but I think that putting things this way should simplify this proposal.~ This way it would be clear which ResourceClaim needs tolerations, and it could even be applied by the resource claim controller (or an api-server webhook when the ResourceClaim is created), so there would be no need to involve the scheduler.
Perhaps host-network pods could implicitly tolerate network-unavailable.
From the linked discussion:
Taints and tolerations were originally intended more as an administrative function. We have been using them, along with node labels, as ways to guide scheduling when there is relevant scheduling information that is not actually known to the scheduler.
The fact that taints are used for both "this node is not ready" (eg, node.cloudprovider.kubernetes.io/uninitialized) and "this node is reserved for a special purpose" (eg, node-role.kubernetes.io/master, "has-gpu") is annoying. You can't really reliably say "DNS pods should get scheduled to all nodes (even special-purpose nodes)", because if you want to tolerate arbitrary unknown special purposes, you'd need to tolerate all taints, but then you'd also be tolerating unschedulable and network-unavailable and memory-pressure and various other things you didn't want.
So maybe this KEP is an opportunity to think about a larger change to taints/tolerations?
So maybe this KEP is an opportunity to think about a larger change to taints/tolerations?
I think this in combination with the discussions around node capabilities and readiness are all converging around whether the existing taints and tolerations feature is adequate for our existing needs, or if we need something similar, but not exactly the same.
Personally, I think there is a need for some sort of automated tolerations & discovered taints which do not "pollute" the administratively assigned controls.
With existing taints/tolerations, users have to be aware of which taints represent which characteristics. Taints show characteristics of the nodes: not-ready is one characteristic, having-expensive-gpu is another. And, on the pod side, end users have to explicitly declare, with tolerations, that a pod wants to be (or can be) scheduled to nodes with that kind of characteristic. I think the point that this issue tries to highlight is shifting the responsibility to set tolerations from end users to the cluster admin. Right now, users have to be aware of "this pod needs to use expensive GPUs, and I have to put this toleration", but after this KEP, the cluster admin can say "these kinds of pods need to use expensive GPUs, and will automatically get this toleration", and end users only need to declare that the pods need to use GPUs via the resource request or ResourceClaim. i.e., end users don't have to be aware of which kinds of nodes they have to explicitly request with tolerations; instead, the cluster admin uses this feature to detect those automatically.