node-feature-discovery
feature to allow optionally setting taints based on node properties
What would you like to be added:
It would be nice if NFD could be configured with options to set node taints as well as labels, based on certain features of nodes. Would you consider that in scope of NFD?
Why is this needed: Cluster operators may wish to automatically taint nodes with certain features, for example tainting a node that has GPUs to prevent other pods from running on it if they don't actually need (tolerate) the GPUs.
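For illustration, this is the kind of taint an operator applies by hand today and that NFD could set automatically when it detects the feature; the node name and taint key below are made-up examples:

```yaml
# Hypothetical example: a GPU node tainted so that only workloads which
# tolerate the taint (i.e. actually need the GPU) are scheduled onto it.
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-01            # example node name
spec:
  taints:
    - key: example.com/gpu     # hypothetical taint key
      value: "true"
      effect: NoSchedule       # repel pods that do not tolerate the taint
```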
Hi @rptaylor. Yes, this would be useful, and I've been thinking about this myself as part of the work I've done on #464 and particularly #468, which are still very much at the prototype level.
I've done some initial experiments and started to wonder whether it should be possible to taint only some of the nodes (a configurable proportion). WDYT? This complicates the implementation quite a bit, though, so maybe that would be a future enhancement.
Okay, nice @marquiz. It makes sense to me that NFD could have the flexibility and generality to apply arbitrary properties (taints as well as labels) based on the features of nodes.
What would be the use case to only taint a portion of nodes with a given feature and configuration?
> What would be the use case to only taint a portion of nodes with a given feature and configuration?
Reserving some of the nodes for general usage or alternatively reserving only a fraction of the nodes for special workloads. Dunno if that is useful in practice 🤔
Tainting only a subset of nodes in a cluster makes some sense for extended resources. If a Pod allocates an extended resource, the ExtendedResourceAdmissionWebhook will automatically add a toleration for the extended-resource taint.
But we need to be careful about when we apply this taint. In hardware-enablement cases the taint is usually added at the point the extended resources are exposed.
It also depends on how you want to partition your cluster. We used taints and tolerations for "hard partitioning", meaning no workloads are allowed that do not tolerate the taint; the taint repels them.
Alternatively, "soft partitioning", e.g. with priority classes, allows mixed workloads while special workloads get higher priority.
Another use case would be behavioural partitioning: say you have one cluster and want to run an AI/ML pipeline. One could imagine tainting some nodes as inference and others as training or data-lake, resembling a pipeline within one cluster rather than having several clusters, each for one specific feature.
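As a rough sketch of the two partitioning styles described above, under made-up names (the taint key, image, and priority-class name are only illustrative):

```yaml
# Hard partitioning: only pods that tolerate the taint may land on the node.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  containers:
    - name: cuda-app
      image: example.com/cuda-app:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1                # extended resource request
  tolerations:
    - key: example.com/gpu                 # hypothetical taint key
      operator: Exists
      effect: NoSchedule
---
# Soft partitioning: mixed workloads are allowed, but special workloads
# get a higher priority and can preempt others when resources are scarce.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: special-workloads                  # example name
value: 100000
globalDefault: false
description: "Higher priority for special (e.g. GPU or training) workloads"
```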
If the extended resources are equivalent on a number of nodes, making the nodes fungible, it doesn't make sense to me to divide them into separate hard partitions. In a traditional batch system, partitioning creates significant challenges in practice, especially at large scales; this would instead be handled by fair-share scheduling, which is a big missing feature of Kubernetes (I think Volcano may have it). PriorityClasses are not enough.
There is a fundamental trade-off in scheduling theory between latency and throughput: partitioning inevitably reduces usage efficiency (throughput) but can improve latency, since reserved nodes are available right away. That has to be balanced against the risk of filling up your partition, and against the probably larger benefit of being able to use other partitions when they are available, which you would get if all nodes were in one shared pool instead.
Even with a relatively steady-state workload (as opposed to a dynamic, bursty one), wouldn't it be better to use resource quotas for each app (inference/training/etc.) as a floating reservation across any available node, rather than locking certain apps to a specific subset of nodes? Anyway, my perspective is from a scientific HPC background; other situations could have totally different needs and considerations that I am not familiar with. Best to build a tool that provides sufficient options so anyone can use it however they need :)
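For reference, the quota-based "floating reservation" suggested above might look something like this; the namespace and the GPU count are purely illustrative:

```yaml
# Illustrative ResourceQuota: caps GPU consumption per namespace (one per
# app/team) without pinning the workloads to a fixed subset of nodes.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: training-gpu-quota         # example name
  namespace: training              # example namespace
spec:
  hard:
    requests.nvidia.com/gpu: "8"   # at most 8 GPUs in use by this namespace
```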
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
I have plans to implement this on top of #553.
@rptaylor I think I agree with you above. Partial/proportional tainting is much more complicated with problematic corner cases (e.g. with cluster auto-scaling), not to mention the problems of optimal scheduling and resource usage you talked about above.
/remove-lifecycle stale
For consistency, I think we'd need to support this in the nfd-worker config (configuration of the custom source) as well. This means that we need to update our gRPC interface, too, to send the taints from worker to master. We also probably need to add an annotation for bookkeeping, similar to nfd.node.kubernetes.io/feature-labels and nfd.node.kubernetes.io/extended-resources.
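For context, the existing bookkeeping annotations record which labels and extended resources NFD manages on a node; the taints entry below is only a hypothetical name to illustrate the idea:

```yaml
# Node metadata excerpt. The first two annotations exist today; the third
# is a hypothetical bookkeeping annotation that would let nfd-master know
# which taints it owns (and may therefore remove later).
apiVersion: v1
kind: Node
metadata:
  name: worker-01                  # example node
  annotations:
    nfd.node.kubernetes.io/feature-labels: "cpu-cpuid.AVX2,kernel-version.major"
    nfd.node.kubernetes.io/extended-resources: "example.com/device"
    nfd.node.kubernetes.io/taints: "example.com/gpu"    # hypothetical
```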
Moving to v0.12.0
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
We still want this. It's not a huge deal in terms of implementation, but Somebody® just has to do it.
/remove-lifecycle stale
I'm interested in working on this.
/assign
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
/lifecycle active
This is being reviewed right now in https://github.com/kubernetes-sigs/node-feature-discovery/pull/910
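For readers following along, a rule that both labels and taints GPU nodes could look roughly like the sketch below; the label and taint keys are made up, and the exact schema is what is under review in that PR:

```yaml
# Sketch of a NodeFeatureRule with taints, assuming the taint entries follow
# the core v1 Taint shape (key/value/effect). Treat field details as tentative.
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: gpu-taint-rule                          # example name
spec:
  rules:
    - name: "taint nodes with an NVIDIA GPU"
      labels:
        "example.com/gpu": "true"               # hypothetical label key
      taints:
        - key: "example.com/gpu"                # hypothetical taint key
          value: "true"
          effect: NoSchedule
      matchFeatures:
        - feature: pci.device
          matchExpressions:
            vendor: {op: In, value: ["10de"]}   # NVIDIA PCI vendor ID
```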