
feature to allow optionally setting taints based on node properties

Open · rptaylor opened this issue 3 years ago • 12 comments

What would you like to be added:

It would be nice if NFD could be configured with options to set node taints as well as labels, based on certain features of nodes. Would you consider that in scope of NFD?

Why is this needed: Cluster operators may wish to automatically taint nodes with certain features, for example tainting a node that has GPUs to prevent other pods from running on it if they don't actually need (tolerate) the GPUs.
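
For illustration only (this is not an existing NFD option): a minimal client-go sketch of the kind of tainting an operator might want NFD to automate, selecting GPU nodes via an NFD PCI feature label. The taint key and label value are examples, not anything defined by NFD.

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (assumes out-of-cluster use).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Find nodes that NFD labeled as having an NVIDIA GPU. The exact label
	// depends on the NFD configuration; this is just an example selector.
	nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{
		LabelSelector: "feature.node.kubernetes.io/pci-0300_10de.present=true",
	})
	if err != nil {
		panic(err)
	}

	// Hypothetical taint key; workloads needing the GPU would tolerate it.
	taint := corev1.Taint{
		Key:    "example.com/gpu",
		Value:  "true",
		Effect: corev1.TaintEffectNoSchedule,
	}

	for i := range nodes.Items {
		node := &nodes.Items[i]
		// Kept short: no check for an already-present taint, so rerunning
		// this would duplicate it.
		node.Spec.Taints = append(node.Spec.Taints, taint)
		if _, err := client.CoreV1().Nodes().Update(context.TODO(), node, metav1.UpdateOptions{}); err != nil {
			panic(err)
		}
		fmt.Println("tainted", node.Name)
	}
}
```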

rptaylor avatar Jun 10 '21 02:06 rptaylor

Hi @rptaylor. Yes, this would be useful and I've been thinking about this myself as part of the work I've done on #464 and particularly #468, which are still very much at the prototype level.

I've done some initial experiments and started to wonder whether it should also be possible to taint only some of the nodes (a configurable proportion). WDYT? It complicates the implementation quite a bit, though, so maybe that would be a future enhancement.

marquiz avatar Jun 10 '21 04:06 marquiz

Okay, nice @marquiz. It makes sense to me that NFD could have the flexibility and generality to apply arbitrary properties (taints as well as labels) based on the features of nodes.

What would be the use case to only taint a portion of nodes with a given feature and configuration?

rptaylor avatar Jun 10 '21 19:06 rptaylor

What would be the use case to only taint a portion of nodes with a given feature and configuration?

Reserving some of the nodes for general usage or alternatively reserving only a fraction of the nodes for special workloads. Dunno if that is useful in practice 🤔

marquiz avatar Jun 11 '21 05:06 marquiz

Tainting only a subset of nodes in a cluster kinda makes sense for extended resources. If a Pod requests an extended resource, the ExtendedResourceAdmissionWebhook will automatically add a toleration for the extended resource taint.
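
For context, a rough Go sketch of that kind of admission logic (simplified, not the actual webhook code): every extended resource a Pod requests gets a matching NoSchedule toleration keyed on the resource name.

```go
package main

import (
	"fmt"
	"strings"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// addExtendedResourceTolerations mimics, roughly, what an extended-resource
// toleration admission step does: for every extended resource a Pod requests,
// add a toleration for a taint whose key is the resource name. Simplified;
// a real implementation also checks limits and the full naming rules.
func addExtendedResourceTolerations(pod *corev1.Pod) {
	seen := map[corev1.ResourceName]bool{}
	for _, c := range pod.Spec.Containers {
		for name := range c.Resources.Requests {
			// Crude "is this an extended resource" check: a namespaced name
			// outside the kubernetes.io domain.
			if !strings.Contains(string(name), "/") || strings.Contains(string(name), "kubernetes.io/") {
				continue
			}
			if seen[name] {
				continue
			}
			seen[name] = true
			pod.Spec.Tolerations = append(pod.Spec.Tolerations, corev1.Toleration{
				Key:      string(name),
				Operator: corev1.TolerationOpExists,
				Effect:   corev1.TaintEffectNoSchedule,
			})
		}
	}
}

func main() {
	pod := &corev1.Pod{
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name: "cuda",
				Resources: corev1.ResourceRequirements{
					Requests: corev1.ResourceList{
						"nvidia.com/gpu": resource.MustParse("1"),
					},
				},
			}},
		},
	}
	addExtendedResourceTolerations(pod)
	fmt.Printf("%+v\n", pod.Spec.Tolerations)
}
```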

But we need to be careful about when we apply this taint. In hardware enablement cases, the taint is usually added at the point where the extended resources are exposed.

It also depends on how you want to partition your cluster. We used taints and tolerations for "hard partitioning", meaning no workloads are allowed that do not tolerate the taint; the taint repels them.

Or using "soft-partitioning" e.g. with priority classes to have mixed workloads but special workloads could have higher priorities.

Another use case would be behavioural partitioning: say you have one cluster and want to run an AI/ML pipeline. One could imagine tainting some nodes for inference and others for training or data-lake work, resembling a pipeline within one cluster rather than having several clusters, each dedicated to one specific feature.

zvonkok avatar Jun 11 '21 07:06 zvonkok

If the extended resources are equivalent on a number of nodes, making those nodes fungible, it doesn't make sense to me to divide them into separate hard partitions. In a traditional batch system, partitioning creates significant challenges in practice, especially at large scales; this would instead be handled by fair-share scheduling, which is a big missing feature of Kubernetes (I think Volcano may have it). PriorityClasses are not enough.

It is a fundamental trade-off in scheduling theory between latency and throughput. Partitioning inevitably reduces usage efficiency (throughput) but can improve latency (nodes are reserved for you, so they are available right away). That has to be balanced against the risk of filling up your partition, and against the probably larger benefit of being able to use the other partitions when available, which you would get if all nodes were in one shared pool instead.

Even with a relatively steady-state workload (as opposed to a dynamic and bursty one), wouldn't it be better to use resource quotas for each app (inference/training/etc.) as a floating reservation across any available node, rather than locking certain apps to a specific subset of nodes? Anyway, my perspective is from a scientific HPC background; other situations could have totally different needs and considerations that I am not familiar with. Best to build a tool that provides sufficient options so anyone can use it however they need :)
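
As a sketch of that floating-reservation idea (the namespace, quota name, and resource are just examples), a per-namespace ResourceQuota capping an extended resource without pinning workloads to particular nodes could look like this:

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Cap how many GPUs the "training" namespace may request in total,
	// without tying its pods to any particular subset of nodes.
	quota := &corev1.ResourceQuota{
		ObjectMeta: metav1.ObjectMeta{Name: "gpu-quota", Namespace: "training"},
		Spec: corev1.ResourceQuotaSpec{
			Hard: corev1.ResourceList{
				"requests.nvidia.com/gpu": resource.MustParse("8"),
			},
		},
	}
	if _, err := client.CoreV1().ResourceQuotas("training").Create(context.TODO(), quota, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```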

rptaylor avatar Jun 11 '21 19:06 rptaylor

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Sep 09 '21 20:09 k8s-triage-robot

I have plans to implement this on top of #553.

@rptaylor I think I agree with you above. Partial/proportional tainting is much more complicated, with problematic corner cases (e.g. with cluster auto-scaling), not to mention the problems of optimal scheduling and resource usage you described above.

/remove-lifecycle stale

marquiz avatar Sep 10 '21 06:09 marquiz

For consistency, I think we'd need to support this in the nfd-worker config (configuration of the custom source), too. This means we also need to update our gRPC interface to send the taints from worker to master. We probably also need to add an annotation for bookkeeping, similar to nfd.node.kubernetes.io/feature-labels and nfd.node.kubernetes.io/extended-resources.
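
A rough sketch of what that bookkeeping could look like on the nfd-master side (the annotation name and helper function are hypothetical, not the implemented design): record the keys of NFD-managed taints in an annotation and patch the node's taints in one go.

```go
package taintsketch

import (
	"context"
	"encoding/json"
	"strings"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// Hypothetical bookkeeping annotation, analogous to
// nfd.node.kubernetes.io/feature-labels.
const taintBookkeepingAnnotation = "nfd.node.kubernetes.io/taints"

// setNFDTaints patches the node with the given taints and records their keys
// in an annotation. Simplified: a real implementation would also prune taints
// previously applied by NFD that are no longer wanted.
func setNFDTaints(ctx context.Context, c kubernetes.Interface, nodeName string, taints []corev1.Taint) error {
	node, err := c.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}

	// Drop any existing taint with a key we manage, then append the new set.
	managed := map[string]bool{}
	keys := make([]string, 0, len(taints))
	for _, t := range taints {
		managed[t.Key] = true
		keys = append(keys, t.Key)
	}
	newTaints := []corev1.Taint{}
	for _, t := range node.Spec.Taints {
		if !managed[t.Key] {
			newTaints = append(newTaints, t)
		}
	}
	newTaints = append(newTaints, taints...)

	// One strategic-merge patch updating both the annotation and the taints.
	patch := map[string]interface{}{
		"metadata": map[string]interface{}{
			"annotations": map[string]string{
				taintBookkeepingAnnotation: strings.Join(keys, ","),
			},
		},
		"spec": map[string]interface{}{
			"taints": newTaints,
		},
	}
	data, err := json.Marshal(patch)
	if err != nil {
		return err
	}
	_, err = c.CoreV1().Nodes().Patch(ctx, nodeName, types.StrategicMergePatchType, data, metav1.PatchOptions{})
	return err
}
```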

marquiz avatar Jan 14 '22 08:01 marquiz

Moving to v0.12.0

marquiz avatar Mar 17 '22 08:03 marquiz

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jun 15 '22 09:06 k8s-triage-robot

We still want this. Not a huge deal in terms of implementation but somebody® just has to do it

/remove-lifecycle stale

marquiz avatar Jul 08 '22 12:07 marquiz

I'm interested in working on this.

/assign

fmuyassarov avatar Aug 31 '22 08:08 fmuyassarov

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Nov 29 '22 09:11 k8s-triage-robot

/remove-lifecycle stale
/lifecycle active

fmuyassarov avatar Nov 29 '22 09:11 fmuyassarov

This is being reviewed right now in https://github.com/kubernetes-sigs/node-feature-discovery/pull/910.

fmuyassarov avatar Nov 29 '22 09:11 fmuyassarov