KEP: NodeFeatureGroup API (CRD)

Open ArangoGutierrez opened this issue 2 years ago • 28 comments

Summary

The Kubernetes cluster object doesn't expose all available features in a programmatic way.

When working in a multi-cluster environment (for example kcp or hypershift), the central control plane cannot access all the available features on each cluster, which makes scheduling and management decisions hard.

Node-Feature-Discovery does a good job of exposing a per-node feature inventory, but querying each cluster on a per-node basis can be a network-intensive task. Various use cases have been identified where having a cluster inventory would facilitate operations at the cluster management level.

This KEP proposes that NFD expose an inventory of the features available in the cluster via a new API (CRD).

Goals

  • make the information about all clusters easy to query via a centralised API
  • expose cluster wide features currently not reported by NFD in a per node scenario, e.g Network config

Non-Goals

  • change existing behaviour at the node level
  • To be a MultiCluster management tool, this API is to expose NFD discovered features via a single API (CRD)

Proposal

User Stories

Story 1

As a platform engineer, I want to know the available features on each cluster registered on my network so that I can make optimal, platform-specific scheduling decisions.

Story 2

As a System-Admin I want a single API to know the available features of each cluster on the network.

resource allocations.

CRD API

// NodeFeatureGroup resource holds the features discovered for all nodes in a
// cluster.
// +kubebuilder:object:root=true
// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object
// +genclient
type NodeFeatureGroup struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec NodeFeatureGroupSpec `json:"spec"`
}

// NodeFeatureGroupSpec describes a NodeFeatureGroup object.
type NodeFeatureGroupSpec struct {
	// FeatureGroup is a set of grouped objects by specific features
	// +optional
	FeatureGroup []FeatureGroupSpec `json:"featureGroup,omitempty"`
	// ClusterFeatures is the set of cluster wide features that are not reported at the node level.
	// +optional
	ClusterFeatures []ClusterFeatures `json:"clusterFeatures,omitempty"`
}
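
The struct above references FeatureGroupSpec and ClusterFeatures without defining them. As a rough illustration only, here is a self-contained sketch of what a populated object might look like. The field layouts of FeatureGroupSpec and ClusterFeature, the flat Name field standing in for metav1.ObjectMeta, and the "network.cni" entry are all assumptions made for this example and are not taken from the proposal.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// FeatureGroupSpec is a hypothetical shape: the issue names this type
// but does not define its fields.
type FeatureGroupSpec struct {
	// Name is the feature the nodes are grouped by.
	Name string `json:"name"`
	// Nodes lists the nodes that expose the feature.
	Nodes []string `json:"nodes,omitempty"`
}

// ClusterFeature is a hypothetical cluster-wide key/value feature.
type ClusterFeature struct {
	Name  string `json:"name"`
	Value string `json:"value"`
}

// NodeFeatureGroupSpec mirrors the spec from the issue, using the
// assumed types above.
type NodeFeatureGroupSpec struct {
	FeatureGroup    []FeatureGroupSpec `json:"featureGroup,omitempty"`
	ClusterFeatures []ClusterFeature   `json:"clusterFeatures,omitempty"`
}

// NodeFeatureGroup is a simplified stand-in that replaces
// metav1.TypeMeta/ObjectMeta with a plain Name so the sketch needs
// only the standard library.
type NodeFeatureGroup struct {
	Name string               `json:"name"`
	Spec NodeFeatureGroupSpec `json:"spec"`
}

// exampleInventory builds a small, made-up cluster inventory.
func exampleInventory() NodeFeatureGroup {
	return NodeFeatureGroup{
		Name: "cluster-inventory",
		Spec: NodeFeatureGroupSpec{
			FeatureGroup: []FeatureGroupSpec{
				{Name: "cpu-cpuid.AVX512F", Nodes: []string{"node-a", "node-b"}},
			},
			ClusterFeatures: []ClusterFeature{
				{Name: "network.cni", Value: "example-cni"}, // hypothetical entry
			},
		},
	}
}

func main() {
	out, err := json.MarshalIndent(exampleInventory(), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

A consumer would then read one object per cluster instead of listing every node.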

ArangoGutierrez avatar Oct 20 '23 10:10 ArangoGutierrez

/assign

ArangoGutierrez avatar Oct 20 '23 10:10 ArangoGutierrez

@ArangoGutierrez Do you mind clarifying "make the information about all clusters easy to query via a centralised API"?

I am trying to understand what it is you are looking for. There are multiple multi-cluster management tools already available. Is this a feature request, or do you have a specific product/project in mind that you want to extend?

zanetworker avatar Oct 20 '23 14:10 zanetworker

> @ArangoGutierrez Do you mind clarifying "make the information about all clusters easy to query via a centralised API"?
>
> I am trying to understand what it is you are looking for. There are multiple multi-cluster management tools already available. Is this a feature request, or do you have a specific product/project in mind that you want to extend?

Sure, as I said:

> Node-Feature-Discovery does a good job of exposing a per-node feature inventory, but querying each cluster on a per-node basis can be a network-intensive task

When I refer to a single API, I mean that instead of having to query all the nodes for the created labels (NFD labels), the new ClusterFeature CRD will be an aggregator, so application developers can have a controller watch for events on a single CRD and be informed when the cluster gets a new node, along with the features of said node. This is a new CRD exposed by NFD to group discovered features; it is by no means Yet-Another-Multicluster-management-tool.
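
To make the aggregator idea concrete, here is a minimal sketch of the core step: folding per-node NFD labels into one cluster-wide map from feature to node names. This is not NFD's implementation. The aggregate function and its input shape are assumptions for illustration, and the sketch only counts labels under the feature.node.kubernetes.io/ prefix whose value is "true", ignoring labels that carry other values.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// nfdPrefix is the label namespace NFD uses for discovered features.
const nfdPrefix = "feature.node.kubernetes.io/"

// aggregate (hypothetical helper) turns node -> labels into
// feature -> sorted node names. Only boolean "true" NFD labels are
// counted; everything else is skipped.
func aggregate(nodeLabels map[string]map[string]string) map[string][]string {
	inventory := make(map[string][]string)
	for node, labels := range nodeLabels {
		for key, value := range labels {
			if !strings.HasPrefix(key, nfdPrefix) || value != "true" {
				continue
			}
			feature := strings.TrimPrefix(key, nfdPrefix)
			inventory[feature] = append(inventory[feature], node)
		}
	}
	// Map iteration order is random, so sort for stable output.
	for _, nodes := range inventory {
		sort.Strings(nodes)
	}
	return inventory
}

func main() {
	nodes := map[string]map[string]string{
		"node-a": {
			nfdPrefix + "cpu-cpuid.AVX2": "true",
			"kubernetes.io/os":           "linux",
		},
		"node-b": {
			nfdPrefix + "cpu-cpuid.AVX2":    "true",
			nfdPrefix + "cpu-cpuid.AVX512F": "true",
		},
	}
	inventory := aggregate(nodes)

	features := make([]string, 0, len(inventory))
	for f := range inventory {
		features = append(features, f)
	}
	sort.Strings(features)
	for _, f := range features {
		fmt.Printf("%s: %v\n", f, inventory[f])
	}
}
```

In a real controller the input would come from a node informer, and the result would be written to the single CRD that consumers watch.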

ArangoGutierrez avatar Oct 20 '23 14:10 ArangoGutierrez

would there be a way that this work intersects with the open-cluster-management project? https://open-cluster-management.io/

berenss avatar Oct 20 '23 15:10 berenss

> would there be a way that this work intersects with the open-cluster-management project? https://open-cluster-management.io/

Hey! No, it won't. NFD is not a cluster management tool; our aim is to provide an easy and programmatic way to expose all features via CRDs / labels / annotations, so developers/users can act on them. The ClusterFeature CRD will basically be a cluster-wide inventory of available resources. NFD-discovered resources are extra to the ones advertised to the Kubelet. We want to be able to host a cluster-wide inventory of specific features, like GPU type or specific CPU features like TDX or SMP; these are not exposed by default Kubernetes tools.

ArangoGutierrez avatar Oct 20 '23 15:10 ArangoGutierrez

right! let me get more specific, within o-c-m project, the placement would intersect quite nicely with NFD to allow workloads to land upon nodes with specific features. in other words, the scaffolding is already there for NFD to become a first class provider into o-c-m's placement. perhaps I need to work this connection from the other side and introduce o-c-m to NFD https://open-cluster-management.io/scenarios/distribute-workload-with-placement/

berenss avatar Oct 20 '23 15:10 berenss

I also got a bit of a heads-up from an engineer that works in o-c-m, and he shared some additional insights

It sounds kind of similar to the Cluster Inventory project that @qiujian16 has been pushing for. It was presented in the sig mc for a few rounds and finally got the go ahead from the sig mc chairs.

  • The repo: https://github.com/kubernetes-sigs/cluster-inventory-api
  • @qiujian16's KEP in his personal repo for now: https://github.com/qiujian16/k8s-enhancements/tree/cluster-inventory/keps/sig-multicluster/cluster-inventory
  • SIG-MC KEP draft presentation: https://docs.google.com/document/d/1sUWbe81BTclQ4Uax3flnCoKtEWngH-JA9MyCqljJCBM/

berenss avatar Oct 20 '23 16:10 berenss

> right! let me get more specific, within o-c-m project, the placement would intersect quite nicely with NFD to allow workloads to land upon nodes with specific features. in other words, the scaffolding is already there for NFD to become a first class provider into o-c-m's placement. perhaps I need to work this connection from the other side and introduce o-c-m to NFD https://open-cluster-management.io/scenarios/distribute-workload-with-placement/

Hey! We would love to help introduce o-c-m to NFD. @marquiz (cc) and I are always looking to help with NFD as much as we can.

ArangoGutierrez avatar Oct 20 '23 16:10 ArangoGutierrez

To my understanding, this is to collect features in a cluster and have a singleton API in this cluster to summarize all the features from nodes? This is a bit different from cluster-inventory-api, since the latter requires a cluster management control plane. However, I think there is another, seemingly similar project from sig-mc (https://github.com/kubernetes-sigs/about-api) that introduces a ClusterProperty API to expose arbitrary properties of the cluster.

qiujian16 avatar Oct 23 '23 02:10 qiujian16

A lot of action in this space. A few random thoughts from an uneducated person:

  • The O-C-M looks cool. Could be nice in providing a centralized place to store/query cluster info in the hub cluster
  • The about-api looks very sketchy, with basically only one property (string value) per API object. For our purposes we'd need to enhance/extend the API.
  • The cluster-inventory API could be used with O-C-M(?)

marquiz avatar Oct 23 '23 09:10 marquiz

I was tagged in slack about this :) @mwielgus do you think there could be any implications/simplifications here for Multi cluster Kueue?

alculquicondor avatar Oct 23 '23 15:10 alculquicondor

> A lot of action in this space. A few random thoughts from an uneducated person:
>
>   • The cluster-inventory API could be used with O-C-M(?)

Yes, that is the plan.

qiujian16 avatar Oct 24 '23 01:10 qiujian16

@ArangoGutierrez I think I understand the change now, and I agree this would be great for Fluence. Do you want any help?

vsoch avatar Oct 25 '23 19:10 vsoch

Is this KEP actively being worked on?

If not, happy to get the ball rolling by creating a first draft.

Sharpz7 avatar Oct 26 '23 14:10 Sharpz7

yes, this week we are just getting attention from the community, before working on it

ArangoGutierrez avatar Oct 26 '23 16:10 ArangoGutierrez

@ArangoGutierrez does this KEP concern only the creation of a ClusterFeature CR in the cluster, or will it also try to integrate with cluster management by extending, for example, ClusterClaim?

yevgeny-shnaidman avatar Oct 29 '23 13:10 yevgeny-shnaidman

Hey! 👋🏻 I'm curious about what cluster features would be exposed exactly. kcp (which is mentioned in the initial description and was notified of this) specifically does not deal with compute workloads directly, so aggregation of NFD-discovered features would not relate to it. Because of that, I'm mostly interested in this part:

> expose cluster wide features currently not reported by NFD in a per node scenario, e.g Network config

Are there any clear ideas of what examples there could be beyond network config?

embik avatar Oct 29 '23 15:10 embik

All, me here to lead, what is this intersection NFD y'all are talking about, I assume it could reduce k8 costs

What do I need to study to contribute

GeoEducator avatar Oct 30 '23 09:10 GeoEducator

> Non-Goals
>
>   • change existing behaviour at the node level
>   • To be a MultiCluster management tool, this API is to expose NFD discovered features via a single API (CRD)

Hi @yevgeny-shnaidman, it is a non-goal to walk into cluster management territory.

ArangoGutierrez avatar Oct 30 '23 10:10 ArangoGutierrez

> Hey! 👋🏻 I'm curious about what cluster features would be exposed exactly. kcp (which is mentioned in the initial description and was notified of this) specifically does not deal with compute workloads directly, so aggregation of NFD-discovered features would not relate to it. Because of that, I'm mostly interested in this part:
>
> > expose cluster wide features currently not reported by NFD in a per node scenario, e.g Network config
>
> Are there any clear ideas of what examples there could be beyond network config?

Hey @embik! We are gathering requests from multiple places. NFD is a per-node feature discovery solution; to help address the needs of multi-cluster environments, this ClusterFeature/ClusterInventory API must/should also disclose things that are at the cluster level. So far we have heard many requests for cluster-wide network config/features/capabilities and, in the long term, potentially topology (for MPI users). There are other ideas like storage, cluster health, etc. If you have an idea, please feel free to share!

ArangoGutierrez avatar Oct 30 '23 11:10 ArangoGutierrez

@ArangoGutierrez are we going to add something like NodeFeatureRules for the new CRD? I am guessing that it could come in useful to determine whether a cluster supports GPU loads, etc.

yevgeny-shnaidman avatar Oct 31 '23 10:10 yevgeny-shnaidman

> @ArangoGutierrez are we going to add something like NodeFeatureRules for the new CRD? I am guessing that it could come in useful to determine whether a cluster supports GPU loads, etc.

ClusterFeatureRules could be an addition. Initially we aim for a CRD like NodeFeature but at the cluster level; later on we could add ways of modifying it, like the rules you mention.

ArangoGutierrez avatar Oct 31 '23 10:10 ArangoGutierrez

Yes, maybe this could be a further addition/enhancement if some rule-based aggregation of features would be needed
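
As a rough sketch of what such rule-based aggregation could look like: each rule below marks a cluster-level capability as present when at least a minimum number of nodes expose a given feature. ClusterFeatureRule does not exist in NFD; its fields, the evaluate helper, and the "gpu-capable" rule name are hypothetical and only illustrate the idea discussed above.

```go
package main

import "fmt"

// ClusterFeatureRule is a hypothetical rule: the cluster-level feature
// Name is considered present when at least MinNodes nodes expose the
// node-level feature Feature.
type ClusterFeatureRule struct {
	Name     string
	Feature  string
	MinNodes int
}

// evaluate applies every rule to an inventory of feature -> node names
// and reports which cluster-level features are satisfied. A rule with
// MinNodes <= 0 is always satisfied.
func evaluate(rules []ClusterFeatureRule, inventory map[string][]string) map[string]bool {
	result := make(map[string]bool, len(rules))
	for _, r := range rules {
		result[r.Name] = len(inventory[r.Feature]) >= r.MinNodes
	}
	return result
}

func main() {
	// Example inventory; the PCI label follows NFD's default
	// pci-<class>_<vendor>.present naming.
	inventory := map[string][]string{
		"pci-0300_10de.present": {"node-a", "node-b"},
	}
	rules := []ClusterFeatureRule{
		{Name: "gpu-capable", Feature: "pci-0300_10de.present", MinNodes: 1},
	}
	fmt.Println(evaluate(rules, inventory))
}
```

A richer design could reuse the matching expressions of NodeFeatureRule instead of a plain node count.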

marquiz avatar Oct 31 '23 13:10 marquiz

I'm trying to understand what available features this ClusterFeature provides; an example would be great in addition to the API.

RainbowMango avatar Nov 02 '23 02:11 RainbowMango

> I'm trying to understand what available features this ClusterFeature provides; an example would be great in addition to the API.

Hey, sure! You can find all the feature sources NFD discovers and advertises on a per-node basis here -> https://kubernetes-sigs.github.io/node-feature-discovery/v0.14/usage/features.html#table-of-contents

ArangoGutierrez avatar Nov 02 '23 09:11 ArangoGutierrez

For all those interested I have filed https://github.com/kubernetes-sigs/node-feature-discovery/pull/1487

ArangoGutierrez avatar Dec 01 '23 17:12 ArangoGutierrez

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Feb 29 '24 18:02 k8s-triage-robot

/remove-lifecycle stale

ArangoGutierrez avatar Mar 04 '24 10:03 ArangoGutierrez