node-feature-discovery
Add support to expose CRI engine configuration via NFD
As a developer of Kata Containers, we'd like to rely on NFD to properly check whether containerd has the appropriate snapshotter set up on one (or more) specific nodes, so we can decide whether to enable some of the Kata Containers drivers there, and also provide the user an appropriate rule to schedule their workloads.
The reasoning behind this is detecting:
- devmapper snapshotter, for Firecracker
- nydus snapshotter, for any Confidential Containers workload
While I know that the preferred way to deploy Kata Containers would be just baking it into the node image, we know that users trying out Kata Containers usually rely on our daemon-set, and then get confused on why a specific driver (VMM) doesn't work, as specific drivers require specific snapshotters.
cc @zvonkok @mythi @marquiz
The way I'd like to see this exposed is something like:
`container-engine.containerd.snapshotter.devmapper` or `container-engine.containerd.snapshotter.nydus`.
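To illustrate how such labels might be consumed, here is a sketch of a pod spec selecting nodes on one of them. The label name follows the naming proposed above, the `feature.node.kubernetes.io/` prefix is NFD's default label namespace, and the `kata-fc` runtime class name is a hypothetical placeholder:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: kata-fc-workload
spec:
  # Hypothetical runtime class for Kata Containers with Firecracker.
  runtimeClassName: kata-fc
  nodeSelector:
    # Hypothetical label; NFD prefixes feature labels with
    # feature.node.kubernetes.io/ by default.
    feature.node.kubernetes.io/container-engine.containerd.snapshotter.devmapper: "true"
  containers:
  - name: app
    image: busybox
    command: ["sleep", "infinity"]
```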
So you are proposing a new feature source called container-engine, right?
We could probably put this under the system source.
Trying to understand how this would work (and the possible caveats and corner cases), we'd need to parse the containerd config, right(?) That should be usually readable by non-root. But the snapshotter can depend on the runtime class? Should we take that into account?
I understand this is a containerd-centric feature, but could we have a generic way to parse "container runtime" config files, so we are future-proof and can extend this feature request to {docker,podman,crio}?
> Trying to understand how this would work (and the possible caveats and corner cases), we'd need to parse the containerd config, right(?) That should be usually readable by non-root. But the snapshotter can depend on the runtime class? Should we take that into account?
The containerd configuration just tells whether a snapshotter is actually being used by a runtime class.
We want to know whether a snapshotter is in the system before we tie it to a runtime handler.
In a very hacky way, `ctr plugins ls` is what we want to check (and there's a way to import the containerd client package and do it without having to call the tool), and there we can expose the snapshotters.
Here's the output, for instance:
```
TYPE                               ID                     PLATFORMS    STATUS
io.containerd.content.v1           content                -            ok
io.containerd.snapshotter.v1       aufs                   linux/amd64  skip
io.containerd.snapshotter.v1       btrfs                  linux/amd64  ok
io.containerd.snapshotter.v1       devmapper              linux/amd64  error
io.containerd.snapshotter.v1       native                 linux/amd64  ok
io.containerd.snapshotter.v1       overlayfs              linux/amd64  ok
io.containerd.snapshotter.v1       zfs                    linux/amd64  skip
io.containerd.metadata.v1          bolt                   -            ok
io.containerd.differ.v1            walking                linux/amd64  ok
io.containerd.event.v1             exchange               -            ok
io.containerd.gc.v1                scheduler              -            ok
io.containerd.service.v1           introspection-service  -            ok
io.containerd.service.v1           containers-service     -            ok
io.containerd.service.v1           content-service        -            ok
io.containerd.service.v1           diff-service           -            ok
io.containerd.service.v1           images-service         -            ok
io.containerd.service.v1           leases-service         -            ok
io.containerd.service.v1           namespaces-service     -            ok
io.containerd.service.v1           snapshots-service      -            ok
io.containerd.runtime.v1           linux                  linux/amd64  ok
io.containerd.runtime.v2           task                   linux/amd64  ok
io.containerd.monitor.v1           cgroups                linux/amd64  ok
io.containerd.service.v1           tasks-service          -            ok
io.containerd.grpc.v1              introspection          -            ok
io.containerd.internal.v1          restart                -            ok
io.containerd.grpc.v1              containers             -            ok
io.containerd.grpc.v1              content                -            ok
io.containerd.grpc.v1              diff                   -            ok
io.containerd.grpc.v1              events                 -            ok
io.containerd.grpc.v1              healthcheck            -            ok
io.containerd.grpc.v1              images                 -            ok
io.containerd.grpc.v1              leases                 -            ok
io.containerd.grpc.v1              namespaces             -            ok
io.containerd.internal.v1          opt                    -            ok
io.containerd.grpc.v1              snapshots              -            ok
io.containerd.grpc.v1              tasks                  -            ok
io.containerd.grpc.v1              version                -            ok
io.containerd.tracing.processor.v1 otlp                   -            skip
io.containerd.internal.v1          tracing                -            ok
```
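As a rough sketch of the detection logic, output in the `ctr plugins ls` shape shown above could be filtered to keep only snapshotter plugins whose status is `ok`. This is an assumption based purely on the column layout above, not on the containerd client API:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// snapshotters extracts the IDs of snapshotter plugins whose STATUS
// column is "ok" from `ctr plugins ls`-style output.
func snapshotters(out string) []string {
	var found []string
	sc := bufio.NewScanner(strings.NewReader(out))
	for sc.Scan() {
		// Expected columns: TYPE ID PLATFORMS STATUS
		fields := strings.Fields(sc.Text())
		if len(fields) == 4 &&
			fields[0] == "io.containerd.snapshotter.v1" &&
			fields[3] == "ok" {
			found = append(found, fields[1])
		}
	}
	return found
}

func main() {
	out := `io.containerd.snapshotter.v1 devmapper linux/amd64 error
io.containerd.snapshotter.v1 overlayfs linux/amd64 ok
io.containerd.snapshotter.v1 btrfs linux/amd64 ok`
	fmt.Println(snapshotters(out)) // [overlayfs btrfs]
}
```

In a real implementation the same information would come from containerd's introspection API via its Go client rather than from parsing CLI output, but the filtering idea is the same.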
I was talking to @mythi, and he mentioned he'd also like to check whether nri is enabled or not, which may be a second use case for this.
Mm, I immediately see two problems here. containerd.sock (and ctr) requires root access, plus our container base image is scratch and we only ship nfd binaries, nothing else (and going forward we'd prolly want to keep it that way)
> Mm, I immediately see two problems here. containerd.sock (and ctr) requires root access, plus our container base image is scratch and we only ship nfd binaries, nothing else (and going forward we'd prolly want to keep it that way)
Just to make it clear, I'm not suggesting to ship ctr, but rather implement it on our end using the go package provided by containerd. Now, containerd.sock does require root, indeed. :-/
/cc @zvonkok
What about a side-car container that writes to `/etc/kubernetes/node-feature-discovery/features.d`? Like what GPU feature-discovery is doing? Once you have NFD deployed, kata-deploy can deploy "anything" in a side-car container to detect "anything"?
> What about a side-car container that writes to `/etc/kubernetes/node-feature-discovery/features.d`? Like what GPU feature-discovery is doing? Once you have NFD deployed, kata-deploy can deploy "anything" in a side-car container to detect "anything"?
Good point. A feature hook using the local source could probably work.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
/remove-lifecycle stale
@fidencio @zvonkok @mythi is this topic still relevant?