nri Looking for collaborators / ideas on how to leverage an NRI plugin for the purpose of managing identity artifacts in containers/pods

The idea is to use an NRI plugin to manage the setup of identity artifacts for containers/pods. Oriented around Spiffe/Spire, the idea is to create and mount identity artifacts/certificates at container creation time. Instead of the application in the container creating and fetching its identity artifacts, this NRI identity plugin would set them up for the application/workload. An alternative is to extend Envoy to manage identities for the application. An additional idea, which I don't fully understand yet, is to let applications access files outside the container by mounting a root hosted file.

This is from a rough discussion I had with @mikebrow

Open Questions:

Spiffe SVID certificates/documents are short-lived. What should happen to the container/pod after the certificate expires?
- Restart or fail the pod/container.
- Pause the container.
- Do some tricks to update the mounted certificate/artifact dynamically.

Some links from our discussions:

https://github.com/spiffe/spire-plugin-sdk
https://medium.com/kagenti-the-agentic-platform/identity-in-agentic-platforms-enabling-secure-least-privilege-access-996527f1c983
https://github.com/kagenti/kagenti/tree/main/kagenti/examples/identity

Slack discussion thread: https://cloud-native.slack.com/archives/CGEQHPYF4/p1760517445810219

Oct 15 '25 08:10 atpugtihsrah

@atpugtihsrah Thanks for reaching out with this.

Do you have a specific use case or concrete example in mind for what this could be used for in practice? Just to provide a bit more context, which could then help folks unfamiliar with Spiffe/Spire (or, like myself, with the whole problem space at large) better grasp the idea and goals here.

I've looked at the Medium article you linked to, trying to put some things into perspective and understand the big picture. With that in mind, are these assumptions correct:

- the NRI plugin you describe here is (in a way a part of) the Agent Orchestrator
- the workloads this plugin would be handling are the agents themselves, trying to access some 3rd party tools
- the identity artifacts are expected by those agents to be in place (IOW they know how to use them), for gaining access to those tools

The idea is to use an NRI plugin to manage the setup of identity artifacts for containers/pods. Oriented around Spiffe/Spire, the idea is to create and mount identity artifacts/certificates at container creation time.

Now, this is a way too early comment from me, because it is related to an implementation detail, but I thought it's good to get it out of the way early on. As a rule of thumb, you should not attempt to take any time-consuming action synchronously when you are processing an NRI CreateContainer event (IOW, when you are in the process of making final adjustments to a container that is about to be created). So in this case I think you should simply create and bind-mount the artifact directory into the container, and fire off an asynchronous Spiffe/Spire request (if there is such a thing) for creating the container's identity artifacts. Then write the artifacts into the directory once Spiffe/Spire is done with it (and hope that this has happened by the time the container is started).

Instead of the application in the container creating and fetching its identity artifacts, this NRI identity plugin would set them up for the application/workload. An alternative is to extend Envoy to manage identities for the application. An additional idea, which I don't fully understand yet, is to let applications access files outside the container by mounting a root hosted file.

I think the only way to inject the artifacts into the agent/workload's container is to create a container-specific directory on the host and bind mount it into the container.

This is from a rough discussion I had with @mikebrow

Open Questions:

Spiffe SVID certificates/documents are short-lived. What should happen to the container/pod after the certificate expires?

Restart or fail the pod/container.

This is possible, but currently in a way they are the same thing in NRI. There is no way (for the runtime) to permanently fail or evict a container from a host. AFAIK, there is simply no mechanism for this in the CRI protocol. So an NRI triggered eviction just stops the container. The higher-level K8s controller responsible for the container's lifecycle will then act in response to this, usually (removing, recreating and then) restarting the container (unless its restart policy dictates otherwise).

Pause the container.

I think this is not really possible. Unlike Docker containers, it is conceptually not possible to pause a K8s container.

Do some tricks to update the mounted certificate/artifact dynamically.

If you inject artifacts using a bind mount, then the mechanics of accomplishing this should be easy. Just update the artifacts in the directory you bind-mounted and that's it.

Oct 17 '25 06:10 klihub

Great points @klihub. We will need this to work with VM-isolated containers, which should be fine. It would be nice to mount it read-only in the container by default. I really like setting up the bind mount first in create, then updating the contents in parallel and making sure that's finished before start. We might want the option of a readiness probe. We might also want to be able to tie this to the pod lifecycle and make it available to selected containers in the pod, with various sub-paths.

As for pausing... we do that today during a checkpoint, which is how CRIU works. Nod to the point about k8s not understanding those additional pod/container states. In the back of my head it was more of a pause option while updating the contents with a new cert, something the application running in the container could have a trigger on.

Oct 17 '25 16:10 mikebrow

Thanks @klihub for your comments! I am also a noob here 😅 noob to both container runtimes and identity.

Do you have a specific use case or concrete example in mind for what this could be used for in practice?

This is all that we had in mind at the moment. I personally see it as "identity by default". I am exploring a few other identity frameworks, but Spiffe/Spire's solution for password-less "secret zero" and short-lived certificates is very compelling. @mikebrow Did you have any other use cases in mind?

the NRI plugin you describe here is (in a way a part of) the Agent Orchestrator

No no. How I understand an NRI plugin (and please correct me if I am wrong) is that it is a process that runs on every node. An NRI plugin can subscribe to events for all containers on its node. So how I imagine this NRI identity plugin working is that it is basically a proxy that fetches identity documents for each and every container on its node. So it's not at all part of the Agent Orchestrator. This means the NRI identity plugin will fetch the identity document for the Agent Orchestrator and mount it, and it will also fetch the identity document for each AI agent and mount it. Using an identity plugin to mount identity certificates into AI agents is the most attractive use case right now (because of how agents work).

As I understand it, Spiffe/Spire does not have its own in-built authorization/policy engine yet. Identity answers "who is calling" but authorization/policy answers "what can the caller do". There is some work in that area so in the future either this identity plugin or perhaps a separate policy plugin will also fetch policy/authorization details. Reference Issue: https://github.com/spiffe/spire/issues/1975

the identity artifacts are expected by those agents to be in place (IOW they know how to use them), for gaining access to those tools

This plugin would not just be for AI or agents. It could also be used with Kafka or other message brokers, on the containers of applications reading from or writing to message queues. It should be usable with anything that can be containerised and needs an identity to talk to other services.

the workloads this plugin would be handling are the agents themselves, trying to access some 3rd party tools

Any and all workloads. Agents are the first thought these days because of how they can call the APIs of multiple products in a single user request/chat. Another area for this plugin would be CI/CD applications/containers. It should be possible for all of Netflix's famous microservices to use this plugin to fetch their identity documents.

As a rule of thumb, you should not attempt to take any time-consuming action synchronously when you are processing an NRI CreateContainer event (IOW, when you are in the process of making final adjustments to a container that is about to be created).

Thanks, noted! Really good point!

Oct 18 '25 12:10 atpugtihsrah

will need this to work with VM isolated containers

Thanks, another good point!

Oct 18 '25 12:10 atpugtihsrah

the NRI plugin you describe here is (in a way a part of) the Agent Orchestrator

More specifically, it's a proposal to make the magic happen via an NRI plugin vs. running an init container.

Oct 20 '25 15:10 mikebrow

See device injector's mount annotations for an example...

doc: https://github.com/containerd/nri/tree/main/plugins/device-injector#mount-annotations

sample pod spec annotation for adding a mount to a container: https://github.com/containerd/nri/blob/main/plugins/device-injector/sample-device-inject.yaml#L24-L30

code: https://github.com/containerd/nri/blob/main/plugins/device-injector/device-injector.go#L206-L230

Oct 31 '25 14:10 mikebrow

FYI, another approach I've seen is to use a CSI driver to deliver SPIFFE SVIDs to pods:

https://github.com/cert-manager/csi-driver-spiffe

Just out of curiosity, are there specific properties that make the NRI approach preferable here? (I'm not super familiar with VM-isolated containers and whether they would work with CSI volumes.)

Nov 21 '25 23:11 grosskur

Timing problems, the principle of least privilege, and resource overhead all point to the container runtime itself being best suited to these particular tasks.

Nov 24 '25 15:11 mikebrow