nri [RFC] Add v1beta1 API revision, protocol translation for v1alpha1 plugins.

The primary motivation for this PR is to clean up some annoying bits in the NRI wire protocol.

The most prominent such bit is the lack of proper, dedicated wire RPC calls and messages for some pod and container lifecycle events. These are instead multiplexed over a single RPC call and message (StateChange, StateChangeEvent), lacking any event-specific returned data and making any future extension unnecessarily difficult. Other, smaller annoyances are the usage of a single Empty response type in multiple other RPC calls, with similar negative side effects.

Addressing these alone is a more intrusive protocol change than the usual addition of new NRI inputs or controls and the related protocol extension with new fields. Therefore this PR defines a new v1beta1 revision of the protocol, initially identical to v1alpha1, which is then updated address the aforementioned annoyances.

Some of the original protocol problems are also reflected in the stub and the plugin/stub API. This PR tries to clean some of those up as well. It defines a new v1beta1 stub, turning the original one into a v1alpha1 revision.

In an attempt to provide a painless transition path for plugins (and to experiment with it in practice) this PR also implements an internal transparent protocol conversion from v1alpha1 to v1beta1, for plugins which register themselves talking the earlier version of the protocol.

This PR is still work-in-progress, but I'd like to gather some initial thoughts and general feedback from anybody who has the time and interest to take a look at this, already in its current state.

Here is the list of things (I think) this PR addresses in v1beta1:

eliminate on wire StateChange in favor of dedicated event-specific requests
eliminate remaining usage of Empty (plugin registration, plugin shutdown, wasm plugin log bridging)
implement plugin shutdown, use it where it makes sense instead of just closing the connection
delay closing plugin connections for fatal errors, giving error propagation back to plugins a chance

Here is a list of things I either know are still missing or I'd like to additionally address. This is probably still incomplete. I'll try to update it as I can.

[ ] adaptation test suite tests for protocol conversion
[ ] documentation updates
[ ] additional stub/plugin API cleanups

@mikebrow @chrishenzie @marquiz If you have some BW, PTAL.

Oct 01 '25 16:10 klihub

Interesting changes.. let's do a call on this next week covering the motivations. I'm LGTM for a versioning practice run to manage proto version extensions/enhancements. Not sure about the need to split state change notifications into event type unique rpcs? Would like to hear some ideas on scenarios where that would be helpful. The other changes all sound like good tech debt cleanup.

Oct 01 '25 17:10 mikebrow

Interesting changes.. let's do a call on this next week covering the motivations. I'm LGTM for a versioning practice run to manage proto version extensions/enhancements. Not sure about the need to split state change notifications into event type unique rpcs? Would like to hear some ideas on scenarios where that would be helpful. The other changes all sound like good tech debt cleanup.

Okay, let's have a chat about this next week.

Oct 01 '25 18:10 klihub

Interesting changes.. let's do a call on this next week covering the motivations. I'm LGTM for a versioning practice run to manage proto version extensions/enhancements. Not sure about the need to split state change notifications into event type unique rpcs? Would like to hear some ideas on scenarios where that would be helpful. The other changes all sound like good tech debt cleanup.

@mikebrow @chrishenzie Some context about the motivation to ditch once and for all StateChange on the wire and replace it with proper dedicated services and related request/response messages. Our API both on the runtime and the client side has never exposed StateChange. Instead we try to pretend that there are dedicated calls and messages for each one multiplexed over StateChange. In practice what happens is that while both the runtime and the plugins see, for instance, something like RunPodSandox(pod), StopPodSandbox(pod), PostCreateContainer(pod, container), StartContainer(pod, container), in reality we pass this over on the wire with something like

&StateChangeEvent{
    Event: Event_{RUN, STOP, POST_UDPATE,REMOVE}_POD_SANDBOX,
    Pod: pod,
    Container: nil,
}
&StateChangeEvent{
    Event: Event_{POST_CREATE,START,POST_START,POST_UPDATE,REMOVE}_CONTAINER,
    Pod: pod,
    Container: container,
}

while on the return path we always have

type (
    {Run,Stop,PostUpdate,Remove}PodSandboxResponse = Empty
    {PostCreate,Start,PostStart,PostUpdate,Remove}ContainerResponse = Empty
    Empty struct {}
)

This was a futile and pretty misguided attempt, by me in momentary lapse of reason/worse than usual brain damage, to try and keep the protocol looking (slightly) small(er than otherwise). IMO, it did not help or make much sense at that time, and it does even less now. The only thing it causes is problems, and artificially created ones at that.

The initial collective property for all requests mutiplexed as StateChange was that they all had event-like semantics, IOW the plugin/client could not request altering anything in response. A secondary one was that, initially, all these requests had either (PodSandbox) or (PodSandbox, Container) inputs, so the 'union content' approach for StateChangeEvent did not look too ugly. But this already became a problem when @chrishenzie was adding the UpdatePodSandbox request. That, too, was event-like in this sense, but the inputs already did not fit into StateChangeEvent. So there was a dilemma whether to 1) go down the path of further uglification by adding Overhead *LinuxContainerResource, Resources *LinuxContainerResources to the 'union of inputs' or 2) handle it with a proper dedicate request and messages and blow the rest of StateChange apart in a similar fashion(... eventually, while living with the now inconsistent approach for the time being).

We decided to go with the second option, but haven't blown apart StateChange yet. I'd like to do that now, while we're rolling a next wire protocol version as part of this PR.

Oct 09 '25 07:10 klihub

Interesting changes.. let's do a call on this next week covering the motivations. I'm LGTM for a versioning practice run to manage proto version extensions/enhancements. Not sure about the need to split state change notifications into event type unique rpcs? Would like to hear some ideas on scenarios where that would be helpful. The other changes all sound like good tech debt cleanup.

@mikebrow @chrishenzie Some context about the motivation to ditch once and for all StateChange on the wire and replace it with proper dedicated services and related request/response messages. Our API both on the runtime and the client side has never exposed StateChange. Instead we try to pretend that there are dedicated calls and messages for each one multiplexed over StateChange. In practice what happens is that while both the runtime and the plugins see, for instance, something like RunPodSandox(pod), StopPodSandbox(pod), PostCreateContainer(pod, container), StartContainer(pod, container), in reality we pass this over on the wire with something like
&StateChangeEvent{
    Event: Event_{RUN, STOP, POST_UDPATE,REMOVE}_POD_SANDBOX,
    Pod: pod,
    Container: nil,
}
&StateChangeEvent{
    Event: Event_{POST_CREATE,START,POST_START,POST_UPDATE,REMOVE}_CONTAINER,
    Pod: pod,
    Container: container,
}
while on the return path we always have
type (
    {Run,Stop,PostUpdate,Remove}PodSandboxResponse = Empty
    {PostCreate,Start,PostStart,PostUpdate,Remove}ContainerResponse = Empty
    Empty struct {}
)
This was a futile and pretty misguided attempt, by me in momentary lapse of reason/worse than usual brain damage, to try and keep the protocol looking (slightly) small(er than otherwise). IMO, it did not help or make much sense at that time, and it does even less now. The only thing it causes is problems, and artificially created ones at that.

The initial collective property for all requests mutiplexed as StateChange was that they all had event-like semantics, IOW the plugin/client could not request altering anything in response. A secondary one was that, initially, all these requests had either (PodSandbox) or (PodSandbox, Container) inputs, so the 'union content' approach for StateChangeEvent did not look too ugly. But this already became a problem when @chrishenzie was adding the UpdatePodSandbox request. That, too, was event-like in this sense, but the inputs already did not fit into StateChangeEvent. So there was a dilemma whether to 1) go down the path of further uglification by adding Overhead *LinuxContainerResource, Resources *LinuxContainerResources to the 'union of inputs' or 2) handle it with a proper dedicate request and messages and blow the rest of StateChange apart in a similar fashion(... eventually, while living with the now inconsistent approach for the time being).

We decided to go with the second option, but haven't blown apart StateChange yet. I'd like to do that now, while we're rolling a next wire protocol version as part of this PR.

Thanks for the detailed explanation. SGTM!!

events is a very overloaded term.. Calling them something else might help :-)

Oct 09 '25 14:10 mikebrow