
State sync/Garbage collection for plugins


It is possible for a plugin to restart independently of the runtime (e.g. to recover from errors). Because CNI does not provide a way for a plugin to sync its state upon restart, plugins are forced to resort to out-of-band mechanisms to catch up on any state changes that occurred during their downtime. Beyond restarts, there are other scenarios where state can get out of sync as well. For completeness, it would be better to enhance CNI with an up-call that addresses this need.

Comments appreciated.

jojimt avatar Nov 29 '16 15:11 jojimt

You're suggesting a very significant departure from the existing CNI model.

Can you describe your use case in more detail? What sort of state would you imagine needs to be sync'd on what kinds of events?

rosenhouse avatar Jan 20 '17 19:01 rosenhouse

Yes, I do realize it is significant. This is the use case: The plugin is temporarily unavailable (e.g. recovering from a crash). Now, the runtime issues a DEL because a container is being deleted. This call will naturally return an error. However, most runtimes in this scenario ignore the error and proceed with the delete, leaving stale entries/resources in the plugin.

There are probably two ways to address this:

1. Have runtimes fail the delete when the plugin returns an error.
2. Have a sync mechanism as proposed here.

It is tempting to choose option 1 as it keeps the interface simpler. However, from the caller's perspective, rolling back a delete that is probably already partially done is hard. Hence I am proposing option 2.

jojimt avatar Jan 24 '17 16:01 jojimt

@jojimt

Thanks for the use-case description. That makes a lot of sense, but I'd question whether a sync mechanism is the best solution.

As an alternative, I would propose option 1, but I wouldn't expect a "roll back" of a failed delete. Instead, I would expect the runtime to retry failed deletes. This is consistent with the general principle that a client (runtime) should retry when a service (plugin) is temporarily unavailable.

This seems related to #309. Perhaps plugins could return a special error code indicating that their backing system is unavailable. That code would signal to the caller (runtime) that it should retry.
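To make the retry idea concrete, here is a rough Go sketch of a runtime-side DEL loop that backs off while the plugin reports a transient failure. The error-code value (11) follows the "try again later" convention the spec later adopted, and the hard-coded paths, backoff values, and helper shape are illustrative rather than anything defined by CNI.

package cniretry

import (
	"bytes"
	"encoding/json"
	"fmt"
	"os/exec"
	"time"
)

// cniError mirrors the JSON error object a plugin writes to stdout on failure.
type cniError struct {
	Code uint   `json:"code"`
	Msg  string `json:"msg"`
}

// delWithRetry execs the plugin binary with CNI_COMMAND=DEL and retries
// with backoff while the plugin keeps signalling a transient failure.
func delWithRetry(pluginPath, netConf, containerID, netns string) error {
	backoff := time.Second
	for attempt := 0; attempt < 5; attempt++ {
		cmd := exec.Command(pluginPath)
		cmd.Env = []string{
			"CNI_COMMAND=DEL",
			"CNI_CONTAINERID=" + containerID,
			"CNI_NETNS=" + netns,
			"CNI_IFNAME=eth0",
			"CNI_PATH=/opt/cni/bin",
		}
		cmd.Stdin = bytes.NewBufferString(netConf)
		out, err := cmd.Output() // stdout holds the JSON error object on failure
		if err == nil {
			return nil // DEL confirmed; the runtime can now forget this container
		}
		var e cniError
		if json.Unmarshal(out, &e) == nil && e.Code == 11 {
			// Assumed "try again later" code: the plugin's backing store is
			// temporarily unavailable, so wait and retry instead of giving up.
			time.Sleep(backoff)
			backoff *= 2
			continue
		}
		return fmt.Errorf("DEL failed: %v (%s)", err, e.Msg)
	}
	return fmt.Errorf("DEL still failing after retries; keep the delete queued")
}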

rosenhouse avatar Jan 24 '17 17:01 rosenhouse

If the retry mechanism can be implemented, that would address this use case. It still leaves out the case where the runtime itself restarts with retries pending. Would the runtime be expected to remember the partially deleted state?

jojimt avatar Jan 24 '17 18:01 jojimt

We're veering away from CNI and into runtime design questions, but in short, yes.

I wouldn't call it "partially deleted state" though. I would describe it as "containers that shouldn't exist, but the runtime hasn't yet confirmed their deletion." The runtime should maintain a consistent on-disk store, which it updates only on successful creation or successful deletion of containers. The runtime is responsible for converging that to the desired state. Calling CNI DEL is part of the convergence.
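As a rough illustration of that convergence loop (the Store interface, the desired set, and the callCNIDel hook below are made-up names, not any real runtime's API):

package reconcile

import "time"

// Store is the runtime's consistent on-disk record, updated only on a
// confirmed create or a confirmed delete.
type Store interface {
	Known() []string           // container IDs with networking recorded on disk
	Forget(containerID string) // called only after a successful DEL
}

// converge keeps calling DEL for any recorded container that should no
// longer exist, so a temporarily unavailable plugin is retried rather than
// skipped past.
func converge(s Store, desired map[string]bool, callCNIDel func(id string) error) {
	for {
		for _, id := range s.Known() {
			if desired[id] {
				continue // container should exist; nothing to do
			}
			if err := callCNIDel(id); err == nil {
				s.Forget(id) // only forget once the plugin has confirmed the DEL
			}
		}
		time.Sleep(30 * time.Second)
	}
}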

I can't speak to how other systems work, but the container orchestrator we use inside of Cloud Foundry does this. I would expect other runtimes to do the same.

rosenhouse avatar Jan 24 '17 21:01 rosenhouse

We're encountering a problem in OpenShift where the host-local IPAM plugin is slowly leaking addresses (a few per month on the multi-thousand-user OpenShift Online free starter tier). Clearly this points to some bug somewhere in kubelet, but it's apparently very edge-casey, and there's no obvious indication when it happens--we just notice a stale file in /var/lib/cni/networks at some point long after the fact--so it's not easily debugged. (The leaks seem to happen in groups, though; the current guess is that they happen when docker/openshift is restarted (or crashes?), and then somehow kubelet, docker, and CNI get out of sync about which containers they know about.)

We can do a GC manually, but it requires completely breaking all of host-local's abstractions (i.e., knowing which directory it stores its data in, what the contents of those files mean, etc.). It would be nicer if we could just tell the plugin, somehow, "here are all the containers we know about, so if you thought you knew about any others, forget about them". I guess if CNI didn't know about some of the containers that kubelet knew about, then we'd want to fix that too. So maybe it would look something like:

SYNC CNI_CONTAINERIDS=["123456789", "456789123", "abcdef012"]
ADD CNI_SYNC=true CNI_CONTAINERID=123456789 CNI_NETNS=/proc/pid/123/ns/net ...
ADD CNI_SYNC=true CNI_CONTAINERID=456789123 CNI_NETNS=/proc/pid/456/ns/net ...
ADD CNI_SYNC=true CNI_CONTAINERID=abcdef012 CNI_NETNS=/proc/pid/789/ns/net ...

Plugins that wanted to ignore syncs could just ignore the SYNC command and any ADD that had CNI_SYNC=true set. (Alternatively, the ADDs could be UPDATEs, #89.)
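For what it's worth, a plugin-side sketch of that hypothetical SYNC verb might look like the Go below. CNI_CONTAINERIDS, the per-container state files, and the "first line of each file is the owning container ID" layout are all assumptions for illustration; nothing like this exists in the spec today.

package syncsketch

import (
	"encoding/json"
	"os"
	"path/filepath"
	"strings"
)

// handleSync drops any allocation whose container ID the runtime no longer
// reports. stateDir stands in for wherever the plugin keeps per-container
// state; the whole point of a SYNC verb is that the runtime would not need
// to know that location or format.
func handleSync(stateDir string) error {
	// Hypothetical env var carrying the runtime's full container list as JSON,
	// matching the example above.
	var known []string
	if err := json.Unmarshal([]byte(os.Getenv("CNI_CONTAINERIDS")), &known); err != nil {
		return err
	}
	keep := map[string]bool{}
	for _, id := range known {
		keep[id] = true
	}

	entries, err := os.ReadDir(stateDir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		path := filepath.Join(stateDir, e.Name())
		data, err := os.ReadFile(path)
		if err != nil {
			continue
		}
		// Assume the first line of each state file records the owning container ID.
		owner := strings.SplitN(string(data), "\n", 2)[0]
		if !keep[owner] {
			os.Remove(path) // stale allocation: the runtime has forgotten this container
		}
	}
	return nil
}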

Doing this at kubelet startup would also be nice for us for other reasons. (We currently already do a loop over running pods at startup, in case any pods need to be updated for anything that changed while the kubelet wasn't running. Eg, new routes, or changed VNID tags in OVS flows. Having a real "sync" operation would make that less of a special case.)

cc @dcbw

danwinship avatar Dec 12 '17 16:12 danwinship