nri icon indicating copy to clipboard operation
nri copied to clipboard

sync sandboxes and containers after starting the pre-installed plugins

Open Iceber opened this issue 2 years ago • 10 comments

For plugins registered with nri.sock, NRI will send the full set of pods and containers after starting. https://github.com/containerd/nri/blob/83295992ed466e5589a1a83ba3ad2bd6b6feb199/pkg/adaptation/adaptation.go#L405-L430

For pre-installed plugins, it is still necessary to synchronize the sandboxes and containers data to the plugin after launch,otherwise the plugin will never be able to get information about existing resources

Iceber avatar May 18 '23 06:05 Iceber

Codecov Report

Patch coverage: 27.27% and project coverage change: -0.08 :warning:

Comparison is base (2a8b655) 63.83% compared to head (8fa7ae9) 63.75%.

:exclamation: Current head 8fa7ae9 differs from pull request most recent head 75ad210. Consider uploading reports for the commit 75ad210 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #43      +/-   ##
==========================================
- Coverage   63.83%   63.75%   -0.08%     
==========================================
  Files           9        9              
  Lines        1800     1810      +10     
==========================================
+ Hits         1149     1154       +5     
- Misses        500      503       +3     
- Partials      151      153       +2     
Impacted Files Coverage Δ
pkg/adaptation/adaptation.go 70.27% <27.27%> (-1.62%) :arrow_down:

... and 1 file with indirect coverage changes

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

codecov-commenter avatar May 18 '23 06:05 codecov-commenter

@klihub I changed the call to syncFn after each pre-install plugin launch to only call syncFn once after all plugins are launched, which avoids unnecessary multiple calls to syncFn in the case of a large number of plugins https://github.com/containerd/nri/compare/104c8e844981be230d53ffac4f9eee5d1db0634e..6f7af44d11899f94a6f619bab0bba6b012bd8e4e

PTAL, Thanks

Iceber avatar May 18 '23 10:05 Iceber

@klihub I changed the call to syncFn after each pre-install plugin launch to only call syncFn once after all plugins are launched, which avoids unnecessary multiple calls to syncFn in the case of a large number of plugins https://github.com/containerd/nri/compare/104c8e844981be230d53ffac4f9eee5d1db0634e..6f7af44d11899f94a6f619bab0bba6b012bd8e4e

PTAL, Thanks

True, better to do it that way. LGTM.

klihub avatar May 18 '23 10:05 klihub

Can we merge this as is ? It does fix a blocking bug

champtar avatar Mar 16 '24 00:03 champtar

ping @mikebrow @dmcgowan this PR is almost 1 year old

champtar avatar May 10 '24 14:05 champtar

@klihub let's chat with @Iceber if still interested .. maybe at the next or subsequent containerd call

mikebrow avatar May 22 '24 22:05 mikebrow

Hi @mikebrow @klihub Do you have any new ideas for this pr?

I think synchronizing all and reporting back all errors is a good way which gives a more complete picture of the errors that may occur

Of course if still think that skipping and printing warnings is a good way to go, I'll adopt it

Also for more complex configurations my thoughts are:

  1. pre-installed binaries are required by default, I don't think it's necessary for the user to place a bunch of plugin binaries in the directory that don't care if they are successfully launched or not
  2. you can enable/disable some pre-installed plugins by configuring the plugin name, so that only some plugins can be started, or some plugins won't be started at all I think this feature can be placed in another pr

ci fails from errors.Join, It comes from 1.20, we might consider upgrading go.mod to 1.20

Iceber avatar Jun 04 '24 11:06 Iceber

ci fails from errors.Join, It comes from 1.20, we might consider upgrading go.mod to 1.20

I filed PR #88 to bump the minimum golang requirement to 1.20. It should be fine for all of our downstream packages. You can then rebase this PR if #88 gets merged.

klihub avatar Jun 04 '24 17:06 klihub

Hi @mikebrow @klihub Do you have any new ideas for this pr?

I think synchronizing all and reporting back all errors is a good way which gives a more complete picture of the errors that may occur

I think synchronizing pre-launched plugins (pre-installed plugins in your terminology) would be essential to make sure that we have an identical handshake/initial message flow, regardless of whether a plugin is pre-launched or not. I made a mistake and that is why I completely overlooked that pre-launched plugins do not get properly synchronized during registration. It was not intentional. It would be better to send an empty sync message than no message at all. There are packages out there which do expect the synchronization message to come before they consider themselves ready to start processing other NRI messages.

That said, I think synchronizing pre-launched plugins is only/really relevant for the case when an already active runtime (with existing containers) gets restarted (after all, on initial startup the set of pods and containers to sync with the plugins is empty). So a related question is whether the runtime itself is able to properly determine/rediscover the state of pods and containers by the time pre-launched plugins are being started and what to do if it is not. And this needs to be checked/tested both with containerd and cri-o because they might behave differently. I have a vague recollection that containerd itself would behave slightly differently for 2.0 than 1.7, but I haven't had time right no to check/test this in practice.

klihub avatar Jun 04 '24 17:06 klihub

That said, I think synchronizing pre-launched plugins is only/really relevant for the case when an already active runtime (with existing containers) gets restarted (after all, on initial startup the set of pods and containers to sync with the plugins is empty). So a related question is whether the runtime itself is able to properly determine/rediscover the state of pods and containers by the time pre-launched plugins are being started and what to do if it is not. And this needs to be checked/tested both with containerd and cri-o because they might behave differently. I have a vague recollection that containerd itself would behave slightly differently for 2.0 than 1.7, but I haven't had time right no to check/test this in practice.

yeah, this is very important, and can be very bad if the synchronization is incomplete with pods and containers.

In containerd 2.0, the recover function initializes the data before the nri is initialized. https://github.com/containerd/containerd/blob/45bc430dd12539ee93f908750ce24e52e1122e9a/internal/cri/server/service.go#L257-L300

Iceber avatar Jun 06 '24 09:06 Iceber

@Iceber wdyt about the logging changes..

mikebrow avatar Sep 18 '24 22:09 mikebrow

@Iceber wdyt about the logging changes..

Sorry I missed the email on this pr, I'll take care of it soon

Iceber avatar Sep 19 '24 01:09 Iceber

@mikebrow I've added a new commit for better viewing of the changes

I've returned plugins that failed to start to syncFn via syncPlugins. after all, those plugins that didn't start are also sync failures/unsyncable plugins

Iceber avatar Sep 19 '24 08:09 Iceber

@klihub did you want to take one more look with the additional changes?

mikebrow avatar Sep 20 '24 16:09 mikebrow

Yes, I'll do that.

klihub avatar Sep 20 '24 16:09 klihub

Yes, I'll do that.

It'll take me a bit more time though. I'm not at the keyboard ATM.

klihub avatar Sep 20 '24 17:09 klihub

ping @Iceber Apart from a few small nits it looks very good.

klihub avatar Sep 21 '24 07:09 klihub

It looks pretty good. I spotted a few wrongly formatted log messages (error wrapping vs. default value format %verb), and one oddly returned (value, error) combo.

Then I have a question to @mikebrow about what we should do if a pre-installed plugin failed to start up or get synchronized. The original code errors out and this PR keeps that behavior intact. But now I started to have second thoughts about that. Do we want to prevent the runtime from starting up in such a case, or just log errors, ignore failing plugins and keep going ?

If we get the few log formatting problems fixed, maybe also change that odd return value combo to the standard convention of 'return a nil-value with a non-nil error', then I think this is good to go in.

But if @mikebrow is of the opinion that we should err on the side of caution and ignore failing plugins instead of preventing the runtime from starting up, then we should also change that behavior.

In r.next 1.1 I'd like us to consider adding plugin config to describe ordering(after:plugin)/dependencies(requires:plugin)/runtime required(MUST) type information that we would use to order the plugin bring up and whether or not a plugin is required to run a container/pod which would give us a good reason, for example, to not report CRI plugin success and thus block pods on the node. Until then I'd prefer to log errors and continue for failing plugins vs bring the runtime down when there is an unknown / unhandled issue with a plugin.

We simply can't know if the plugin is required always or if it is only required for certain nodes or if the error is related to some unknown resource device dependency that is only met on some of the worker nodes. This gives the admin the ability to test out deployment of new plugins on a running machine, remove the failing ones report logs etc.

Would be easy enough to add the MUST run flag now if you like. Dependency ordering stuff would be harder.

mikebrow avatar Sep 23 '24 17:09 mikebrow

In r.next 1.1 I'd like us to consider adding plugin config to describe ordering(after:plugin)/dependencies(requires:plugin)/runtime required(MUST) type information that we would use to order the plugin bring up and whether or not a plugin is required to run a container/pod which would give us a good reason, for example, to not report CRI plugin success and thus block pods on the node. Until then I'd prefer to log errors and continue for failing plugins vs bring the runtime down when there is an unknown / unhandled issue with a plugin.

We simply can't know if the plugin is required always or if it is only required for certain nodes or if the error is related to some unknown resource device dependency that is only met on some of the worker nodes. This gives the admin the ability to test out deployment of new plugins on a running machine, remove the failing ones report logs etc.

Sounds like a good plan to me.

@Iceber Could you update the PR to still log errors from plugin startup or synchronization failures, but return no error so we don't prevent the runtime from starting up.

klihub avatar Sep 23 '24 18:09 klihub

@mikebrow @klihub updated to print logs only, PTAL, Thanks

Iceber avatar Sep 24 '24 06:09 Iceber