flux-core icon indicating copy to clipboard operation
flux-core copied to clipboard

etc: support content.backing-module=none

Open chu11 opened this issue 3 years ago • 6 comments

Per discussion in #4267, always loading the backing module content-sqlite can lead to performance issues, especially for shorter lived single user instances. Default to not loading a backing module.

  • to enable this, cache checkpoint put/get in the broker even when a backing module does not exist. This allows a number of checkpoint operations to still work even if a backing module isn't loaded.
  • update rc1/rc3 to not load the checkpoint module by default
    • update systemd service file to default configure loading content-sqlite
    • update tests that require content-sqlite to specifically configure for it

one possible remaining gotcha is that if the user doesn't set a backing module, they are still allowed to load one and that becomes the backing module. Dunno if we'd like to support a special backing module config of "none" (or equivalent word) to say "do not allow a backing module to be loaded"?

chu11 avatar Aug 13 '22 05:08 chu11

In #4267

I'm not sure about making this is the default, since there is unexplained job throughput degradation in this mode. However it is not currently possible to even select this mode. Let's allow this issue to be closed once setting content.backing-module=none works as you'd expect.

Could we cut this PR there and save the discussion about making it the default for another time?

garlick avatar Aug 13 '22 14:08 garlick

Could we cut this PR there and save the discussion about making it the default for another time?

ahhh, I missed that. I'll tweak the PR. Should be simpler as a result as many testsuite updates don't need to happen now.

chu11 avatar Aug 13 '22 14:08 chu11

One other thought is would it make sense to split the broker checkpoint stuff off to another source file? content-cache.c is already long and complex and this seems to be somewhat disjoint.

garlick avatar Aug 13 '22 14:08 garlick

re-pushed

  • split out content checkpoint code into another file
    • the "context' that the content-checkpoint API creates is actually stored within struct content_cache and not struct broker. I didn't initially intend for this, but content-cache needs to call content-checkpoint at one point, so I felt it better to do it this way.
  • rc scripts do not load none by default now

chu11 avatar Aug 16 '22 04:08 chu11

rc scripts do not load none by default now

You are probably already on it, but just in case: don't forget to update PR description if it no longer describes the overall change.

grondo avatar Aug 16 '22 14:08 grondo

Edit: just realized this checkpoint service is created on all ranks, but I think it should only be created on rank 0? If for some reason it is accessed from other ranks, it should just let the broker forward the requests to rank 0 (which is what happens if the message handlers are only registered on rank 0).

Oh good catch, although I think we want to load the checkpoint service all of the time. We just don't want the checkpoint caching service used on rank != 0. This was actually a bug back in #4463, ENOSYS should only be returned when the backing module isn't loaded on rank == 0.

chu11 avatar Aug 22 '22 20:08 chu11

re-pushed, addressing all of the comments above except the discussion about if content.flush should be an error or not. Of particular note:

  • created a bunch of new functions to deal with the cleanup paths better. Those mem-leaks / double frees were embarrasing :P
  • add some more tests per comments above (reload kvs, forwarding from rank != 0 works as intended)

chu11 avatar Aug 22 '22 21:08 chu11

It might be a good idea to pause and question whether the "flush" operation should fail when there is no backing store. Maybe we should redefine it as "flush to backing store, if any"? I know that would require other content tests to be updated. I'm not sure if that's the right answer or not, but if it is, it could also enable the "startlog" stuff to remain in the rc files, and an instance that is restarted several times using dump/restore could have a startlog like one that restarts with a backing store.

Hmmmm, my initial feeling is that when someone does content.flush there should be an error returned. Presumably you'd want to know that things weren't backed up properly?

Here's a thought, could content.flush take an argument that is something like --quiet? i.e. don't return an error if the backing isn't there?

Alternately, we could simply add some options to startlog, restore, etc. to not flush when the option is set, and we could set the option when backing module == none.

Edit: not necessarily for this PR, could be a follow up one

chu11 avatar Aug 22 '22 21:08 chu11

argh, re-pushed, fixed up a mem-leak and fixed some bash-isms in my tests that were affecting the CI

chu11 avatar Aug 23 '22 00:08 chu11

Codecov Report

Merging #4492 (e9cb6c8) into master (11b0680) will decrease coverage by 0.02%. The diff coverage is 77.43%.

:exclamation: Current head e9cb6c8 differs from pull request most recent head 08ffb5f. Consider uploading reports for the commit 08ffb5f to get more accurate results

@@            Coverage Diff             @@
##           master    #4492      +/-   ##
==========================================
- Coverage   83.36%   83.34%   -0.03%     
==========================================
  Files         401      402       +1     
  Lines       67649    67771     +122     
==========================================
+ Hits        56397    56481      +84     
- Misses      11252    11290      +38     
Impacted Files Coverage Δ
src/broker/content-checkpoint.c 76.71% <76.71%> (ø)
src/broker/content-cache.c 85.83% <100.00%> (+0.09%) :arrow_up:
src/modules/content-files/content-files.c 77.43% <0.00%> (-1.83%) :arrow_down:
src/modules/job-archive/job-archive.c 62.62% <0.00%> (-0.70%) :arrow_down:
src/modules/job-info/guest_watch.c 76.21% <0.00%> (-0.55%) :arrow_down:
src/cmd/builtin/restore.c 87.50% <0.00%> (-0.38%) :arrow_down:
src/common/libsubprocess/subprocess.c 87.89% <0.00%> (-0.30%) :arrow_down:
src/modules/kvs/kvs.c 70.46% <0.00%> (-0.14%) :arrow_down:
src/broker/overlay.c 86.39% <0.00%> (-0.11%) :arrow_down:
src/cmd/flux-job.c 87.50% <0.00%> (+0.12%) :arrow_up:
... and 5 more

codecov[bot] avatar Aug 23 '22 04:08 codecov[bot]

and re-pushed, fixing a s3 CI issue

chu11 avatar Aug 23 '22 05:08 chu11

re-pushed, building on top of #4519, adjust some tests as a result of #4519

chu11 avatar Aug 24 '22 20:08 chu11

rebased and re-pushed now that #4519 is merged, this PR is a lot smaller now :-)

chu11 avatar Aug 25 '22 17:08 chu11

re-started one builder that had a ton of failures not related to this PR. assumption is workflow/container borked and affected bunch of tests.

chu11 avatar Aug 27 '22 02:08 chu11

rebased & re-pushed, mergifyio seemed to get stuck on something

chu11 avatar Aug 27 '22 03:08 chu11