etc: support content.backing-module=none
Per discussion in #4267, always loading the backing module content-sqlite can lead to performance issues, especially for shorter lived single user instances. Default to not loading a backing module.
- to enable this, cache checkpoint put/get in the broker even when a backing module does not exist. This allows a number of checkpoint operations to still work even if a backing module isn't loaded.
- update rc1/rc3 to not load the checkpoint module by default
- update systemd service file to default configure loading
content-sqlite - update tests that require content-sqlite to specifically configure for it
- update systemd service file to default configure loading
one possible remaining gotcha is that if the user doesn't set a backing module, they are still allowed to load one and that becomes the backing module. Dunno if we'd like to support a special backing module config of "none" (or equivalent word) to say "do not allow a backing module to be loaded"?
In #4267
I'm not sure about making this is the default, since there is unexplained job throughput degradation in this mode. However it is not currently possible to even select this mode. Let's allow this issue to be closed once setting
content.backing-module=noneworks as you'd expect.
Could we cut this PR there and save the discussion about making it the default for another time?
Could we cut this PR there and save the discussion about making it the default for another time?
ahhh, I missed that. I'll tweak the PR. Should be simpler as a result as many testsuite updates don't need to happen now.
One other thought is would it make sense to split the broker checkpoint stuff off to another source file? content-cache.c is already long and complex and this seems to be somewhat disjoint.
re-pushed
- split out content checkpoint code into another file
- the "context' that the
content-checkpointAPI creates is actually stored withinstruct content_cacheand notstruct broker. I didn't initially intend for this, butcontent-cacheneeds to callcontent-checkpointat one point, so I felt it better to do it this way.
- the "context' that the
- rc scripts do not load
noneby default now
rc scripts do not load none by default now
You are probably already on it, but just in case: don't forget to update PR description if it no longer describes the overall change.
Edit: just realized this checkpoint service is created on all ranks, but I think it should only be created on rank 0? If for some reason it is accessed from other ranks, it should just let the broker forward the requests to rank 0 (which is what happens if the message handlers are only registered on rank 0).
Oh good catch, although I think we want to load the checkpoint service all of the time. We just don't want the checkpoint caching service used on rank != 0. This was actually a bug back in #4463, ENOSYS should only be returned when the backing module isn't loaded on rank == 0.
re-pushed, addressing all of the comments above except the discussion about if content.flush should be an error or not. Of particular note:
- created a bunch of new functions to deal with the cleanup paths better. Those mem-leaks / double frees were embarrasing :P
- add some more tests per comments above (reload kvs, forwarding from rank != 0 works as intended)
It might be a good idea to pause and question whether the "flush" operation should fail when there is no backing store. Maybe we should redefine it as "flush to backing store, if any"? I know that would require other content tests to be updated. I'm not sure if that's the right answer or not, but if it is, it could also enable the "startlog" stuff to remain in the rc files, and an instance that is restarted several times using dump/restore could have a startlog like one that restarts with a backing store.
Hmmmm, my initial feeling is that when someone does content.flush there should be an error returned. Presumably you'd want to know that things weren't backed up properly?
Here's a thought, could content.flush take an argument that is something like --quiet? i.e. don't return an error if the backing isn't there?
Alternately, we could simply add some options to startlog, restore, etc. to not flush when the option is set, and we could set the option when backing module == none.
Edit: not necessarily for this PR, could be a follow up one
argh, re-pushed, fixed up a mem-leak and fixed some bash-isms in my tests that were affecting the CI
Codecov Report
Merging #4492 (e9cb6c8) into master (11b0680) will decrease coverage by
0.02%. The diff coverage is77.43%.
:exclamation: Current head e9cb6c8 differs from pull request most recent head 08ffb5f. Consider uploading reports for the commit 08ffb5f to get more accurate results
@@ Coverage Diff @@
## master #4492 +/- ##
==========================================
- Coverage 83.36% 83.34% -0.03%
==========================================
Files 401 402 +1
Lines 67649 67771 +122
==========================================
+ Hits 56397 56481 +84
- Misses 11252 11290 +38
| Impacted Files | Coverage Δ | |
|---|---|---|
| src/broker/content-checkpoint.c | 76.71% <76.71%> (ø) |
|
| src/broker/content-cache.c | 85.83% <100.00%> (+0.09%) |
:arrow_up: |
| src/modules/content-files/content-files.c | 77.43% <0.00%> (-1.83%) |
:arrow_down: |
| src/modules/job-archive/job-archive.c | 62.62% <0.00%> (-0.70%) |
:arrow_down: |
| src/modules/job-info/guest_watch.c | 76.21% <0.00%> (-0.55%) |
:arrow_down: |
| src/cmd/builtin/restore.c | 87.50% <0.00%> (-0.38%) |
:arrow_down: |
| src/common/libsubprocess/subprocess.c | 87.89% <0.00%> (-0.30%) |
:arrow_down: |
| src/modules/kvs/kvs.c | 70.46% <0.00%> (-0.14%) |
:arrow_down: |
| src/broker/overlay.c | 86.39% <0.00%> (-0.11%) |
:arrow_down: |
| src/cmd/flux-job.c | 87.50% <0.00%> (+0.12%) |
:arrow_up: |
| ... and 5 more |
and re-pushed, fixing a s3 CI issue
re-pushed, building on top of #4519, adjust some tests as a result of #4519
rebased and re-pushed now that #4519 is merged, this PR is a lot smaller now :-)
re-started one builder that had a ton of failures not related to this PR. assumption is workflow/container borked and affected bunch of tests.
rebased & re-pushed, mergifyio seemed to get stuck on something