flux-core icon indicating copy to clipboard operation
flux-core copied to clipboard

content: require backing store for checkpoint

Open chu11 opened this issue 1 year ago • 1 comments

Problem: A backing store is required for content.flush but it is not required for content.checkpoint-put. This is inconsistent and can lead to checkpointing problems done the line.

Require content.checkpoint-put to only work if there is a backing store available. As a consequence, remove code that handled "cached" checkpoints when a backing store is not available.

chu11 avatar Sep 03 '24 23:09 chu11

rebased and re-pushed, just didn't want this series of fixes to be forgotten :-) (along w/ #6240 and #6260)

chu11 avatar Dec 18 '24 18:12 chu11

I guess one caveat to this change is that, if running without a backing store, you could no longer reload the KVS module without losing all the data. Do we care about that?

garlick avatar Dec 18 '24 18:12 garlick

I guess one caveat to this change is that, if running without a backing store, you could no longer reload the KVS module without losing all the data. Do we care about that?

I think it's ok?

The reason for this fix was simply the inconsistency. a content-flush + checkpoint-put is (will always be?) a combination to be done. One of those can work without a backing store right now, but the other one can't. So it sort of doesn't make sense.

So presumably the alternative would be to try and support content-flush and checkpoint-put if a backing store isn't loaded?

chu11 avatar Dec 18 '24 19:12 chu11

So what happens if you call flux_kvs_commit(FLUX_KVS_SYNC) or kvs_checkpoint_commit() without a backing store?

I'm open to requiring a backing store if that cleans things up a bit, e.g. makes those functions work as they should for all configurations. I'm not sure we found there was much benefit to running without one and if I recall we have the ability to tell content-sqlite to run from memory only.

garlick avatar Jan 31 '25 21:01 garlick

So what happens if you call flux_kvs_commit(FLUX_KVS_SYNC) or kvs_checkpoint_commit() without a backing store?

FLUX_KVS_SYNC will fail, b/c content.flush fails when there is no backing store. A FLUX_KVS_SYNC is a content.flush followed by a checkpoint_commit().

kvs_checkpoint_commit() right now can work w/o a backing store, basically caching the checkpoint in memory.

The inconsistency between content.flush and checkpoint-put is what started this and the follow up PRs. Basically we want to always content.flush + checkpoint-put, yet one will fail without a backing module and the other won't. In addition, I think that checkpoint-put not requiring a backing store is sort of the opposite of what you would think checkpoint-put does.

chu11 avatar Jan 31 '25 23:01 chu11

Let's get this in and we can decide where we want to go from here.

Sounds good to me.

For those following along, we've already had some thoughts / discussion on this #6630

chu11 avatar Apr 09 '25 17:04 chu11

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 83.83%. Comparing base (b9ea17a) to head (17ff530). Report is 238 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #6255   +/-   ##
=======================================
  Coverage   83.83%   83.83%           
=======================================
  Files         535      535           
  Lines       89397    89313   -84     
=======================================
- Hits        74942    74873   -69     
+ Misses      14455    14440   -15     
Files with missing lines Coverage Δ
src/modules/content/cache.c 85.43% <ø> (-0.03%) :arrow_down:
src/modules/content/checkpoint.c 77.86% <100.00%> (-1.16%) :arrow_down:

... and 7 files with indirect coverage changes

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar Jul 08 '25 18:07 codecov[bot]