flux-core icon indicating copy to clipboard operation
flux-core copied to clipboard

WIP: replace cleanup scriptlets with shutdown script

Open garlick opened this issue 1 year ago • 3 comments

This makes some improvement to the shutdown process:

  • eliminate the flux admin cleanup-push method of registering cleanup tasks in rc1, and instead have a shutdown script similar to other rc scripts
  • add flux shutdown --force which causes the shutdown script to be called with the -f option
  • in the shutdown script, add a 10s timeout to flux queue idle when -f is present
  • in the shutdown script, unload the perilog module before canceling jobs to skip the epilog (@grondo's idea - a good one IMHO)

Todo:

  • tests need to be written
  • document in man pages and admin guide
  • consider running flux shutdown --force in systemctl stop flux
  • I think flux-accounting's use of the "job" test personality is going to be a problem now that flux admin is removed so deal with that

I thought maybe a pause at this point to get feedback would be good. This had a bigger impact on tests than I was expecting, so maybe I took a wrong turn somewhere.

garlick avatar Apr 05 '24 02:04 garlick

Fixed some stuff

  • add shutdown script to dist manifest (whoops)
  • dropped commits that remove flux admin cleanup-push
  • temporarily add backwards compat code for cleanup scripts (avoids breaking flux-coral2 and flux-accounting)

garlick avatar Apr 05 '24 15:04 garlick

I dropped the actual changes to the shutdown process from this PR (including shutdown --force) leaving only the switch from cleanup-push to a shutdown script since that change alone is a pretty big one.

I think the script is going to be far more maintainable than the "cleanup scriptlets" pushed by rc1, and it opens the possibility of letting flux shutdown pass options to the script. For example, we may want to implement an option for shutdown at a future time that gives jobs more of a soft warning so they can checkpoint. Maybe the --force option should be reserved for when we don't automatically cancel running jobs at shutdown because they can be recovered.

A downside is that this impacts the tests that set rc1,rc3, including some in other projects. (Thus cleanup-push remains as an alias until the other projects can transition)

garlick avatar Apr 09 '24 16:04 garlick

Codecov Report

Merging #5860 (587573b) into master (a7d5bb5) will decrease coverage by 0.05%. Report is 2 commits behind head on master. The diff coverage is 57.89%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5860      +/-   ##
==========================================
- Coverage   83.31%   83.27%   -0.05%     
==========================================
  Files         514      514              
  Lines       82801    82827      +26     
==========================================
- Hits        68986    68974      -12     
- Misses      13815    13853      +38     
Files Coverage Δ
src/common/libflux/conf.c 85.40% <ø> (ø)
src/broker/state_machine.c 80.91% <50.00%> (+0.73%) :arrow_up:
src/broker/runat.c 73.61% <62.50%> (-6.05%) :arrow_down:
src/broker/broker.c 77.92% <56.25%> (-0.27%) :arrow_down:

... and 14 files with indirect coverage changes

codecov[bot] avatar Apr 17 '24 20:04 codecov[bot]