flux-sched icon indicating copy to clipboard operation
flux-sched copied to clipboard

modules: Move feasibility/satisfiability checking into a new module

Open jacobtkeio opened this issue 1 year ago • 1 comments

Problem: Feasibility checking takes up a significant amount of sched-fluxion-resource's time that could be better spent on scheduling. Additionally, feasibility checking is not RFC 27 compliant. Solution: Move feasibility checking into a new module, sched-fluxion-feasibility, that can run on multiple ranks and make it RFC 27 compliant.

jacobtkeio avatar Aug 30 '24 21:08 jacobtkeio

Are we still OK with leaving in the uncovered code in resource_match.cpp:select_subsystems? https://github.com/jacobtkeio/flux-sched/blob/a80379f98716f28e623f9899b705c4232d04e469/resource/modules/resource_match.cpp#L1234

jacobtkeio avatar Jul 25 '25 20:07 jacobtkeio

I pushed these last two t4014 changes at the suggestion of @wihobbs. They cover a bit of feasibility checking through the job validator and flux submit (issue 1383)

jacobtkeio avatar Jul 29 '25 23:07 jacobtkeio

Are we still OK with leaving in the uncovered code in resource_match.cpp:select_subsystems?

Yes, my preference is to leave it in so that when subsystems are working correctly we won't have to remember to re-add the check.

milroy avatar Aug 05 '25 00:08 milroy

How did you test this? I added the change in this commit to t1020 in current master and the test still passes.

Running feasibility at all will reject

Edit: I accidentally left a review comment rather than commenting on the specific commit. Adding a link to the commit I'm referencing: 7da3e16

Separating feasibility into a new module fixes this.

How did you test this? I added the change in this commit to t1020 in current master and the test still passes.

I'll chime in on this since I originally reported #1383.

Separating feasibility into a new module doesn't fix #1383, running feasibility in the validator at all will fix that issue, as infeasible jobs will be rejected before the job reaches DEPEND.

That issue references a bug I found while testing this PR, that if the job reaches SCHED, a large number of infeasible jobs will cause Fluxion to not schedule anything at all. For an example of that, I posted one in the issue.

After a quick look I think t1020 is running with feasibility enabled in the validator, so perhaps adding an extra test in t4014 isn't necessary. (If I gave you bad advice @jacobtkeio, sorry!) Or perhaps we should add a test of many infeasible jobs with the validator disabled to make sure that the sched loop can reject the jobs if they reach that point (after the issue is solved).

wihobbs avatar Aug 05 '25 01:08 wihobbs

Yes, this is my mistake (you did not give bad advice!) made in haste. I will remove this commit, and in the future I will cherry-pick my testing commits into master to see if they are relevant. Sorry for the hassle!

jacobtkeio avatar Aug 05 '25 16:08 jacobtkeio

This last force push reduced the subject line of 137c67e below 50 characters.

jacobtkeio avatar Aug 11 '25 21:08 jacobtkeio

This last force push reduced the subject line of https://github.com/flux-framework/flux-sched/commit/137c67ed8ffe76f7f0c6b88372848dea0aaf16b3 below 50 characters.

Just an FYI - the 50 character limit should just be a warning -- sometimes it is impossible to make the subject line that small.

grondo avatar Aug 11 '25 22:08 grondo

Since the approvals were dismissed due to a rebase I'll approve and add MWP.

wihobbs avatar Aug 13 '25 16:08 wihobbs

Codecov Report

:x: Patch coverage is 75.49180% with 299 lines in your changes missing coverage. Please review. :white_check_mark: Project coverage is 77.1%. Comparing base (1a043ff) to head (b5389aa). :warning: Report is 63 commits behind head on master.

Files with missing lines Patch % Lines
resource/modules/resource.cpp 74.4% 219 Missing :warning:
resource/modules/feasibility.cpp 73.5% 56 Missing :warning:
resource/modules/resource_match.cpp 83.3% 24 Missing :warning:
Additional details and impacted files
@@           Coverage Diff            @@
##           master   #1285     +/-   ##
========================================
+ Coverage    77.0%   77.1%   +0.1%     
========================================
  Files         111     114      +3     
  Lines       16100   16320    +220     
========================================
+ Hits        12397   12585    +188     
- Misses       3703    3735     +32     
Files with missing lines Coverage Δ
qmanager/modules/qmanager.cpp 74.3% <ø> (+3.9%) :arrow_up:
resource/modules/resource_match.hpp 100.0% <100.0%> (ø)
resource/modules/resource_match.cpp 67.9% <83.3%> (-3.1%) :arrow_down:
resource/modules/feasibility.cpp 73.5% <73.5%> (ø)
resource/modules/resource.cpp 74.4% <74.4%> (ø)

... and 3 files with indirect coverage changes

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar Nov 11 '25 16:11 codecov[bot]