flux-core icon indicating copy to clipboard operation
flux-core copied to clipboard

shell: determine node & core resource counts from R during init

Open sam-maloney opened this issue 10 months ago • 6 comments

Problem: counts for resources are currently being parsed only from jobspec, but if/once ranges are used then values cannot be determined.

Modify shell init data flow to use the actual allocation from the scheduler in R, if present, to determine resource counts for nodes & cores in jobspec_parse.

As a first step, this checks whether a resource set struct was passed to jobspec_parse, and otherwise leaves the original codepath unchanged. The slot count must still be parsed from the jobspec regardless, because it is missing from R.

The concept for this is also essentially separated off from #6632, albeit with a fair bit of re-working to keep the changes as minimal as possible. Because flux-sched already has basic support for ranges, this PR already allows jobspec with ranges for nodes and/or cores to run successfully.

sam-maloney avatar Mar 13 '25 08:03 sam-maloney

This is great! The changes look minimal and understandable so I'll try to get a review done soon. Thanks!

grondo avatar Mar 13 '25 14:03 grondo

@grondo perhaps a gentle nudge for a review now that the May release is complete :)

sam-maloney avatar May 08 '25 05:05 sam-maloney

Thanks for the nudge @sam-maloney!

This looks good to me. However, one observation is that it appears the "old" code path (recursive_parse_jobspec_resources () which gathers the counts via jobspec only) is now unused except for the unit tests in src/shell/test/jobspec.c. I wonder if we should just drop that code and update the unit tests to provide an R?

What do you think? It is also possible I misread the code and there is some use case still present outside of the unit test. Even in that case though, it might be nice to have a structure to test with provided simple R/rcalc values.

grondo avatar May 08 '25 13:05 grondo

@grondo thanks for having a look!

After a quick search I actually think you're absolutely correct that the only use of the old code path would now be the test suite; I think my original intent was to keep the changes absolutely minimal, but if you think the approach overall seems sound then I am happy to drop the old code and put in some time to update the unit tests.

Let me know if you have any other thoughts, otherwise I'll ping you when I have an update ready!

sam-maloney avatar May 08 '25 14:05 sam-maloney

@grondo alright, I think I have a working update ready; tests are passing on my machine so hopefully the CI agrees!

I've removed the unused code, and added a simple resource set to each test which matches with the resources requested by the jobspec being tested (or at least something reasonable for tests that are supposed to fail).

I've completely removed a number of the bad tests that were checking for certain disallowed resource configurations, since part of the argument driving this work is that the shell shouldn't really be parsing the jobspec resources but rather leave it to the validator and/or scheduler.

The only previously good test which now fails (and I've also removed) was the "node[2]->socket[3]->slot[5]->core[3]". Again, since the impetus for this is allowing ranges to be specified, I'm intentionally avoiding the assumption that any of the counts are integers (other than for the slots), so having a resource with a non-unity count between the node and slot (if node is given) means there's no easy way to know the multiplier. This (hopefully non-deal-breaking) restriction is documented in the comment above the recursive parsing function.

And fingers crossed that down the road, if/once slot information can be included in R, then the remaining resource parsing can also be dropped entirely!

sam-maloney avatar May 09 '25 15:05 sam-maloney

Apologies @sam-maloney, I dropped the ball here :disappointed:

Once #6930 is merged we can rebase and merge this one too.

grondo avatar Jul 24 '25 16:07 grondo

Apologies @sam-maloney, I dropped the ball here 😞

No worries @grondo; with our project officially starting it's had the paradoxical effect of diverting me from working on flux with lots of meetings and admin work as everything gets going, haha!

sam-maloney avatar Jul 25 '25 13:07 sam-maloney

Codecov Report

:x: Patch coverage is 97.56098% with 1 line in your changes missing coverage. Please review. :white_check_mark: Project coverage is 83.91%. Comparing base (bd1ece4) to head (6ab3719). :warning: Report is 700 commits behind head on master.

Files with missing lines Patch % Lines
src/shell/info.c 66.66% 1 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6705      +/-   ##
==========================================
+ Coverage   83.89%   83.91%   +0.01%     
==========================================
  Files         540      540              
  Lines       90707    90676      -31     
==========================================
- Hits        76099    76091       -8     
+ Misses      14608    14585      -23     
Files with missing lines Coverage Δ
src/shell/jobspec.c 94.73% <100.00%> (+12.99%) :arrow_up:
src/shell/info.c 75.86% <66.66%> (ø)

... and 9 files with indirect coverage changes

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar Oct 26 '25 23:10 codecov[bot]