flux-sched
test impact of using rv1 vs rv1_nosched on instance performance
In a planning meeting, the idea of running with rv1 match format enabled in production was discussed as a stopgap solution for #991. However, the performance or other impact due to that change was not known. We should characterize any impact due to this configuration so we can make decisions based on results.
Here's a first attempt at a parameter study that investigates different job sizes and counts with rv1 vs rv1_nosched.
Each test instance runs the script found at the bottom of this comment with different parameters for match-format, number of jobs, and nodes per job (all jobs allocate nodes exclusively, as would be the case for a system instance).
The suite of tests was launched on corona via the following bulksubmit invocation:
flux bulksubmit -n1 -c4 --watch --shuffle \
--output=results.{seq} \
--env=NJOBS={0} --env=NNODES={1} --env=MATCH_FORMAT={2} \
flux start ./test.sh \
::: 10 100 1000 2000 8000 \
::: 1 10 100 1000 \
::: rv1 rv1_nosched
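That sweep covers 5 × 4 × 2 = 40 parameter combinations, each one running the script in its own single-broker test instance.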
A couple of the parameter combinations caused an issue that may need to be investigated. The nodes/job=1000 cases for njobs=1000, 2000, and 8000 all failed due to running out of space in /var/tmp. This occurred even though the content store db was only around a GB, because the rank 0 broker RSS was 160GB. Note that in this case errors were logged but the instance just stopped processing jobs.
Other instances reached a maximum RSS of ~2G, so I'm not yet sure what the issue was (more investigation needed!). Perhaps we are caching R in memory somewhere with the full scheduling key intact. Note in the data below that a 1000-node exclusive R takes 17MiB - that is one R object :astonished:.
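For follow-up runs, a rough monitoring loop like the one below could help pin down where the memory goes. This is only a sketch, not part of the test script that follows; it relies on the broker.pid attribute and the same content-sqlite stats the script already reports.

# Hypothetical monitoring loop: sample the rank 0 broker RSS and the content
# store size once a minute while the test is running.
while sleep 60; do
    rss_kib=$(ps -o rss= -p "$(flux getattr broker.pid)")
    db_size=$(flux module stats content-sqlite | jq .dbfile_size)
    printf "rss=%s KiB dbfile_size=%s bytes\n" "$rss_kib" "$db_size"
done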
#!/bin/bash
MATCH_FORMAT=${MATCH_FORMAT:-rv1}
NJOBS=${NJOBS:-100}
NNODES=${NNODES:-16}
printf "MATCH_FORMAT=${MATCH_FORMAT} NJOBS=$NJOBS NODES/JOB=$NNODES\n"
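# Remove the currently loaded scheduler and resource modules so they can be
# reconfigured and reloaded below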
flux module remove sched-fluxion-qmanager
flux module remove sched-fluxion-resource
flux module remove resource
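# Configure a fake 2000-node system with batch and debug queues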
flux config load <<EOF
[sched-fluxion-qmanager]
queue-policy = "easy"
[sched-fluxion-resource]
match-format = "$MATCH_FORMAT"
[queues.debug]
requires = ["debug"]
[queues.batch]
requires = ["batch"]
[resource]
noverify = true
norestrict = true
[[resource.config]]
hosts = "test[0-1999]"
cores = "0-47"
gpus = "0-8"
[[resource.config]]
hosts = "test[0-1899]"
properties = ["batch"]
[[resource.config]]
hosts = "test[1900-1999]"
properties = ["debug"]
EOF
flux config get | jq '."sched-fluxion-resource"'
flux module load resource noverify monitor-force-up
flux module load sched-fluxion-resource
flux module load sched-fluxion-qmanager
flux queue start --all --quiet
flux resource list
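# Time the submission (and mock execution) of NJOBS exclusive NNODES-node jobs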
t0=$(date +%s.%N)
flux submit -N$NNODES --queue=batch --cc=1-$NJOBS \
--setattr=exec.test.run_duration=1ms \
--quiet --wait hostname
ELAPSED=$(echo $(date +%s.%N) - $t0 | bc -l)
THROUGHPUT=$(echo $NJOBS/$ELAPSED | bc -l)
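# Report R size for the last job plus content store object count and db size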
R_SIZE=$(flux job info $(flux job last) R | wc -c)
OBJ_COUNT=$(flux module stats content-sqlite | jq .object_count)
DB_SIZE=$(flux module stats content-sqlite | jq .dbfile_size)
printf "%-12s %5d %4d %8.2f %8.2f %12d %12d %12d\n" \
$MATCH_FORMAT $NJOBS $NNODES $ELAPSED $THROUGHPUT \
$R_SIZE $OBJ_COUNT $DB_SIZE
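For a single one-off run outside of bulksubmit, the same script can be driven like this (the sizes here are arbitrary, following the pattern used by the sweep above):

NJOBS=100 NNODES=10 MATCH_FORMAT=rv1 flux start ./test.sh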
Here are the initial results. The rv1_nosched njobs=8000 nodes/job=1000 case ran out of time after 8hrs.
FORMAT NJOBS SIZE RUNTIME JPS R_SIZE NOBJECTS DB_SIZE
rv1_nosched 10 1 0.67 14.92 215 430 229376
rv1_nosched 10 1000 16.75 0.60 229 536 249856
rv1_nosched 10 10 0.66 15.10 232 480 249856
rv1_nosched 10 100 1.20 8.33 226 504 258048
rv1_nosched 100 1 1.20 83.55 215 3668 2170880
rv1_nosched 100 10 1.64 60.86 226 4207 2441216
rv1_nosched 100 1000 338.28 0.30 229 4983 2899968
rv1_nosched 100 100 86.50 1.16 232 4691 3043328
rv1_nosched 1000 1 8.43 118.68 212 38020 23445504
rv1_nosched 1000 10 803.42 1.24 232 44892 33071104
rv1_nosched 1000 100 1237.58 0.81 232 46465 37584896
rv1_nosched 1000 1000 4014.96 0.25 229 50130 44744704
rv1_nosched 2000 1 21.52 92.95 215 75639 49369088
rv1_nosched 2000 10 1811.92 1.10 226 97349 82919424
rv1_nosched 2000 100 2497.18 0.80 232 99770 86978560
rv1_nosched 2000 1000 8451.37 0.24 229 101053 98111488
rv1_nosched 8000 1 5525.57 1.45 215 335169 293126144
rv1_nosched 8000 100 11872.30 0.67 232 382905 397213696
rv1_nosched 8000 10 9500.90 0.84 232 396622 425652224
FORMAT NJOBS SIZE RUNTIME JPS R_SIZE NOBJECTS DB_SIZE
rv1 10 1 0.70 14.37 18961 476 282624
rv1 10 10 0.99 10.09 184848 552 491520
rv1 10 100 4.81 2.08 1808642 619 2146304
rv1 100 1 1.51 66.37 18961 4174 2850816
rv1 100 10 5.14 19.47 181352 4773 4882432
rv1 10 1000 92.67 0.11 18228323 653 18608128
rv1 100 100 165.15 0.61 1808642 5552 21889024
rv1 1000 1 10.69 93.56 18609 40268 30941184
rv1 2000 1 26.25 76.20 18961 79640 59781120
rv1 1000 10 898.83 1.11 182538 52837 62410752
rv1 2000 10 2069.59 0.97 181352 107071 133500928
rv1 100 1000 1134.78 0.09 18228323 6213 187297792
rv1 1000 100 2127.79 0.47 1820448 58699 232325120
rv1 8000 1 5594.02 1.43 18730 339055 327118848
rv1 2000 100 4472.80 0.45 1808642 113545 470863872
rv1 8000 10 10955.79 0.73 182538 446144 689192960
rv1 8000 100 19206.11 0.42 1791884 459748 2043412480
Ran similar tests without node exclusive matching. All I did was change -N NNODES to -n 48*NNODES (see the sketch after the table below). Similar results, but a few of the larger rv1 cases OOMed, so I didn't get as many results. I guess all this shows is that node exclusive scheduling isn't the problem with the very slow scheduling here:
FORMAT NJOBS SIZE RUNTIME JPS R_SIZE NOBJECTS DB_SIZE
rv1_nosched 10 1 1.47 6.81 201 436 233472
rv1_nosched 10 10 1.50 6.68 218 462 241664
rv1_nosched 10 1000 40.49 0.25 215 533 249856
rv1_nosched 10 100 2.11 4.74 212 518 266240
rv1 10 1 1.45 6.89 16140 459 270336
rv1 10 10 1.78 5.61 156764 548 450560
rv1 10 100 5.07 1.97 1533328 617 1843200
rv1_nosched 100 1 9.58 10.44 201 3915 2404352
rv1_nosched 100 10 10.37 9.64 212 4158 2420736
rv1_nosched 100 100 567.36 0.18 218 4659 2691072
rv1_nosched 100 1000 1915.72 0.05 215 5011 2760704
rv1 100 1 9.86 10.14 16140 4146 2822144
rv1 100 10 13.15 7.61 153808 4474 4370432
rv1 10 1000 106.01 0.09 15452800 653 15654912
rv1 100 100 648.12 0.15 1533328 5634 19181568
rv1_nosched 1000 1 92.45 10.82 198 41103 29061120
rv1_nosched 1000 1000 19009.78 0.05 0 36650 31846400
rv1 1000 1 94.45 10.59 15842 41364 32202752
rv1_nosched 1000 10 5944.04 0.17 218 45445 33878016
rv1_nosched 1000 100 8331.68 0.12 212 46789 37441536
rv1 1000 10 6220.77 0.16 154814 49858 58167296
rv1_nosched 2000 1 817.74 2.45 201 86857 65691648
rv1 2000 1 812.90 2.46 16140 87815 70770688
rv1_nosched 2000 10 13523.46 0.15 212 94005 72663040
rv1_nosched 2000 100 17157.62 0.12 218 94769 74760192
rv1 2000 10 13874.15 0.14 153808 101857 122454016
rv1 100 1000 2848.30 0.04 15452800 6160 157343744
rv1 1000 100 9320.08 0.11 1533328 56520 202543104
rv1 2000 100 18836.20 0.11 1543334 114569 417652736
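For reference, the non-exclusive change amounts to swapping the submit line in the test script for something like the following sketch (the $((48*NNODES)) expansion is my guess at the exact form, based on the 48-core fake nodes configured above):

# Non-exclusive variant of the submit line; the rest of the script is unchanged.
flux submit -n $((48*NNODES)) --queue=batch --cc=1-$NJOBS \
    --setattr=exec.test.run_duration=1ms \
    --quiet --wait hostname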
I was going to try this same set of experiments with sched-simple, but it turns out this test setup won't work with the simple scheduler. This is because the instance is still waiting for ranks 1-1999 to come up, so only one rank's worth of resources is available.
@garlick and I were wondering after the fact how this happens to work with Fluxion.
It turns out that when Fluxion marks "all" resources down, it uses the instance size to mark ranks 0-(size-1) down. This particular test uses an instance size of 1, so only one resource actually gets set down (rank 0, and then it immediately gets marked up).
https://github.com/flux-framework/flux-sched/blob/32f74d6260c033d32d4af606b60a0eef4c1dbfd7/resource/modules/resource_match.cpp#L1114-L1132
That sounds like a bug, if ironically a useful one for testing this.
Agree. Issue opened: #1040
Ok, figured out the core resource module can be loaded with the monitor-force-up option and was able to run these same tests with sched-simple, just for comparison (see the module-swap sketch after the table):
FORMAT NJOBS SIZE RUNTIME JPS R_SIZE NOBJECTS DB_SIZE
sched-simple 10 10 0.64 15.55 143 459 233472
sched-simple 10 1 0.61 16.27 133 457 241664
sched-simple 10 100 0.79 12.60 147 512 262144
sched-simple 10 1000 6.84 1.46 143 584 274432
sched-simple 100 1 1.31 76.21 133 4071 2514944
sched-simple 100 10 1.50 66.66 143 4340 2629632
sched-simple 100 100 3.21 31.12 151 4867 3047424
sched-simple 100 1000 70.13 1.43 143 5531 3964928
sched-simple 1000 1 8.84 113.06 133 40972 30142464
sched-simple 1000 10 10.63 94.08 143 44390 32665600
sched-simple 1000 100 28.37 35.25 147 50234 43425792
sched-simple 1000 1000 717.07 1.39 143 55068 46710784
sched-simple 2000 1 16.86 118.59 133 82342 62046208
sched-simple 2000 10 20.46 97.75 143 86788 66076672
sched-simple 2000 100 56.88 35.16 151 100292 91693056
sched-simple 2000 1000 1425.18 1.40 143 110692 111550464
sched-simple 8000 1 69.32 115.40 135 309652 252616704
sched-simple 8000 10 84.06 95.17 143 335359 299745280
sched-simple 8000 100 218.65 36.59 147 377497 396423168
sched-simple 8000 1000 5736.57 1.39 143 448211 602529792
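For reference, the module swap for these sched-simple runs looks roughly like this sketch (assuming, as in the original script, that the Fluxion modules were loaded by the rc scripts and that sched-simple is not already loaded):

# Replace the Fluxion scheduler stack with sched-simple; the fake-resource
# config load is the same as in the test script above.
flux module remove sched-fluxion-qmanager
flux module remove sched-fluxion-resource
flux module remove resource
flux module load resource noverify monitor-force-up
flux module load sched-simple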
FYI - I edited the test script above to add the monitor-force-up option to flux module load resource. This will be required once #1042 is merged.
@grondo this might be a separate issue, but would it be possible to mock the state of nodes too? E.g., that some subset in the list is DOWN? The context here is for bursting - we want to mock nodes that don't exist as DOWN and then provide some actually existing nodes (so hopefully we can accomplish something similar without the entire thing being a mock!)
The default state of nodes is down. Is there a situation where they need to be forced down after having been mocked up or actually up? (like shrinking back down)? Anyway not a scheduler issue per se so I'd suggest opening a flux-core issue (if there is an issue).
If the default state is down, in these examples how do they fake run? Where is the logic happening that allows them to do that?
This option, --setattr=exec.test.run_duration=1ms, says: instead of actually running the job, just sleep for a millisecond.
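As a minimal standalone example (using the same options as the test script above), this "runs" a one-node job for a millisecond without launching a real task:

flux submit -N1 --setattr=exec.test.run_duration=1ms --quiet --wait hostname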
Yes! I derived that from here: https://github.com/flux-framework/flux-core/blob/49c59b57bb99852831745cd4cc1052eb56194365/src/modules/job-exec/testexec.c#L68 but I don't understand how it actually works to allow it to run (and schedule on nodes that don't actually exist) and not just determine that the resources are not available. I think maybe I'm asking about a level of detail deeper than that attribute?
The default state of nodes in the scheduler is supposed to be down until the resource.acquire protocol says they are up. A bug in Fluxion (discussed above) set only the first node (of 2000 being simulated here) down, then when that broker came up that node was marked UP and all resources then appeared UP through happenstance.
To really force resources up, the resource module monitor-force-up option is required.
(assuming that is what you were actually asking about?)
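Concretely, that's what these lines from the test script above accomplish; with monitor-force-up the scheduler sees every configured execution target as up right away:

flux module load resource noverify monitor-force-up
flux resource list    # all 2000 fake nodes report as "free", none "down"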
> The default state of nodes in the scheduler is supposed to be down until the resource.acquire protocol says they are up. A bug in Fluxion (discussed above) set only the first node (of 2000 being simulated here) down, then when that broker came up that node was marked UP and all resources then appeared UP through happenstance.
Ah so the above wasn't supposed to happen - without the bug the resources would remain down, is that correct? And the reason it was happening is here:
https://github.com/flux-framework/flux-core/blob/b20460a6e8f2ef9938a2e8bab898ff505b39df3a/src/modules/resource/monitor.c#L249-L272
Ok so assuming we set monitor-force-up and a monitor returns successfully, do the jobs start scheduling on nodes (that are thought to be up) because of the flux_reactor_run? https://github.com/flux-framework/flux-core/blob/b20460a6e8f2ef9938a2e8bab898ff505b39df3a/src/modules/resource/resource.c#L499. I'm trying to understand how once the node is "up" we get jobs assigned to it, sorry for all the questions.
The resource.acquire protocol is described here: https://flux-framework.readthedocs.io/projects/flux-rfc/en/latest/spec_28.html
Basically the scheduler starts up and asks resource "what resources do I have?" Resource says here's a pile of nodes/cores, whatever, and none of them are up. Oh, now two are up. Oh, now four are up (or all are up). The scheduler is simultaneously receiving "alloc" requests from the job manager asking for a resource allocation for pending jobs. So the scheduler's job is to decide which resources to allocate to the jobs requesting them. It should only allocate the resources that are up, of course. Does that help?
So in this case, all the scheduler knows is that the lead broker is up, and the lead broker is said to have all of the resources of the fake nodes (this part here, saying that the resource spec can come from a config file, is what we did):
> This resource set may be obtained from a configuration file, dynamically discovered, or assigned by the enclosing instance
So this response is just going to reflect what we put in that broker config, and we don't verify any of them (that would be done with hwloc?) because we added:
[resource]
noverify = true
And then because we are in this mock mode, there isn't an actual job run; it just schedules (assigns the job to some fake nodes), waits for the run duration, and then calls it completed? So does the command hostname matter at all? And then what if you have some set of real resources and some set of fake resources (so a mix of both of those cases)?
I think you got it! The actual command shouldn't matter.
If you mix real and fake resources, the scheduler doesn't know which is which so it'll be fine if you are mocking execution, and sometimes fine and sometimes not if you aren't.
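A hypothetical sketch of mixing real and fake nodes, following the same config pattern as the test script (the hostname realnode1 is made up; a broker would actually have to be running on the real host for non-mock execution to land there):

flux config load <<EOF
[resource]
noverify = true

[[resource.config]]
# one real node plus 100 fake ones (hypothetical hostnames)
hosts = "realnode1,test[0-99]"
cores = "0-47"
EOF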
I'll try this next! Thanks for answering my questions!
> The default state of nodes is down. Is there a situation where they need to be forced down after having been mocked up or actually up? (like shrinking back down)? Anyway not a scheduler issue per se so I'd suggest opening a flux-core issue (if there is an issue).
This thread, and this question, is probably a good one to chat with @tpatki, @JaeseungYeom and maybe @milroy about too. We've talked about needing to work up a newer-generation simulator for some of the Fractale work if that happens, and this would fall under that pretty neatly.
OK. I feel like we should take it out of this issue though as the original topic is pretty important and the results presented above are significant and deserve some attention.
Maybe open a flux-core issue after discussing requirements? Could be a team meeting topic if needed?
To that point, I'm trying to repro some of this. Just to be sure, you got a lot of job-manager/jobtap errors, right @grondo? This is what I'm getting for a small one, for example:
root@1f9bd98b509c:/workspaces/flux-sched/build# time flux start env NJOBS=10 NNODES=10 MATCH_FORMAT=rv1_nosched ../test_rv1_perf.sh
MATCH_FORMAT=rv1_nosched NJOBS=10 NODES/JOB=10
{
"match-format": "rv1_nosched"
}
STATE QUEUE NNODES NCORES NGPUS NODELIST
free batch 1900 91200 17100 test[0-1899]
free debug 100 4800 900 test[1900-1999]
allocated 0 0 0
down 0 0 0
Jun 27 20:20:00.837395 job-manager.err[0]: jobtap: job.new: callback returned error
Jun 27 20:20:00.837427 job-manager.err[0]: jobtap: job.new: callback returned error
Jun 27 20:20:00.837459 job-manager.err[0]: jobtap: job.new: callback returned error
Jun 27 20:20:00.837555 job-manager.err[0]: jobtap: job.new: callback returned error
Jun 27 20:20:00.837602 job-manager.err[0]: jobtap: job.new: callback returned error
Jun 27 20:20:00.837648 job-manager.err[0]: jobtap: job.new: callback returned error
Jun 27 20:20:00.837683 job-manager.err[0]: jobtap: job.new: callback returned error
Jun 27 20:20:00.837716 job-manager.err[0]: jobtap: job.new: callback returned error
Jun 27 20:20:00.837749 job-manager.err[0]: jobtap: job.new: callback returned error
Jun 27 20:20:00.837776 job-manager.err[0]: jobtap: job.new: callback returned error
Jun 27 20:20:00.899556 job-manager.err[0]: jobtap: job.inactive-add: callback returned error
Jun 27 20:20:00.901493 job-manager.err[0]: jobtap: job.inactive-add: callback returned error
Jun 27 20:20:00.903428 job-manager.err[0]: jobtap: job.inactive-add: callback returned error
Jun 27 20:20:00.905413 job-manager.err[0]: jobtap: job.inactive-add: callback returned error
Jun 27 20:20:00.907354 job-manager.err[0]: jobtap: job.inactive-add: callback returned error
Jun 27 20:20:00.909447 job-manager.err[0]: jobtap: job.inactive-add: callback returned error
Jun 27 20:20:00.911362 job-manager.err[0]: jobtap: job.inactive-add: callback returned error
Jun 27 20:20:00.913584 job-manager.err[0]: jobtap: job.inactive-add: callback returned error
Jun 27 20:20:00.915916 job-manager.err[0]: jobtap: job.inactive-add: callback returned error
Jun 27 20:20:00.917784 job-manager.err[0]: jobtap: job.inactive-add: callback returned error
flux-job: job history is empty
flux-job: error parsing jobid: "R"
rv1_nosched 10 10 0.29 34.53 0 415 180224
real 0m3.199s
user 0m2.918s
sys 0m0.366s
> To that point, I'm trying to repro some of this, just to be sure, you got a lot of job-manager/jobtap errors right @grondo?
No, I don't see any of those jobtap errors in my runs. Looks like you are inside a docker container? Let me try in that environment (I was running on corona).
@trws - I can't reproduce the errors above in the latest flux-sched docker container.
I performed the following steps:
- docker pull fluxrm/flux-sched:latest
- docker run -ti fluxrm/flux-sched
- paste the test script from https://github.com/flux-framework/flux-sched/issues/1009#issuecomment-1603636498 as test.sh
- flux start ./test.sh
Thanks @grondo, I'm trying to get an environment where I can get a decent perf trace anyway, so I'm building one up in a long-term VM (the containers are proving a bit of a problem for getting perf to work at all). Hopefully that will just take care of it, will see.
@trws if you just need a small setup, the flux operator with kind works nicely (what I've been using to mess around).
Trick is the mismatch between the container package versions and kernel versions I have handy, and the general difficulty of doing kernel event tracing in a container. Much as it's a bit of one-off work, doing tracing this way will save me time in the long run.
> Hopefully that will just take care of it, will see.
Ok, let me know if you still see any issues.
A from-scratch rebuild on an aarch64 Debian bookworm VM with current versions of packages got rid of all the errors. Makes me want to know where they came from, but it's a much better place to start. I do get a nasty warning-turned-error out of Boost Graph because of a new gcc-12 warning that's getting triggered despite it being in a system location. Not sure how that's getting past -isystem, but it's not great. Looking into the perf issues; may try and look at the boost issue, but I think the flags are all deep in the contributed M4 scripts. 😬