flux-sched
test impact of using rv1 vs rv1_nosched on instance performance
In a planning meeting, the idea of running with rv1 match format enabled in production was discussed as a stopgap solution for #991. However, the performance or other impact due to that change was not known. We should characterize any impact due to this configuration so we can make decisions based on results.
Here's a first attempt at a parameter study that investigates different job sizes and counts with rv1 vs rv1_nosched.
Each test instance runs the script found at the bottom of this comment with different parameters for match-format, number of jobs, and nodes per job (all jobs allocate nodes exclusively, as would be the case for a system instance).
The suite of tests was launched on corona via the following bulksubmit invocation:
flux bulksubmit -n1 -c4 --watch --shuffle \
--output=results.{seq} \
--env=NJOBS={0} --env=NNODES={1} --env=MATCH_FORMAT={2} \
flux start ./test.sh \
::: 10 100 1000 2000 8000 \
::: 1 10 100 1000 \
::: rv1 rv1_nosched
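That sweep covers 5 × 4 × 2 = 40 parameter combinations, each one running the script in its own single-broker test instance.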
A couple of the parameter combinations caused an issue that may need to be investigated. The nodes/job=1000 cases for njobs=1000, 2000, and 8000 all failed due to running out of space in /var/tmp. This occurred even though the content store db was only around a GB, because the rank 0 broker RSS was 160GB. Note that in this case errors were logged but the instance just stopped processing jobs.
Other instances reached a maximum RSS of ~2G, so I'm not yet sure what the issue was (more investigation needed!). Perhaps we are caching R in memory somewhere with the full scheduling key intact. Note in the data below that a 1000-node exclusive R takes 17MiB - that is one R object :astonished:.
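For follow-up runs, a rough monitoring loop like the one below could help pin down where the memory goes. This is only a sketch, not part of the test script that follows; it relies on the broker.pid attribute and the same content-sqlite stats the script already reports.

# Hypothetical monitoring loop: sample the rank 0 broker RSS and the content
# store size once a minute while the test is running.
while sleep 60; do
    rss_kib=$(ps -o rss= -p "$(flux getattr broker.pid)")
    db_size=$(flux module stats content-sqlite | jq .dbfile_size)
    printf "rss=%s KiB dbfile_size=%s bytes\n" "$rss_kib" "$db_size"
done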
#!/bin/bash
MATCH_FORMAT=${MATCH_FORMAT:-rv1}
NJOBS=${NJOBS:-100}
NNODES=${NNODES:-16}
printf "MATCH_FORMAT=${MATCH_FORMAT} NJOBS=$NJOBS NODES/JOB=$NNODES\n"
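# Remove the currently loaded scheduler and resource modules so they can be
# reconfigured and reloaded below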
flux module remove sched-fluxion-qmanager
flux module remove sched-fluxion-resource
flux module remove resource
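# Configure a fake 2000-node system with batch and debug queues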
flux config load <<EOF
[sched-fluxion-qmanager]
queue-policy = "easy"
[sched-fluxion-resource]
match-format = "$MATCH_FORMAT"
[queues.debug]
requires = ["debug"]
[queues.batch]
requires = ["batch"]
[resource]
noverify = true
norestrict = true
[[resource.config]]
hosts = "test[0-1999]"
cores = "0-47"
gpus = "0-8"
[[resource.config]]
hosts = "test[0-1899]"
properties = ["batch"]
[[resource.config]]
hosts = "test[1900-1999]"
properties = ["debug"]
EOF
flux config get | jq '."sched-fluxion-resource"'
flux module load resource noverify monitor-force-up
flux module load sched-fluxion-resource
flux module load sched-fluxion-qmanager
flux queue start --all --quiet
flux resource list
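# Time the submission (and mock execution) of NJOBS exclusive NNODES-node jobs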
t0=$(date +%s.%N)
flux submit -N$NNODES --queue=batch --cc=1-$NJOBS \
--setattr=exec.test.run_duration=1ms \
--quiet --wait hostname
ELAPSED=$(echo $(date +%s.%N) - $t0 | bc -l)
THROUGHPUT=$(echo $NJOBS/$ELAPSED | bc -l)
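# Report R size for the last job plus content store object count and db size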
R_SIZE=$(flux job info $(flux job last) R | wc -c)
OBJ_COUNT=$(flux module stats content-sqlite | jq .object_count)
DB_SIZE=$(flux module stats content-sqlite | jq .dbfile_size)
printf "%-12s %5d %4d %8.2f %8.2f %12d %12d %12d\n" \
$MATCH_FORMAT $NJOBS $NNODES $ELAPSED $THROUGHPUT \
$R_SIZE $OBJ_COUNT $DB_SIZE
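For a single one-off run outside of bulksubmit, the same script can be driven like this (the sizes here are arbitrary, following the pattern used by the sweep above):

NJOBS=100 NNODES=10 MATCH_FORMAT=rv1 flux start ./test.sh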
Here are the initial results. The rv1_nosched njobs=8000 nodes/job=1000 case ran out of time after 8hrs.
FORMAT NJOBS SIZE RUNTIME JPS R_SIZE NOBJECTS DB_SIZE
rv1_nosched 10 1 0.67 14.92 215 430 229376
rv1_nosched 10 1000 16.75 0.60 229 536 249856
rv1_nosched 10 10 0.66 15.10 232 480 249856
rv1_nosched 10 100 1.20 8.33 226 504 258048
rv1_nosched 100 1 1.20 83.55 215 3668 2170880
rv1_nosched 100 10 1.64 60.86 226 4207 2441216
rv1_nosched 100 1000 338.28 0.30 229 4983 2899968
rv1_nosched 100 100 86.50 1.16 232 4691 3043328
rv1_nosched 1000 1 8.43 118.68 212 38020 23445504
rv1_nosched 1000 10 803.42 1.24 232 44892 33071104
rv1_nosched 1000 100 1237.58 0.81 232 46465 37584896
rv1_nosched 1000 1000 4014.96 0.25 229 50130 44744704
rv1_nosched 2000 1 21.52 92.95 215 75639 49369088
rv1_nosched 2000 10 1811.92 1.10 226 97349 82919424
rv1_nosched 2000 100 2497.18 0.80 232 99770 86978560
rv1_nosched 2000 1000 8451.37 0.24 229 101053 98111488
rv1_nosched 8000 1 5525.57 1.45 215 335169 293126144
rv1_nosched 8000 100 11872.30 0.67 232 382905 397213696
rv1_nosched 8000 10 9500.90 0.84 232 396622 425652224
FORMAT NJOBS SIZE RUNTIME JPS R_SIZE NOBJECTS DB_SIZE
rv1 10 1 0.70 14.37 18961 476 282624
rv1 10 10 0.99 10.09 184848 552 491520
rv1 10 100 4.81 2.08 1808642 619 2146304
rv1 100 1 1.51 66.37 18961 4174 2850816
rv1 100 10 5.14 19.47 181352 4773 4882432
rv1 10 1000 92.67 0.11 18228323 653 18608128
rv1 100 100 165.15 0.61 1808642 5552 21889024
rv1 1000 1 10.69 93.56 18609 40268 30941184
rv1 2000 1 26.25 76.20 18961 79640 59781120
rv1 1000 10 898.83 1.11 182538 52837 62410752
rv1 2000 10 2069.59 0.97 181352 107071 133500928
rv1 100 1000 1134.78 0.09 18228323 6213 187297792
rv1 1000 100 2127.79 0.47 1820448 58699 232325120
rv1 8000 1 5594.02 1.43 18730 339055 327118848
rv1 2000 100 4472.80 0.45 1808642 113545 470863872
rv1 8000 10 10955.79 0.73 182538 446144 689192960
rv1 8000 100 19206.11 0.42 1791884 459748 2043412480
Ran similar tests without node exclusive matching. All I did was change -N NNODES to -n 48*NNODES (see the sketch after the table below). Similar results, but a few of the larger rv1 cases OOMed, so I didn't get as many results. I guess all this shows is that node exclusive scheduling isn't the problem with the very slow scheduling here:
FORMAT NJOBS SIZE RUNTIME JPS R_SIZE NOBJECTS DB_SIZE
rv1_nosched 10 1 1.47 6.81 201 436 233472
rv1_nosched 10 10 1.50 6.68 218 462 241664
rv1_nosched 10 1000 40.49 0.25 215 533 249856
rv1_nosched 10 100 2.11 4.74 212 518 266240
rv1 10 1 1.45 6.89 16140 459 270336
rv1 10 10 1.78 5.61 156764 548 450560
rv1 10 100 5.07 1.97 1533328 617 1843200
rv1_nosched 100 1 9.58 10.44 201 3915 2404352
rv1_nosched 100 10 10.37 9.64 212 4158 2420736
rv1_nosched 100 100 567.36 0.18 218 4659 2691072
rv1_nosched 100 1000 1915.72 0.05 215 5011 2760704
rv1 100 1 9.86 10.14 16140 4146 2822144
rv1 100 10 13.15 7.61 153808 4474 4370432
rv1 10 1000 106.01 0.09 15452800 653 15654912
rv1 100 100 648.12 0.15 1533328 5634 19181568
rv1_nosched 1000 1 92.45 10.82 198 41103 29061120
rv1_nosched 1000 1000 19009.78 0.05 0 36650 31846400
rv1 1000 1 94.45 10.59 15842 41364 32202752
rv1_nosched 1000 10 5944.04 0.17 218 45445 33878016
rv1_nosched 1000 100 8331.68 0.12 212 46789 37441536
rv1 1000 10 6220.77 0.16 154814 49858 58167296
rv1_nosched 2000 1 817.74 2.45 201 86857 65691648
rv1 2000 1 812.90 2.46 16140 87815 70770688
rv1_nosched 2000 10 13523.46 0.15 212 94005 72663040
rv1_nosched 2000 100 17157.62 0.12 218 94769 74760192
rv1 2000 10 13874.15 0.14 153808 101857 122454016
rv1 100 1000 2848.30 0.04 15452800 6160 157343744
rv1 1000 100 9320.08 0.11 1533328 56520 202543104
rv1 2000 100 18836.20 0.11 1543334 114569 417652736
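For reference, the non-exclusive change amounts to swapping the submit line in the test script for something like the following sketch (the $((48*NNODES)) expansion is my guess at the exact form, based on the 48-core fake nodes configured above):

# Non-exclusive variant of the submit line; the rest of the script is unchanged.
flux submit -n $((48*NNODES)) --queue=batch --cc=1-$NJOBS \
    --setattr=exec.test.run_duration=1ms \
    --quiet --wait hostname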
I was going to try this same set of experiments with sched-simple, but it turns out this test setup won't work with the simple scheduler. This is because the instance is still waiting for ranks 1-1999 to come up, so only one rank's worth of resources is available.
@garlick and I were wondering after the fact how this happens to work with Fluxion.
It turns out that when Fluxion marks "all" resources down, it uses the instance size to mark ranks 0-(size-1) down. This particular test uses an instance size of 1, so only one resource actually gets set down (rank 0, and then it immediately gets marked up).
https://github.com/flux-framework/flux-sched/blob/32f74d6260c033d32d4af606b60a0eef4c1dbfd7/resource/modules/resource_match.cpp#L1114-L1132
That sounds like a bug, if ironically a useful one for testing this.
Agree. Issue opened: #1040
Ok, figured out the core resource module can be loaded with the monitor-force-up option and was able to run these same tests with sched-simple, just for comparison (see the module-swap sketch after the table):
FORMAT NJOBS SIZE RUNTIME JPS R_SIZE NOBJECTS DB_SIZE
sched-simple 10 10 0.64 15.55 143 459 233472
sched-simple 10 1 0.61 16.27 133 457 241664
sched-simple 10 100 0.79 12.60 147 512 262144
sched-simple 10 1000 6.84 1.46 143 584 274432
sched-simple 100 1 1.31 76.21 133 4071 2514944
sched-simple 100 10 1.50 66.66 143 4340 2629632
sched-simple 100 100 3.21 31.12 151 4867 3047424
sched-simple 100 1000 70.13 1.43 143 5531 3964928
sched-simple 1000 1 8.84 113.06 133 40972 30142464
sched-simple 1000 10 10.63 94.08 143 44390 32665600
sched-simple 1000 100 28.37 35.25 147 50234 43425792
sched-simple 1000 1000 717.07 1.39 143 55068 46710784
sched-simple 2000 1 16.86 118.59 133 82342 62046208
sched-simple 2000 10 20.46 97.75 143 86788 66076672
sched-simple 2000 100 56.88 35.16 151 100292 91693056
sched-simple 2000 1000 1425.18 1.40 143 110692 111550464
sched-simple 8000 1 69.32 115.40 135 309652 252616704
sched-simple 8000 10 84.06 95.17 143 335359 299745280
sched-simple 8000 100 218.65 36.59 147 377497 396423168
sched-simple 8000 1000 5736.57 1.39 143 448211 602529792
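For reference, the module swap for these sched-simple runs looks roughly like this sketch (assuming, as in the original script, that the Fluxion modules were loaded by the rc scripts and that sched-simple is not already loaded):

# Replace the Fluxion scheduler stack with sched-simple; the fake-resource
# config load is the same as in the test script above.
flux module remove sched-fluxion-qmanager
flux module remove sched-fluxion-resource
flux module remove resource
flux module load resource noverify monitor-force-up
flux module load sched-simple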
FYI - I edited the test script above to add the monitor-force-up option to flux module load resource. This will be required once #1042 is merged.
@grondo this might be a separate issue, but would it be possible to mock the state of nodes too? E.g., that some subset in the list is DOWN? The context here is for bursting - we want to mock nodes that don't exist as DOWN and then provide some actually existing nodes (so hopefully we can accomplish something similar without the entire thing being a mock!)
The default state of nodes is down. Is there a situation where they need to be forced down after having been mocked up or actually up? (like shrinking back down)? Anyway not a scheduler issue per se so I'd suggest opening a flux-core issue (if there is an issue).
If the default state is down, in these examples how do they fake run? Where is the logic happening that allows them to do that?
This option, --setattr=exec.test.run_duration=1ms, says: instead of actually running the job, just sleep for a millisecond.
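As a minimal standalone example (using the same options as the test script above), this "runs" a one-node job for a millisecond without launching a real task:

flux submit -N1 --setattr=exec.test.run_duration=1ms --quiet --wait hostname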
Yes! I derived that from here: https://github.com/flux-framework/flux-core/blob/49c59b57bb99852831745cd4cc1052eb56194365/src/modules/job-exec/testexec.c#L68 but I don't understand how it actually works to allow it to run (and schedule on nodes that don't actually exist) and not just determine that the resources are not available. I think maybe I'm asking about a level of detail deeper than that attribute?
The default state of nodes in the scheduler is supposed to be down until the resource.acquire protocol says they are up. A bug in Fluxion (discussed above) set only the first node (of 2000 being simulated here) down, then when that broker came up that node was marked UP and all resources then appeared UP through happenstance.
To really force resources up, the resource module monitor-force-up option is required.
(assuming that is what you were actually asking about?)
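Concretely, that's what these lines from the test script above accomplish; with monitor-force-up the scheduler sees every configured execution target as up right away:

flux module load resource noverify monitor-force-up
flux resource list    # all 2000 fake nodes report as "free", none "down"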
> The default state of nodes in the scheduler is supposed to be down until the resource.acquire protocol says they are up. A bug in Fluxion (discussed above) set only the first node (of 2000 being simulated here) down, then when that broker came up that node was marked UP and all resources then appeared UP through happenstance.
Ah so the above wasn't supposed to happen - without the bug the resources would remain down, is that correct? And the reason it was happening is here:
https://github.com/flux-framework/flux-core/blob/b20460a6e8f2ef9938a2e8bab898ff505b39df3a/src/modules/resource/monitor.c#L249-L272
Ok so assuming we set monitor-force-up and a monitor returns successfully, do the jobs start scheduling on nodes (that are thought to be up) because of the flux_reactor_run? https://github.com/flux-framework/flux-core/blob/b20460a6e8f2ef9938a2e8bab898ff505b39df3a/src/modules/resource/resource.c#L499. I'm trying to understand how once the node is "up" we get jobs assigned to it, sorry for all the questions.
The resource.acquire protocol is described here: https://flux-framework.readthedocs.io/projects/flux-rfc/en/latest/spec_28.html
Basically the scheduler starts up and asks resource "what resources do I have?" Resource says here's a pile of nodes/cores, whatever, and none of them are up. Oh, now two are up. Oh, now four are up (or all are up). The scheduler is simultaneously receiving "alloc" requests from the job manager asking for a resource allocation for pending jobs. So the scheduler's job is to decide which resources to allocate to the jobs requesting them. It should only allocate the resources that are up, of course. Does that help?
So in this case, all the scheduler knows is that the lead broker is up, and the lead broker is said to have all of the resources of the fake nodes (this part here, saying that the resource spec can come from a config file, is what we did):
> This resource set may be obtained from a configuration file, dynamically discovered, or assigned by the enclosing instance
So this response is just going to reflect what we put in that broker config, and we don't verify any of them (that would be done with hwloc?) because we added:
[resource]
noverify = true
And then because we are in this mock mode, there isn't an actual job run; it just schedules (assigns the job to some fake nodes), waits for the run duration, and then calls it completed? So does the command hostname matter at all? And then what if you have some set of real resources and some set of fake resources (so a mix of both of those cases)?
I think you got it! The actual command shouldn't matter.
If you mix real and fake resources, the scheduler doesn't know which is which so it'll be fine if you are mocking execution, and sometimes fine and sometimes not if you aren't.
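A hypothetical sketch of mixing real and fake nodes, following the same config pattern as the test script (the hostname realnode1 is made up; a broker would actually have to be running on the real host for non-mock execution to land there):

flux config load <<EOF
[resource]
noverify = true

[[resource.config]]
# one real node plus 100 fake ones (hypothetical hostnames)
hosts = "realnode1,test[0-99]"
cores = "0-47"
EOF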
I'll try this next! Thanks for answering my questions!
> The default state of nodes is down. Is there a situation where they need to be forced down after having been mocked up or actually up? (like shrinking back down)? Anyway not a scheduler issue per se so I'd suggest opening a flux-core issue (if there is an issue).
This thread, and this question, is probably a good one to chat with @tpatki, @JaeseungYeom and maybe @milroy about too. We've talked about needing to work up a newer-generation simulator for some of the Fractale work if that happens, and this would fall under that pretty neatly.
OK. I feel like we should take it out of this issue though as the original topic is pretty important and the results presented above are significant and deserve some attention.
Maybe open a flux-core issue after discussing requirements? Could be a team meeting topic if needed?
To that point, I'm trying to repro some of this. Just to be sure, you got a lot of job-manager/jobtap errors, right @grondo? This is what I'm getting for a small one, for example:
root@1f9bd98b509c:/workspaces/flux-sched/build# time flux start env NJOBS=10 NNODES=10 MATCH_FORMAT=rv1_nosched ../test_rv1_perf.sh
MATCH_FORMAT=rv1_nosched NJOBS=10 NODES/JOB=10
{
"match-format": "rv1_nosched"
}
STATE QUEUE NNODES NCORES NGPUS NODELIST
free batch 1900 91200 17100 test[0-1899]
free debug 100 4800 900 test[1900-1999]
allocated 0 0 0
down 0 0 0
Jun 27 20:20:00.837395 job-manager.err[0]: jobtap: job.new: callback returned error
Jun 27 20:20:00.837427 job-manager.err[0]: jobtap: job.new: callback returned error
Jun 27 20:20:00.837459 job-manager.err[0]: jobtap: job.new: callback returned error
Jun 27 20:20:00.837555 job-manager.err[0]: jobtap: job.new: callback returned error
Jun 27 20:20:00.837602 job-manager.err[0]: jobtap: job.new: callback returned error
Jun 27 20:20:00.837648 job-manager.err[0]: jobtap: job.new: callback returned error
Jun 27 20:20:00.837683 job-manager.err[0]: jobtap: job.new: callback returned error
Jun 27 20:20:00.837716 job-manager.err[0]: jobtap: job.new: callback returned error
Jun 27 20:20:00.837749 job-manager.err[0]: jobtap: job.new: callback returned error
Jun 27 20:20:00.837776 job-manager.err[0]: jobtap: job.new: callback returned error
Jun 27 20:20:00.899556 job-manager.err[0]: jobtap: job.inactive-add: callback returned error
Jun 27 20:20:00.901493 job-manager.err[0]: jobtap: job.inactive-add: callback returned error
Jun 27 20:20:00.903428 job-manager.err[0]: jobtap: job.inactive-add: callback returned error
Jun 27 20:20:00.905413 job-manager.err[0]: jobtap: job.inactive-add: callback returned error
Jun 27 20:20:00.907354 job-manager.err[0]: jobtap: job.inactive-add: callback returned error
Jun 27 20:20:00.909447 job-manager.err[0]: jobtap: job.inactive-add: callback returned error
Jun 27 20:20:00.911362 job-manager.err[0]: jobtap: job.inactive-add: callback returned error
Jun 27 20:20:00.913584 job-manager.err[0]: jobtap: job.inactive-add: callback returned error
Jun 27 20:20:00.915916 job-manager.err[0]: jobtap: job.inactive-add: callback returned error
Jun 27 20:20:00.917784 job-manager.err[0]: jobtap: job.inactive-add: callback returned error
flux-job: job history is empty
flux-job: error parsing jobid: "R"
rv1_nosched 10 10 0.29 34.53 0 415 180224
real 0m3.199s
user 0m2.918s
sys 0m0.366s
> To that point, I'm trying to repro some of this, just to be sure, you got a lot of job-manager/jobtap errors right @grondo?
No, I don't see any of those jobtap errors in my runs. Looks like you are inside a docker container? Let me try in that environment (I was running on corona).
@trws - I can't reproduce the errors above in the latest flux-sched docker container.
I performed the following steps:
- docker pull fluxrm/flux-sched:latest
- docker run -ti fluxrm/flux-sched
- paste the test script from https://github.com/flux-framework/flux-sched/issues/1009#issuecomment-1603636498 as test.sh
- flux start ./test.sh
Thanks @grondo, I'm trying to get an environment where I can get a decent perf trace anyway, so I'm building one up in a long-term VM (the containers are proving a bit of a problem for getting perf to work at all). Hopefully that will just take care of it, will see.
@trws if you just need a small setup, the flux operator with kind works nicely (what I've been using to mess around).
Trick is the mismatch between the container package versions and kernel versions I have handy, and the general difficulty of doing kernel event tracing in a container. Much as it's a bit of one-off work, doing tracing this way will save me time in the long run.
> Hopefully that will just take care of it, will see.
Ok, let me know if you still see any issues.
A from-scratch rebuild on an aarch64 Debian bookworm VM with current versions of packages got rid of all the errors. Makes me want to know where they came from, but it's a much better place to start. I do get a nasty warning-turned-error out of Boost Graph because of a new gcc-12 warning that's getting triggered despite it being in a system location. Not sure how that's getting past -isystem, but it's not great. Looking into the perf issues; may try and look at the boost issue, but I think the flags are all deep in the contributed M4 scripts. 😬