optionally withhold some cores on each node for system tasks
Problem: in some situations, applications run faster when some number of cores are reserved for system services. In other situations, applications perform best when all the cores are available to the application.
Apparently LSF on Sierra will, by default, allocate only 40 of the 44 cores per node to a job. There is an option users can add to their job request to have 42 or all 44 allocated.
Hmm, alternatively, maybe a simpler solution not involving fluxion would be to add a way to configure a batch/alloc instance to set aside N cores per node from the R assigned to it when it bootstraps; fluxion would then be unaware of those reserved cores.
Interesting. The brokers could perhaps then bind themselves to the reserved cores to avoid impacting job tasks run in the instance. If this were supported in the broker config, then it could conceivably be passed to the instance via something like --conf=reserved-cores=...
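If this were supported, the broker config might look something like the fragment below. To be clear, the `reserved-cores` key is purely hypothetical (it does not exist today), and the idset is an example only:

```toml
# Hypothetical broker config fragment: withhold these cores from the R
# handed to the scheduler and bind the broker and its modules to them.
[resource]
reserved-cores = "88-95"   # hypothetical key; idset is an example only
```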
Moving to flux-core.
Possibly related to #4372? (although may be outdated discussion)
Good catch @wihobbs!
Adam Bertsch was just asking about this. I was wondering if the things that we were discussing for rediscovering GPU resources (#6418) would be useful for rediscovering the system task cores when users wanted them, but that discussion in #4372 makes it sound like probably not.
It would really depend on how the feature was implemented. In retrospect, enabling a low-noise config on a per-instance basis might be easiest: this could bind the broker and its modules to a certain (configurable?) set of cores and remove those cores from the R shared with the scheduler. Admittedly, this is an off-the-cuff statement and may not handle all the requirements properly (e.g. it doesn't handle binding other system processes to the system-reserved cores).
Possibly related #5240
See this comment in #5240 for a prototype.
@eleon has a few thoughts here shared today with @trws and me. Tagging him so he can share more.
Thanks for the pointer, @wihobbs
The issue I see is that on El Cap systems, Flux is giving users cores that are "reserved" for system tasks (more details below).
My suggestion for working with systems that provide noise mitigation via core specialization is as follows:
- If users ask for a full node (`-x`), then Flux provides all of the cores, including system cores.
- If users ask for a number of cores (`-c`), then Flux provides the requested number of cores, but draws them from a pool that excludes the system cores.
Right now, when a user requests, say, 8 cores, one of the cores is a system core. This causes an issue for mpibind, because mpibind will only schedule work on user cores (by default). This means that the job will only run on 7 cores, potentially oversubscribing the resources:
```
leon@tuolumne1071:~$ flux run -N1 -n8 -c1 -o mpibind=off -o cpu-affinity=per-task sh -c 'taskset -pc $$'
pid 75901's current affinity list: 89,185
pid 75900's current affinity list: 88,184
pid 75902's current affinity list: 90,186
pid 75903's current affinity list: 91,187
pid 75904's current affinity list: 92,188
pid 75905's current affinity list: 93,189
pid 75907's current affinity list: 95,191
pid 75906's current affinity list: 94,190
leon@tuolumne1071:~$ flux run -N1 -n8 -c1 sh -c 'taskset -pc $$'
pid 75913's current affinity list: 89
pid 75914's current affinity list: 185
pid 75915's current affinity list: 90,186
pid 75916's current affinity list: 91,187
pid 75917's current affinity list: 92,188
pid 75918's current affinity list: 93,189
pid 75919's current affinity list: 94,190
pid 75920's current affinity list: 95,191
```
This is unfortunately expected behavior because Flux is configured with all available cores. So Flux has no way to know which cores are reserved for the system. The only thing configured to know which cores are system cores is mpibind, which has no way to easily communicate this to Flux.
A better long-term solution would be to configure Flux to know which cores are reserved. This could be done now by changing the system config to include only the cores not reserved for the system, e.g. instead of `cores = "0-95"`, use `cores = "IDSET OF USER CORES"`. The drawback here is that a user could not request 96 cores at the system level. However, since the system instance is configured for exclusive node scheduling, perhaps this is not necessarily a showstopper?
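For what it's worth, that first option might look like the sketch below in the system instance's resource config. The hostlist and idsets here are illustrative placeholders, not any system's real values:

```toml
# Sketch: list only the non-reserved cores, so the scheduler never hands
# out the system cores. Hostlist and idsets are placeholders.
[[resource.config]]
hosts = "node[1-4]"
cores = "0-87"   # placeholder idset of user cores (system cores excluded)
gpus = "0-3"
```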
The other, perhaps better, long-term solution would be to have a different way to configure Flux with the reserved cores, separate from the main resource config. There was a kind of experimental PR here: https://github.com/flux-framework/flux-core/pull/6547, but at this point that's just an idea.
@trws had some more concrete ideas which he can share here.
Posted in a Slack discussion, but I'm reposting here in case this workaround is useful while we determine the best path forward.
You can launch a subinstance of Flux which has the reserved cores removed from R by ensuring the mpibind plugin is active for the subinstance job, then using `resource.rediscover=true` so that the resource config is rediscovered using hwloc. Since mpibind has set the CPU affinity mask before topology discovery occurs, the new Flux instance will only be configured with the available cores.
The best way to do this now is to use a helper script, here called `start.sh`:

```sh
#!/bin/sh
# Write a per-job config telling the new instance to rediscover its
# resources with hwloc, then start the instance using that config dir.
cat <<EOF >"${FLUX_JOB_TMPDIR}/conf.toml"
resource.rediscover = true
EOF
flux start -c "${FLUX_JOB_TMPDIR}" "$@"
```
Then launch this script with `flux run` instead of using `flux alloc` or `flux run flux start`:
```
$ flux resource list
STATE     PROPERTIES NNODES NCORES NGPUS NODELIST
free      pall,pdeb+      1     96     4 tuolumne1006
allocated                 0      0     0
down                      0      0     0
$ flux run -o pty.interactive -n1 -c96 ./start.sh
$ flux resource list
STATE     NNODES NCORES NGPUS NODELIST
free           1     84     4 tuolumne1006
allocated      0      0     0
down           0      0     0
```
Jobs run in this subinstance should use only the user cores, not the system cores.
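As a quick sanity check, the config-generation step of the helper can be exercised without a running Flux instance; a `mktemp` scratch directory stands in for `FLUX_JOB_TMPDIR` here:

```sh
#!/bin/sh
# Exercise just the config-generation step of start.sh, substituting a
# scratch directory for FLUX_JOB_TMPDIR; no Flux installation required.
FLUX_JOB_TMPDIR=$(mktemp -d)
cat <<EOF >"${FLUX_JOB_TMPDIR}/conf.toml"
resource.rediscover = true
EOF
cat "${FLUX_JOB_TMPDIR}/conf.toml"
```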
Pinging @pearce8, @michaelmckinsey1, and @amroakmal on this since they're running into issues with this on Tuo as part of the Benchpark project.
My concrete thought on this was that we could apply a couple of small things to make this a bit better. Probably not a full solution but something.
1. add the `resource.rediscover` to the sub-instance config by default
2. Add an option, with a plugin probably, to say `give_me_all_the_cores_darn_it` that removes that option from the config
3. Add a jobspec verifier that will reject jobs that request more than the non-system cores at the system level (this sounds orthogonal, but means we don't have the problem for run or submit at that level)
4. Auto-enclose the run or submit at the top level in an instance so we avoid the blocked cores for those too
I think we already have the option for number 4; a small plugin to add, or not add, the rediscover thing would be the main part.
> add the `resource.rediscover` to the sub-instance config by default

This will re-run hwloc topology load, so will find all the cores unless the broker has its affinity set to a subset. So this could work if we have mpibind set the affinity mask for the broker (right now mpibind is disabled for instances of Flux). Is this what you were thinking?
> Add an option, with a plugin probably, to say `give_me_all_the_cores_darn_it` that removes that option from the config

In this scenario, mpibind would also need to be disabled, or the `resource.norestrict` option enabled in the config.
> Auto-enclose the run or submit at the top level in an instance so we avoid the blocked cores for those too

The system can't rewrite user arguments, except perhaps at the job shell level, because of the jobspec signature. I wonder if it would be easier to just document that to use all cores you have to launch a subinstance with `flux batch` or `flux alloc` and the appropriate options?
FWIW, I think the approach we were thinking of using for now would be:
- Configure the system instance of Flux with the set of non-system cores, e.g. replace `cores = "0-95"` with `cores = "1,3,5..."` or whatever
- Document that if users want to use all the cores, they have to run with `flux alloc` or `flux batch` and `--conf=resource.norestrict=true --conf=resource.rediscover=true` (once we have CLI plugins this could be a single option)
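Concretely, the second bullet would amount to something like the invocation below. This is a sketch only, untested here, though both config keys are the ones discussed in this thread:

```sh
# Sketch: start a subinstance that rediscovers the full core set.
#   resource.norestrict - don't restrict discovery to the enclosing
#                         instance's cpuset
#   resource.rediscover - re-run hwloc topology discovery at startup
flux alloc -N1 \
    --conf=resource.norestrict=true \
    --conf=resource.rediscover=true
```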
> > add the `resource.rediscover` to the sub-instance config by default
>
> This will re-run hwloc topology load, so will find all the cores unless the broker has its affinity set to a subset. So this could work if we have mpibind set the affinity mask for the broker (right now mpibind is disabled for instances of Flux). Is this what you were thinking?

Yup, exactly that. Originally I was thinking we use the plugin to subset the R to the non-system cores instead, which would be more efficient but more work.

> > Add an option, with a plugin probably, to say `give_me_all_the_cores_darn_it` that removes that option from the config
>
> In this scenario, mpibind would also need to be disabled, or the `resource.norestrict` option enabled in the config.

Quite right.

> > Auto-enclose the run or submit at the top level in an instance so we avoid the blocked cores for those too
>
> The system can't rewrite user arguments, except perhaps at the job shell level, because of the jobspec signature. I wonder if it would be easier to just document that to use all cores you have to launch a subinstance with `flux batch` or `flux alloc` and the appropriate options?

Sure.

> FWIW, I think the approach we were thinking of using for now would be:
>
> - Configure the system instance of Flux with the set of non-system cores, e.g. replace `cores = "0-95"` with `cores = "1,3,5..."` or whatever
> - Document that if users want to use all the cores, they have to run with `flux alloc` or `flux batch` and `--conf=resource.norestrict=true --conf=resource.rediscover=true` (once we have CLI plugins this could be a single option)

That sounds like it would work; it's cleaner in a way as well. I was thinking the other direction since it would mean we could subset R rather than rediscovering if we want to, but I suppose we could add as easily as remove, so I'm not sure it matters.
Good points, I was just making sure the different approaches were documented here.