
optionally withhold some cores on each node for system tasks


Problem: in some situations, applications run faster when some number of cores are reserved for system services. In other situations, applications perform best when all the cores are available to the application.

Apparently LSF on sierra will by default only allocate 40 of 44 cores per node to a job. There is an option users can add to their job request to indicate that 42 or 44 be allocated.

garlick avatar Sep 26 '24 16:09 garlick

Hmm, alternatively, maybe a simpler solution not involving fluxion would be to add a way to configure a batch/alloc instance to set aside N cores per node from the R assigned to it when it bootstraps; fluxion would then be unaware of those reserved cores.

garlick avatar Sep 26 '24 16:09 garlick

Interesting. The brokers could perhaps then bind themselves to the reserved cores to avoid impacting job tasks run in the instance. If this were supported in the broker config, then it could conceivably be passed to the instance via something like --conf=reserved-cores=...
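
For example (purely a sketch: the reserved-cores key does not exist today, and whether it would take a count or an idset is an open question), the existing --conf=KEY=VAL mechanism in flux batch/alloc might be used like this:

# hypothetical: reserve cores per node for the broker and system services;
# the reserved-cores key and its semantics are not implemented today
flux alloc -N4 --conf=reserved-cores=2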

grondo avatar Sep 26 '24 16:09 grondo

Moving to flux-core.

garlick avatar Sep 26 '24 16:09 garlick

Possibly related to #4372? (although may be outdated discussion)

wihobbs avatar Sep 26 '24 16:09 wihobbs

Good catch @wihobbs!

garlick avatar Sep 26 '24 16:09 garlick

Adam Bertsch was just asking about this. I was wondering if the things that we were discussing for rediscovering GPU resources (#6418) would be useful for rediscovering the system task cores when users wanted them, but that discussion in #4372 makes it sound like probably not.

ryanday36 avatar Dec 09 '24 16:12 ryanday36

It would really depend on how the feature was implemented. In retrospect, enabling a low-noise config on a per-instance basis might be easiest: this could result in binding the broker+modules to a certain (configurable?) set of cores and removing those from the R shared with the scheduler. Admittedly, this is an off-the-cuff statement and may not handle all the requirements properly (e.g. it doesn't handle binding other system processes to the system-reserved cores).

grondo avatar Dec 09 '24 17:12 grondo

Possibly related: #5240

grondo avatar Dec 12 '24 02:12 grondo

See the comment in #5240 for a prototype.

grondo avatar Jan 08 '25 23:01 grondo

@eleon has a few thoughts here shared today with @trws and me. Tagging him so he can share more.

wihobbs avatar Jun 10 '25 13:06 wihobbs

Thanks for the pointer, @wihobbs

The issue I see is that on El Cap systems, Flux is giving users cores that are "reserved" for system tasks (more details below).

My suggestion for working with systems that provide noise mitigation via core specialization is as follows:

  1. If users ask for a full node (-x), then Flux provides all of the cores, including system cores.
  2. If users ask for a number of cores (-c), then Flux provides the requested number of cores, but draws them from a pool that excludes the system cores.

Right now, when a user requests, say, 8 cores, one of the cores is a system core. This causes an issue for mpibind, because mpibind will only schedule work on user cores (by default). This means that the job will only run on 7 cores, potentially oversubscribing the resources:

leon@tuolumne1071:~$ flux run -N1 -n8 -c1 -o mpibind=off -o cpu-affinity=per-task sh -c 'taskset -pc $$' 
pid 75901's current affinity list: 89,185
pid 75900's current affinity list: 88,184
pid 75902's current affinity list: 90,186
pid 75903's current affinity list: 91,187
pid 75904's current affinity list: 92,188
pid 75905's current affinity list: 93,189
pid 75907's current affinity list: 95,191
pid 75906's current affinity list: 94,190
leon@tuolumne1071:~$ flux run -N1 -n8 -c1 sh -c 'taskset -pc $$' 
pid 75913's current affinity list: 89
pid 75914's current affinity list: 185
pid 75915's current affinity list: 90,186
pid 75916's current affinity list: 91,187
pid 75917's current affinity list: 92,188
pid 75918's current affinity list: 93,189
pid 75919's current affinity list: 94,190
pid 75920's current affinity list: 95,191

eleon avatar Jun 11 '25 09:06 eleon

This is unfortunately expected behavior because Flux is configured with all available cores. So Flux has no way to know which cores are reserved for the system. The only thing configured to know which cores are system cores is mpibind, which has no way to easily communicate this to Flux.

A better long-term solution would be to configure Flux so it knows which cores are reserved. This could be done now by changing the system config to include only the cores not reserved for the system, e.g. instead of cores = "0-95", use cores = "IDSET OF USER CORES". The drawback is that a user could not request 96 cores at the system level. However, since the system instance is configured for exclusive node scheduling, perhaps that is not a showstopper?
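
As a sketch (the hostlist and the idset below are purely illustrative; the real idset would exclude whichever cores are reserved for system tasks), that change in the system resource config might look something like:

# before: every core on the node is handed to the scheduler
[[resource.config]]
hosts = "tuolumne[1001-1104]"   # hostlist illustrative
cores = "0-95"
gpus = "0-3"

# after: only the user cores
[[resource.config]]
hosts = "tuolumne[1001-1104]"
cores = "0-83"                  # illustrative; substitute the actual user-core idset
gpus = "0-3"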

The other, perhaps better, long-term solution would be a separate way to configure Flux with the reserved cores, independent of the main resource config. There was an experimental PR here: https://github.com/flux-framework/flux-core/pull/6547, but at this point that's just an idea.

@trws had some more concrete ideas which he can share here.

grondo avatar Jun 11 '25 16:06 grondo

I posted this in a Slack discussion, but I'm reposting it here in case the workaround is useful while we determine the best path forward.

You can launch a subinstance of Flux which has the reserved cores removed from R by ensuring the mpibind plugin is active for the subinstance job, then using resource.rediscover=true so that the resource config is rediscovered using hwloc. Since mpibind has set the CPU affinity mask before topology discovery occurs, the new Flux instance will only be configured with the available cores.

The best way to do this now is to use a helper script, here called start.sh:

#!/bin/sh
# Write a per-job config telling the new instance to rediscover its
# resources via hwloc (restricted to the CPU affinity mask mpibind set):
cat <<EOF >${FLUX_JOB_TMPDIR}/conf.toml
resource.rediscover = true
EOF
# Start the nested instance using that config directory:
flux start -c ${FLUX_JOB_TMPDIR} "$@"

Then launch this script instead of flux alloc or flux run flux start:

$ flux resource list
     STATE PROPERTIES NNODES NCORES NGPUS NODELIST
      free pall,pdeb+      1     96     4 tuolumne1006
 allocated                 0      0     0 
      down                 0      0     0 
$ flux run -o pty.interactive -n1 -c96 ./start.sh
$ flux resource list
     STATE NNODES NCORES NGPUS NODELIST
      free      1     84     4 tuolumne1006
 allocated      0      0     0 
      down      0      0     0 

Jobs run in this subinstance should only use the user cores, not the system cores.

grondo avatar Jun 11 '25 16:06 grondo

Pinging @pearce8, @michaelmckinsey1, and @amroakmal on this since they're running into issues with this on Tuo as part of the Benchpark project.

ilumsden avatar Jun 11 '25 17:06 ilumsden

My concrete thought on this was that we could apply a couple of small things to make this a bit better. Probably not a full solution but something.

  1. add the resource.rediscover to the sub-instance config by default
  2. Add an option, with a plugin probably, to say give_me_all_the_cores_darn_it that removes that option from the config
  3. Add a jobspec verifier that will reject jobs that request more than the non-system cores at the system level (this sounds orthogonal, but means we don't have the problem for run or submit at that level)
  4. Auto-enclose the run or submit at the top level in an instance so we avoid the blocked cores for those too

I think we already have the option for number 4; a small plugin to add, or not add, the rediscover setting would be the main part.
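
For item 3, a rough sketch of how such a verifier could be hooked into the job ingest validator config. The core-limit plugin name and its argument are made up here and the plugin would still need to be written; [ingest.validator] with plugins and args are existing flux-core config keys, and 84 just matches the user-core count shown earlier:

# hypothetical: reject jobs at ingest that request more cores per node
# than the non-system core count ("core-limit" does not exist yet)
[ingest.validator]
plugins = [ "jobspec", "core-limit" ]
args = [ "--max-cores-per-node=84" ]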

trws avatar Jun 13 '25 13:06 trws

> add the resource.rediscover to the sub-instance config by default

This will re-run the hwloc topology load, so it will find all the cores unless the broker has its affinity set to a subset. So this could work if we have mpibind set the affinity mask for the broker (right now mpibind is disabled for instances of Flux). Is this what you were thinking?

> Add an option, with a plugin probably, to say give_me_all_the_cores_darn_it that removes that option from the config

In this scenario, mpibind would also need to be disabled, or the resource.norestrict option enabled in the config.

> Auto-enclose the run or submit at the top level in an instance so we avoid the blocked cores for those too

The system can't rewrite user arguments, except perhaps at the job shell level, because of the jobspec signature. I wonder if it would be easier to just document that to use all cores you have to launch a subinstance with flux batch or flux alloc and the appropriate options?

FWIW, I think the approach we were thinking of using for now would be:

  1. Configure the system instance of Flux with the set of non-system cores, e.g. replace cores=0-95 with cores=1,3,5... or whatever
  2. Document that if users want to use all the cores, they have to run with flux alloc or flux batch and --conf=resource.norestrict=true --conf=resource.rediscover=true (once we have CLI plugins this could be a single option)
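
A sketch of what the documented command line for item 2 could look like (the node count is illustrative; the two --conf keys are the existing resource options named above):

# request a node and rediscover the full topology, including system cores
flux alloc -N1 \
    --conf=resource.norestrict=true \
    --conf=resource.rediscover=true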

grondo avatar Jun 13 '25 13:06 grondo

> > add the resource.rediscover to the sub-instance config by default
>
> This will re-run the hwloc topology load, so it will find all the cores unless the broker has its affinity set to a subset. So this could work if we have mpibind set the affinity mask for the broker (right now mpibind is disabled for instances of Flux). Is this what you were thinking?

Yup, exactly that. Originally I was thinking we would use the plugin to subset the R to the non-system cores instead, which would be more efficient but more work.

> > Add an option, with a plugin probably, to say give_me_all_the_cores_darn_it that removes that option from the config
>
> In this scenario, mpibind would also need to be disabled, or the resource.norestrict option enabled in the config.

Quite right.

> > Auto-enclose the run or submit at the top level in an instance so we avoid the blocked cores for those too
>
> The system can't rewrite user arguments, except perhaps at the job shell level, because of the jobspec signature. I wonder if it would be easier to just document that to use all cores you have to launch a subinstance with flux batch or flux alloc and the appropriate options?

Sure.

> FWIW, I think the approach we were thinking of using for now would be:
>
> 1. Configure the system instance of Flux with the set of non-system cores, e.g. replace cores=0-95 with cores=1,3,5... or whatever
> 2. Document that if users want to use all the cores, they have to run with flux alloc or flux batch and --conf=resource.norestrict=true --conf=resource.rediscover=true (once we have CLI plugins this could be a single option)

That sounds like it would work; it's cleaner in a way as well. I was thinking the other direction since it would mean we could subset R rather than rediscovering if we want to, but I suppose we could add as easily as remove, so I'm not sure it matters.

trws avatar Jun 13 '25 14:06 trws

Good points; I was just making sure the different approaches were documented here.

grondo avatar Jun 13 '25 15:06 grondo