Tracking issue for shrinking JGF

Open • trws opened this issue 6 months ago • 1 comment

After a discussion between @milroy, @jameshcorbett, and me today, this is the running plan for how we're going to shrink JGF and get rabbit scheduling working with JGF on larger systems.

  1. [ ] @jameshcorbett Make it so that a subinstance can use R for loading while the parent is using RV1 with a sched key
  2. [ ] @trws Add a versioning scheme and a way to identify partial and complete JGF
  3. [ ] (deps: 1) @jameshcorbett Add an option to the JGF writer to only write out the root exclusive vertex of any exclusively allocated subtree. A quick test on tuo's data showed this reduces a full-system job JGF from 21 MB to 700 KB (uncompressed).
  4. [ ] @milroy Work on a template option for config so we can describe a node or a rack once and reuse it to describe a system, rather than having to write out every individual resource
  5. [ ] all (deps: 3, 4) Combine 3 and 4 to write out a templated version of the partial JGF that a child instance can load as a full JGF instance
  6. [ ] @trws Factor out separate graph builder objects in the readers to ensure we can support graph-build-time computations
  7. [ ] @trws Based on what we need for 3, prioritize adding an easy traversal interface for containment (half done, just needs finishing)
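
To make item 3 concrete, here is a minimal Python sketch of the pruning idea, not the actual JGF writer. It assumes a simplified JGF-like layout (a `nodes` list with `id` and `metadata.exclusive`, and an `edges` list holding only parent-to-child containment edges); the function name `prune_exclusive` is hypothetical.

```python
def prune_exclusive(nodes, edges):
    """Keep each exclusively allocated vertex but drop everything below it.

    Assumes (hypothetically) that `edges` holds only parent -> child
    containment edges and that exclusivity is stored in
    node["metadata"]["exclusive"].
    """
    children = {}
    has_parent = set()
    for edge in edges:
        children.setdefault(edge["source"], []).append(edge["target"])
        has_parent.add(edge["target"])

    by_id = {node["id"]: node for node in nodes}
    # Roots are vertices that never appear as an edge target.
    stack = [node["id"] for node in nodes if node["id"] not in has_parent]

    keep = set()
    while stack:
        vid = stack.pop()
        keep.add(vid)
        if by_id[vid]["metadata"].get("exclusive"):
            # The whole subtree is implied by its exclusive root: stop here.
            continue
        stack.extend(children.get(vid, []))

    kept_nodes = [n for n in nodes if n["id"] in keep]
    kept_edges = [e for e in edges
                  if e["source"] in keep and e["target"] in keep]
    return kept_nodes, kept_edges
```

A reader of the pruned output would need the full subtree definition from elsewhere (for example the parent's graph or a template), which is why item 5 combines this with item 4.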

After this, we should also consider if we want to either:

  • Switch to only storing an array of unique ids in the sched key. This is quite small and efficient, but would require a lot of work to rework the reader(s)
  • Automatically apply templating where possible by calculating structural hashes of components of the containment tree. More work to implement, but potentially good compression and doesn't leave out information.
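
As a rough illustration of the structural-hash option, the sketch below hashes each subtree from its resource type and the sorted hashes of its children, so identically shaped subtrees collapse to the same digest. The nested-dict tree layout and both function names are assumptions made for this example, not existing flux-sched interfaces.

```python
import hashlib


def structural_hash(tree):
    """Hash a subtree by shape only: its type plus its children's hashes.

    `tree` is a hypothetical nested dict: {"type": str, "children": [...]}.
    Child hashes are sorted so that sibling order does not matter.
    """
    child_hashes = sorted(structural_hash(c) for c in tree.get("children", []))
    payload = tree["type"] + "(" + ",".join(child_hashes) + ")"
    return hashlib.sha256(payload.encode()).hexdigest()


def group_by_structure(subtrees):
    """Map each structural hash to the names of the subtrees that share it."""
    groups = {}
    for name, tree in subtrees.items():
        groups.setdefault(structural_hash(tree), []).append(name)
    return groups
```

Each group with more than one member is a candidate for a single shared template. A real implementation would also have to decide which properties (sizes, names, ranks) count as part of the structure.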

trws, Oct 21 '25 19:10

Based on recent testing on El Cap, it looks like the first three checkboxes will not be enough to enable JGF there. At some point last week, the El Cap system instance was loaded temporarily with JGF, and the whole instance became unresponsive (see e.g. https://github.com/flux-framework/flux-core/issues/7207). I believe this is still being investigated on the core side.

Completion of the first three checkboxes would shrink JGF for jobs by a huge factor. However, the JGF for the whole instance would be the same size as before, and it seems as though that alone is enough to cause problems. To shrink that, we will need checkbox 4 as well:

Work on template option for config so we can describe a node or a rack and re-use it to describe a system rather than having to write out every individual resource.

However, completion of the first three checkboxes should still be enough for Tuolumne.
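
For illustration only, the template idea in checkbox 4 might work along these lines: a node or rack is described once and then expanded by count. The template schema (`type`, `children`, `template`, `count`) and the `expand` function are hypothetical and do not reflect a settled config format.

```python
def expand(templates, spec, prefix=""):
    """Yield (path, type) for every resource a templated spec describes.

    `templates` maps a template name to {"type": ..., "children": [...]};
    `spec` is {"template": name, "count": n}. Both shapes are assumptions.
    """
    tmpl = templates[spec["template"]]
    for i in range(spec.get("count", 1)):
        path = f"{prefix}/{tmpl['type']}{i}"
        yield path, tmpl["type"]
        for child in tmpl.get("children", []):
            yield from expand(templates, child, path)


# Three short templates stand in for eleven explicitly listed resources.
templates = {
    "core": {"type": "core"},
    "node": {"type": "node", "children": [{"template": "core", "count": 4}]},
    "rack": {"type": "rack", "children": [{"template": "node", "count": 2}]},
}
resources = list(expand(templates, {"template": "rack", "count": 1}))
```

The description grows with the number of distinct templates rather than with the number of resources, which is what would shrink the whole-instance JGF.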

jameshcorbett, Nov 21 '25 17:11