
Condensing JGF

Open jameshcorbett opened this issue 1 year ago • 6 comments

JGF is verbose, and Rabbit-y JGF on elcap systems can become very large. We discussed offline several ways to shrink JGF, both by staying within the format and maintaining compatibility with the standard (which one?), and by breaking the standard to achieve greater reductions in size.

Performance tests are necessary to see just how bad the problems are, so we can decide how radical the changes need to be.

jameshcorbett avatar Jul 25 '24 00:07 jameshcorbett

To get this going, here's the JGF specification website: https://jsongraphformat.info/

We're currently using JGF v1, and I don't think we necessarily want to switch to v2 unless there's a material win for doing so. The main things that I can think of (@milroy I'm pretty sure I'm missing at least one of your suggestions, please correct me) that we've discussed are:

  1. we send a lot of default data over the wire. Here's a vertex in flux JGF:

    {
      "id": "2",
      "metadata": {
        "type": "node",
        "basename": "node",
        "name": "node0",
        "id": 0,
        "uniq_id": 2,
        "rank": -1,
        "exclusive": false,
        "unit": "",
        "size": 1,
        "paths": {
          "containment": "/tiny0/rack0/node0"
        }
      }
    },
    

    Just off the top of my head:

    1. no need to send default values
      1. basename defaults to type
      2. name defaults to {type}{id}
      3. rank defaults to -1 when a resource doesn't belong to a rank; we could also use -2 to say "same as ancestor" to avoid sending it in almost all cases
      4. exclusive is almost never used except in allocated JGF, and then defaults to false
      5. unit is almost never used, and defaults to "each" or "empty"
      6. size is almost always 1, except for ram
      7. paths can be calculated entirely from edges, which we're sending separately anyway, as long as we guarantee that JGF always includes all ancestor vertices up to the root
      8. uniq_id always matches the outer "id" field
    2. the inner "id" is the logical id under the parent vertex; if we have the uniq_id, I'm not sure we need it

    With just that, we could send this as essentially equivalent:

    {
      "id": "2",
      "metadata": {
        "type": "node",
      }
    },
    

    Edges are a bit harder, but since we're finally using a bidirectional graph, we should be able to remove the value of the name object, and containment would be the default, so we could take these:

    {
      "source": "2",
      "target": "4",
      "metadata": {
        "name": {
          "containment": "contains"
        }
      }
    },
    {
      "source": "1",
      "target": "5",
      "metadata": {
        "name": {
          "power": "supplies_to"
        }
      }
    },
    

    and make them

    {
      "source": "2",
      "target": "4",
    },
    {
      "source": "1",
      "target": "5",
      "metadata": {
        "type": "power"
      }
    },
    

    All of that is still JGF-compatible.

  2. Structural sharing, or expansion like GRUG does, could help a lot, as @milroy noted. JGF doesn't support this natively, but we could use the metadata to include the information, or use subgraphs with references. The upside is that the vast majority of our vertices are identical below some point, so we could take, say, a node, make an exemplar graph as a separate graph, then have a node that says "template these based on this hostlist" or something, and we'd be sending relatively little more than with RV1

  3. If we're willing to throw the JGF spec out the window, the cheapest, easiest win would be to switch the edge storage from an array of objects with keys to an array of tuple-like arrays like this:

    [
      // [source, target, (optional) type]
      [2,4],
      [1,5,"power"]
    ]
    

    Or we could even switch to a stream of JSON objects, like some other formats use, so we can do the same to vertices and process them without having to build the whole object in memory. However, this will break any and all JGF parsers.
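    As a rough illustration (hypothetical Python helpers, not anything that exists in flux-sched), the default-stripping rules in item 1 could look something like this; the path reconstruction assumes, per 1.7, that all ancestor vertices up to the root are present in the JGF:

    ```python
    def condense_vertex(vertex):
        """Return a copy of `vertex` with default-valued metadata removed."""
        md = dict(vertex["metadata"])
        vtype = md["type"]
        # basename defaults to type; name defaults to {type}{id}
        if md.get("basename") == vtype:
            md.pop("basename")
        if md.get("name") == f"{vtype}{md.get('id')}":
            md.pop("name")
        # rank -1 means "no rank"; exclusive/unit/size have common defaults
        if md.get("rank") == -1:
            md.pop("rank")
        if md.get("exclusive") is False:
            md.pop("exclusive")
        if md.get("unit") == "":
            md.pop("unit")
        if md.get("size") == 1:
            md.pop("size")
        # uniq_id duplicates the outer string "id"
        if str(md.get("uniq_id")) == vertex["id"]:
            md.pop("uniq_id")
        # paths are recoverable from containment edges (see below)
        md.pop("paths", None)
        return {"id": vertex["id"], "metadata": md}

    def condense_edge(edge):
        """Drop the name map entirely when it's the default containment edge."""
        name = edge.get("metadata", {}).get("name", {})
        out = {"source": edge["source"], "target": edge["target"]}
        if name and name != {"containment": "contains"}:
            # keep only the subsystem type, e.g. "power"
            out["metadata"] = {"type": next(iter(name))}
        return out

    def containment_paths(nodes, edges):
        """Rebuild metadata.paths.containment from containment edges."""
        parent = {e["target"]: e["source"]
                  for e in edges
                  if e.get("metadata", {}).get("name", {}).get("containment")}
        names = {n["id"]: n["metadata"]["name"] for n in nodes}
        paths = {}
        for vid in names:
            segs, cur = [], vid
            while cur is not None:
                segs.append(names[cur])
                cur = parent.get(cur)
            paths[vid] = "/" + "/".join(reversed(segs))
        return paths
    ```

    A consumer would apply the inverse (fill defaults back in) when loading, so the on-the-wire form stays within the v1 schema.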

trws avatar Jul 29 '24 21:07 trws

CC: @garlick, @grondo

trws avatar Jul 29 '24 21:07 trws

As I understand it I'm on the hook to try out elcap-sized JGF and see what the performance is like, which I intend to do as soon as I get some pressing rabbit issues out of the way.

jameshcorbett avatar Jul 31 '24 01:07 jameshcorbett

OK so on a representative system:

Before the changes introduced by #1293 and #1297, JGF was 8458557 bytes. After the changes in those two PRs, it dropped to 2122574 bytes. If the path field of vertices could be dropped in all cases (currently it has to be supplied in all cases), that would shrink it to 1298046 bytes. That makes it around 1/7 the original size. Not bad.

jameshcorbett avatar Sep 19 '24 05:09 jameshcorbett

If #1299 goes in, the last obvious work I'm aware of will be to drop each vertex's .metadata.paths.containment, which in theory should be simple; however, I'm not quite sure how to actually implement it. Also, there is some code that fetches all the paths to SSD vertices in JGF in order to mark the vertices as UP or DOWN via set-status RPCs, so that would need some sort of replacement.

jameshcorbett avatar Oct 01 '24 02:10 jameshcorbett

On Hetchy, JGF for jobs appears to be around 15K per node.

jameshcorbett avatar Oct 08 '24 18:10 jameshcorbett

Current Status

This came up again during a dev meeting with @trws and @milroy. With #1406 merged, partial cancel appears to be working on rabbit systems where the resources are described with JGF. That had previously blocked the use of JGF on any large system, where partial cancel would be required. However, now the blocker is the size and resource usage of JGF on large systems.

Tuolumne appears to be somewhat marginal. The archive is large even without JGF. An emulator might enable us to determine what kinds of resource usage to expect if we were to enable JGF. Also, we could potentially enable it for a few days and then disable it, and see how much the archive grows in the meantime.

Templates

For a solution to the problem, one approach would be to add templates for resource types. This could be done in config, or somewhat dynamically. Nodes are a good candidate for template-ization. On Elcap, for instance, there is only one kind of node, since login and compute nodes are the same.

There would need to be a way to separate the template from the attributes of each specific vertex, such as ID, name, path, and properties.
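A minimal sketch of what template expansion might look like, assuming one exemplar vertex per resource type plus a hostlist. The shape of the template and the `expand_template` helper are invented for illustration; nothing like this exists in the JGF spec or in flux-sched today:

```python
import copy

def expand_template(template, hosts, first_uniq_id):
    """Stamp out one vertex per host from a shared exemplar vertex.

    Per-instance attributes (outer id, name, logical id) are filled in;
    everything else is shared via the template.
    """
    vertices = []
    for i, host in enumerate(hosts):
        v = copy.deepcopy(template)
        v["id"] = str(first_uniq_id + i)
        v["metadata"]["name"] = host
        v["metadata"]["id"] = i
        vertices.append(v)
    return vertices

# Expand a single exemplar node vertex across a (made-up) hostlist.
nodes = expand_template(
    {"metadata": {"type": "node"}},
    ["node1", "node2", "node3"],
    first_uniq_id=2,
)
```

On a homogeneous system like Elcap, the sender would then transmit one template plus a hostlist instead of thousands of near-identical vertices.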

Provide a method for a sub-instance to figure out what its complete graph is

When an instance is configured with JGF, it must use the rv1 match format, which makes Fluxion write out an entire JGF subgraph for each job. This allows each child instance to initialize its resources with the JGF subgraph written by the parent instance. However, for correctness, the parent instance would only need to write out an idset of vertices and their paths, and the sub-instance could initialize itself using R without the scheduling key. There would then need to be some method for a sub-instance to request full details about its resources, as seen by the parent instance.
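As a rough sketch of the pruning a parent instance might do when answering such a request: keep only the requested vertices plus their containment ancestors, and the edges among the kept set. The `prune_to_idset` helper is hypothetical (not Fluxion code) and assumes a containment-only edge list:

```python
def prune_to_idset(graph, wanted_ids):
    """Prune a JGF-like graph to `wanted_ids` plus containment ancestors.

    Assumes graph = {"nodes": [...], "edges": [...]} and that every edge
    is a containment edge (parent -> child).
    """
    parent = {e["target"]: e["source"] for e in graph["edges"]}
    keep = set()
    for vid in wanted_ids:
        cur = vid
        while cur is not None and cur not in keep:
            keep.add(cur)
            cur = parent.get(cur)  # walk up to the root
    return {
        "nodes": [n for n in graph["nodes"] if n["id"] in keep],
        "edges": [e for e in graph["edges"]
                  if e["source"] in keep and e["target"] in keep],
    }
```

Including the ancestors preserves the "all vertices up to the root" guarantee that path reconstruction from edges depends on.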

For rabbit scheduling, only the system instance needs to know about rabbits, so child instances could be fine without the rabbit-augmented JGF.

Issues

@milroy mentioned that changing JGF could introduce issues with elasticity. He indicated that the satisfiability module could be used to determine whether or not a new resource is an instance of a template.

jameshcorbett avatar Oct 15 '25 22:10 jameshcorbett