Fluxion considers newly added, down resources as UP

Open · grondo opened this issue · 21 comments

On a couple clusters, the sysadmins added new nodes to the configuration that were not yet available, then restarted Flux. The core resource module tracked the nodes as both drained and down/offline, but Fluxion marked the nodes as available for scheduling (evidenced by these nodes being labeled as free in flux resource list).

RFC 28 stipulates that

The scheduler SHALL only allocate the resources associated with an execution target to jobs if the target is up.

Fluxion appears to have marked resources up without having been notified those resources are up via the resource.acquire protocol.

For reference, here's the resource eventlog entry noting that, initially, no execution targets (nodes) are online:

[Dec19 12:26] resource-init restart=true drain={"5":{"timestamp":1702990012.9007058,"reason":"nodediag failed clocksource"},"11":{"timestamp":1702973787.6204503,"reason":"nodediag failed"},"41-42":{"timestamp":1703007506.748992,"reason":"not installed yet"}} online="" exclude="0-2"

grondo avatar Dec 20 '23 00:12 grondo

@garlick tried a simple reproducer (restarting Flux with only rank 0 up, stopping Flux, adding a couple of bogus nodes, and starting again) and things worked as expected. So this is not a simple case of Fluxion marking resources up before the resource acquisition protocol does.

At this point it isn't clear what went wrong here.

grondo avatar Dec 20 '23 00:12 grondo

Two data points to add:

  • in my test, flux-coral2 was an older version (and it was not an el cap system)
  • the same thing happened on two different clusters where the sys admin performed the same procedure

garlick avatar Dec 20 '23 00:12 garlick

I don't think flux-coral2 would have any effect in this case. At the moment it doesn't mark any resources as up or down.

jameshcorbett avatar Dec 20 '23 02:12 jameshcorbett

I think I may have hit the same issue through slightly different means. While working on https://github.com/flux-framework/flux-coral2/issues/119, I found that if I initialized Fluxion with JGF in which some vertices were given status: down, Fluxion would in fact treat them as up. The supposedly down resources wouldn't show up in flux ion-resource find status=down. Marking them down afterwards with flux ion-resource set-status /path/to/vertex down had the intended effect--Fluxion wouldn't allocate those vertices and it would list them as down.

jameshcorbett avatar Dec 20 '23 19:12 jameshcorbett

OK, more information. I was able to reproduce somewhat in resource-query. I generated R and JGF by running flux start flux R encode -l | flux ion-R encode in my container. To generate JGF I then filtered it through jq .scheduling to select only the scheduling key.

If I initialized resource-query with JGF, it treated vertices as up and down properly. When I initialized it with R, all vertices were marked up.

Example output:

./flux-sched/el8/resource/utilities/resource-query -frv1exec -L basic_R_down.json 
INFO: Loading a matcher: CA
resource-query> find status=down
INFO: =============================
INFO: EXPRESSION="status=down"
INFO: =============================

bash-4.4$ ./flux-sched/el8/resource/utilities/resource-query -fjgf -L basic_jgf_down.json 
INFO: Loading a matcher: CA
resource-query> find status=down
      ---------core2[1:x]
      ---------core3[1:x]
      ---------core4[1:x]
      ------compute-01[1:x]
      ---cluster0[1:x]
INFO: =============================
INFO: EXPRESSION="status=down"
INFO: =============================

Edit: a further twist. I was wondering why some cores didn't show up in the JGF example above.

bash-4.4$ ./flux-sched/el8/resource/utilities/resource-query -fjgf -L basic_jgf_down.json 
INFO: Loading a matcher: CA
resource-query> find status=down
      ---------core2[1:x]
      ---------core3[1:x]
      ---------core4[1:x]
      ------compute-01[1:x]
      ---cluster0[1:x]
INFO: =============================
INFO: EXPRESSION="status=down"
INFO: =============================
resource-query> find status=up
      ---------core0[1:x]
      ---------core1[1:x]
      ------compute-01[1:x]
      ---cluster0[1:x]
INFO: =============================
INFO: EXPRESSION="status=up"
INFO: =============================

This is despite the node compute-01 being marked down in the file.

jameshcorbett avatar Dec 20 '23 20:12 jameshcorbett

That certainly sounds like a good lead!

I'm not sure it's relevant, but on the systems in question (rzvernal and tioga) I believe we are not currently configuring the systems with JGF. I just double-checked, and tioga is using an R containing

{
  "version": 1,
  "execution": {
    "R_lite": [
      {
        "rank": "0",
        "children": {
          "core": "0-23"
        }
      },
      {
        "rank": "1-2",
        "children": {
          "core": "0-63",
          "gpu": "0-7"
        }
      },
      {
        "rank": "3-30",
        "children": {
          "core": "0-63",
          "gpu": "0-7"
        }
      },
      {
        "rank": "31-32",
        "children": {
          "core": "0-95",
          "gpu": "0-3"
        }
      }
    ],
    "starttime": 0.0,
    "expiration": 0.0,
    "nodelist": [
      "tioga[6,10-11,18-41,14-17,42-43]"
    ],
    "properties": {
      "pdebug": "3-26",
      "pci": "27-30",
      "specialnodetype": "31-32"
    }
  }
}

garlick avatar Dec 20 '23 20:12 garlick

Yeah, I knew the clusters didn't use JGF, but it sounds like something is fundamentally funny with the graph initialization. @milroy, any thoughts?

jameshcorbett avatar Dec 20 '23 20:12 jameshcorbett

resource_reader_rv1exec_t::add_vertex() initializes the vertex status to UP. Is that a problem?

https://github.com/flux-framework/flux-sched/blob/fe872c8dc056934e4073b5fb2932335bb69ca73a/resource/readers/resource_reader_rv1exec.cpp#L124

The initial resource.acquire response assumes all resources are DOWN unless they are in the up idset of the response...

grondo avatar Dec 20 '23 21:12 grondo

resource_reader_rv1exec_t::add_vertex() initializes the vertex status to UP. Is that a problem?

The JGF and hwloc readers also initialize the vertices to UP. I'll change the initialization to DOWN and run tests.

milroy avatar Dec 20 '23 21:12 milroy
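
(An aside for illustration: a minimal, self-contained sketch of the one-line default being discussed. The enum and the graph/vertex types here are placeholders, not flux-sched's real schema, but it shows where the UP-vs-DOWN default lives and why defaulting to DOWN matches the acquisition protocol, which only marks a target UP once resource.acquire says so.)

#include <boost/graph/adjacency_list.hpp>

// Hypothetical reader-side default; the real code lives in the readers'
// add_vertex() implementations referenced above.
enum class status_t { UP, DOWN };

struct pool_prop_t {
    status_t status = status_t::UP;   // current default in the readers
};

using graph_t = boost::adjacency_list<boost::vecS, boost::vecS,
                                      boost::directedS, pool_prop_t>;

int main ()
{
    graph_t g;
    auto v = boost::add_vertex (g);

    // Proposed change: start every vertex DOWN and let the resource.acquire
    // up idset flip it to UP later.
    g[v].status = status_t::DOWN;
    return 0;
}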

I've traced the problem to the resource_pool_t copy constructor: https://github.com/flux-framework/flux-sched/blob/fe872c8dc056934e4073b5fb2932335bb69ca73a/resource/schema/resource_data.cpp#L44

The ctor is missing the status field, which means the resource_pool_t default value is used whenever the copy constructor is called: https://github.com/flux-framework/flux-sched/blob/fe872c8dc056934e4073b5fb2932335bb69ca73a/resource/schema/resource_data.hpp#L54

I think this explains the varying manifestation of the error, as the copy constructor is called during graph construction when the vecS boost container is resized.

milroy avatar Dec 21 '23 02:12 milroy
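
(To make the failure mode concrete, here is a self-contained toy, with placeholder names rather than flux-sched's actual resource_pool_t, showing how a copy constructor that omits a member lets that member's in-class default win whenever the std::vector backing a vecS graph reallocates. Copying status explicitly, or defaulting the copy constructor, removes the silent reset.)

#include <cassert>
#include <vector>

enum class status_t { UP, DOWN };

struct pool_t {
    int id = 0;
    status_t status = status_t::UP;   // in-class default, analogous to resource_pool_t

    pool_t () = default;
    // Buggy copy ctor: copies id but forgets status, so every copy
    // silently reverts status to the default (UP).
    pool_t (const pool_t &o) : id (o.id) {}
};

int main ()
{
    std::vector<pool_t> vertices (1);      // vecS stores vertices in a vector
    vertices[0].status = status_t::DOWN;   // vertex explicitly marked DOWN

    // Growing the container (as happens while the graph is being built)
    // copy-constructs the existing elements into the new storage...
    vertices.resize (1024);

    // ...and the DOWN status is lost: the copied vertex comes back UP.
    assert (vertices[0].status == status_t::UP);
    return 0;
}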

@jameshcorbett or @grondo can you give the fix in PR #1125 a try and report what happens?

milroy avatar Dec 21 '23 03:12 milroy

After a bit more investigation it appears there are several different causes for the reported behavior.

  1. Loading rv1exec (with or without the scheduling key) with resource-query: the resources are always UP.
  • This behavior is due to the lack of support for reading status in the rv1exec reader. The default status in the rv1exec reader is set to UP.
  2. Loading rv1exec via resource.acquire. As @grondo noted in the first comment, the only resources that should be marked up initially are those with the up key.
  • This suggests that the default value of status in all readers should be DOWN. I think this will require the current CI tests to be updated as it will change resource-query expected behavior (which assumes all resources are UP).
  • The up key is set upon initial resource acquisition here: https://github.com/flux-framework/flux-sched/blob/fe872c8dc056934e4073b5fb2932335bb69ca73a/resource/modules/resource_match.cpp#L1227 which then executes mark_lazy.
  3. There's ambiguity about the behavior of the three-way interaction when JGF with status fields in the vertices is passed to grow_resource_db, then mark_lazy as part of resource.acquire. And, of course, the default JGF vertex status is also UP.
  4. The resource_pool_t ctor didn't include the status member, as I indicated in the PR and the previous comment, which caused some of the weirdness @jameshcorbett reported in graph initialization.

So there's a lot going on.

milroy avatar Dec 21 '23 09:12 milroy

The resource_pool_t ctor didn't include the status member, as I indicated in the PR and the previous comment, which caused some of the weirdness @jameshcorbett reported in graph initialization.

I think this also caused some of the behavior observed by the sysadmins.

milroy avatar Dec 21 '23 10:12 milroy

Nice work @milroy and @jameshcorbett!

This suggests that the default value of status in all readers should be DOWN. I think this will require the current CI tests to be updated as it will change resource-query expected behavior (which assumes all resources are UP).

If changing the default status to DOWN would be too invasive, would another solution be to mark all resources DOWN in the sched-fluxion-resource module's callback for the first resource.acquire response, then process the up idset and mark only those resources UP? Initializing all resources DOWN is strictly required only when Fluxion is used with the RFC 28 acquisition protocol.

e.g. something like:

diff --git a/resource/modules/resource_match.cpp b/resource/modules/resource_match.cpp
index 0a3f5095..50459b91 100644
--- a/resource/modules/resource_match.cpp
+++ b/resource/modules/resource_match.cpp
@@ -1220,9 +1220,17 @@ static int update_resource_db (std::shared_ptr<resource_ctx_t> &ctx,
     int rc = 0;
     // Will need to get duration update and set graph metadata when
     // resource.acquire duration update is supported in the future.
-    if (resources && (rc = grow_resource_db (ctx, resources)) < 0) {
-        flux_log_error (ctx->h, "%s: grow_resource_db", __FUNCTION__);
-        goto done;
+    if (resources) {
+        if ((rc = grow_resource_db (ctx, resources)) < 0) {
+            flux_log_error (ctx->h, "%s: grow_resource_db", __FUNCTION__);
+            goto done;
+        }
+        /* initially mark all resources DOWN
+         */
+        if ((rc = mark_now (ctx, "all", resource_pool_t::status_t::DOWN)) < 0) {
+            flux_log_error (ctx->h, "%s: mark all down", __FUNCTION__);
+            goto done;
+        }
     }
     if (up && (rc = mark (ctx, up, resource_pool_t::status_t::UP)) < 0) {
         flux_log_error (ctx->h, "%s: mark (up)", __FUNCTION__);

However, the above patch doesn't work as-is -- it seems to cause segfaults in the sched-fluxion-resource module, so maybe this approach isn't viable.

grondo avatar Dec 21 '23 15:12 grondo

I think the reason mark_now causes segfaults is that it requires the resource graph to be initialized: https://github.com/flux-framework/flux-sched/blob/fe872c8dc056934e4073b5fb2932335bb69ca73a/resource/modules/resource_match.cpp#L1212

This seems like the right track, though. I'm exploring options along similar lines.

milroy avatar Dec 22 '23 00:12 milroy
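
(Sketching the split being discussed here, with hypothetical signatures -- the real mark()/mark_lazy()/mark_now() live in resource_match.cpp: the lazy path only records the idset until the graph exists, while the immediate path needs the initialized graph, which is why calling it from the first resource.acquire callback blows up.)

#include <string>

enum class status_t { UP, DOWN };

struct ctx_t {
    bool graph_initialized = false;
    std::string deferred_up_ids;     // idset remembered for later
    // ... graph database, matcher, etc. omitted
};

static int mark_lazy (ctx_t &ctx, const std::string &ids, status_t)
{
    ctx.deferred_up_ids = ids;       // just remember; applied at init time
    return 0;
}

static int mark_now (ctx_t &ctx, const std::string &ids, status_t s)
{
    if (!ctx.graph_initialized)
        return -1;                   // in the real module this touches
                                     // uninitialized graph state
    // ... walk the graph vertices named by `ids` and set their status to `s`
    (void) ids;
    (void) s;
    return 0;
}

static int mark (ctx_t &ctx, const std::string &ids, status_t s)
{
    return ctx.graph_initialized ? mark_now (ctx, ids, s)
                                 : mark_lazy (ctx, ids, s);
}

int main ()
{
    ctx_t ctx;
    mark (ctx, "0-2", status_t::UP);      // before init: deferred
    ctx.graph_initialized = true;
    mark (ctx, "all", status_t::DOWN);    // after init: applied directly
    mark (ctx, ctx.deferred_up_ids, status_t::UP);
    return 0;
}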

I think this line is the problem: https://github.com/flux-framework/flux-sched/blob/fe872c8dc056934e4073b5fb2932335bb69ca73a/resource/modules/resource_match.cpp#L1178

It looks like the early assumption was that nothing would be marked down via the down key in resource.acquire. That's why the down key isn't respected after graph initialization: https://github.com/flux-framework/flux-sched/blob/fe872c8dc056934e4073b5fb2932335bb69ca73a/resource/modules/resource_match.cpp#L1464

milroy avatar Dec 22 '23 01:12 milroy

Correct, the first resource.acquire response should only contain an up idset. If an id is not in that set, then it should be considered down. Subsequent resource.acquire responses can and do have a down idset, though (and that does seem to work in practice).

grondo avatar Dec 22 '23 01:12 grondo
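
(For readers unfamiliar with the acquisition protocol: schematically, a first resource.acquire response carries the resource set plus an up idset -- everything not in "up" starts out down -- while later responses report transitions via up/down idsets. The payloads below are illustrative only, with the R object elided; see RFC 28 for the authoritative schema.)

{
  "resources": { ...the R object, as in the example above... },
  "up": "0-2"
}

and a later response reporting a state change might carry, e.g.:

{
  "down": "5"
}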

I think this line is the problem:

BTW, that is why I tried to use mark_now() instead of mark_lazy() (clearly didn't work though)

grondo avatar Dec 22 '23 01:12 grondo

Correct, the first resource.acquire response should only contain an up idset.

Ok, then I think the resource module is behaving according to the RFC.

  1. The two mark calls in update_resource_db will both execute mark_lazy as desired since the resource graph isn't initialized yet.
  2. Logic then flows from populate_resource_db in init_resource_graph down to initialize.
  3. Then mark sets all resources down via mark_now.
  4. The next block is executed since is_ups_set () is true, and then the mark with the desired up vertices is executed.

milroy avatar Dec 22 '23 08:12 milroy

My hypothesis is that the production issues were caused by the missing member data in the resource_pool_t ctor rather than the resource module. It's possible the JGF reader bug contributed, too, if Flux was using JGF in the scheduling R key.

milroy avatar Dec 22 '23 08:12 milroy

It's possible the JGF reader bug contributed, too, if Flux was using JGF in the scheduling R key.

For the record, the scheduling key was not populated.

garlick avatar Dec 22 '23 14:12 garlick