flux-sched
Fluxion considers newly added, down resources as UP
On a couple clusters, the sysadmins added new nodes to the configuration that were not yet available, then restarted Flux.
The core `resource` module tracked the nodes as both drained and down/offline, but Fluxion marked the nodes as available for scheduling (evidenced by these nodes being labeled as `free` in `flux resource list`).
RFC 28 stipulates that:

> The scheduler SHALL only allocate the resources associated with an execution target to jobs if the target is up.
Fluxion appears to have marked resources up without having been notified those resources are up via the resource.acquire protocol.
For reference, here's the resource eventlog entry noting that, initially, no execution targets (nodes) are online:
```
[Dec19 12:26] resource-init restart=true drain={"5":{"timestamp":1702990012.9007058,"reason":"nodediag failed clocksource"},"11":{"timestamp":1702973787.6204503,"reason":"nodediag failed"},"41-42":{"timestamp":1703007506.748992,"reason":"not installed yet"}} online="" exclude="0-2"
```
@garlick tried a simple reproducer (restarting Flux with only rank 0 up, stopping Flux, adding a couple of bogus nodes, and starting again) and things worked as expected. So this is not a simple case of Fluxion marking resources up before the resource acquisition protocol does.
At this point it isn't clear what went wrong here.
Two data points to add:
- in my test, flux-coral2 was an older version (and it was not an el cap system)
- the same thing happened on two different clusters where the sysadmin performed the same procedure
I don't think flux-coral2 would have any effect in this case. At the moment it doesn't mark any resources as up or down.
I think I may have hit the same issue through slightly different means. While working on https://github.com/flux-framework/flux-coral2/issues/119, I found that if I initialized Fluxion with JGF in which some vertices were given `status: down`, Fluxion would in fact treat them as up. The supposedly down resources wouldn't show up in `flux ion-resource find status=down`. Marking them down afterwards with `flux ion-resource set-status /path/to/vertex down` had the intended effect: Fluxion wouldn't allocate those vertices and it would list them as down.
OK, more information. I was able to reproduce somewhat in resource-query. I generated R and JGF by running `flux start flux R encode -l | flux ion-R encode` in my container. To generate JGF I then filtered it through `jq .scheduling` to select only the `scheduling` key.
If I initialized resource-query with JGF, it treated vertices as up and down properly. When I initialized it with R, all vertices were marked up.
Example output:
```
./flux-sched/el8/resource/utilities/resource-query -frv1exec -L basic_R_down.json
INFO: Loading a matcher: CA
resource-query> find status=down
INFO: =============================
INFO: EXPRESSION="status=down"
INFO: =============================
```

```
bash-4.4$ ./flux-sched/el8/resource/utilities/resource-query -fjgf -L basic_jgf_down.json
INFO: Loading a matcher: CA
resource-query> find status=down
---------core2[1:x]
---------core3[1:x]
---------core4[1:x]
------compute-01[1:x]
---cluster0[1:x]
INFO: =============================
INFO: EXPRESSION="status=down"
INFO: =============================
```
Edit: a further twist. I was wondering why some cores didn't show up in the JGF example above.
```
bash-4.4$ ./flux-sched/el8/resource/utilities/resource-query -fjgf -L basic_jgf_down.json
INFO: Loading a matcher: CA
resource-query> find status=down
---------core2[1:x]
---------core3[1:x]
---------core4[1:x]
------compute-01[1:x]
---cluster0[1:x]
INFO: =============================
INFO: EXPRESSION="status=down"
INFO: =============================
resource-query> find status=up
---------core0[1:x]
---------core1[1:x]
------compute-01[1:x]
---cluster0[1:x]
INFO: =============================
INFO: EXPRESSION="status=up"
INFO: =============================
```
This is despite the node `compute-01` being marked down in the file.
That certainly sounds like a good lead!
I'm not sure it's relevant, but on the systems in question (rzvernal and tioga) I believe we currently are not configuring the systems with JGF. Just double-checked, and tioga is using an R containing:
```json
{
  "version": 1,
  "execution": {
    "R_lite": [
      {
        "rank": "0",
        "children": {
          "core": "0-23"
        }
      },
      {
        "rank": "1-2",
        "children": {
          "core": "0-63",
          "gpu": "0-7"
        }
      },
      {
        "rank": "3-30",
        "children": {
          "core": "0-63",
          "gpu": "0-7"
        }
      },
      {
        "rank": "31-32",
        "children": {
          "core": "0-95",
          "gpu": "0-3"
        }
      }
    ],
    "starttime": 0.0,
    "expiration": 0.0,
    "nodelist": [
      "tioga[6,10-11,18-41,14-17,42-43]"
    ],
    "properties": {
      "pdebug": "3-26",
      "pci": "27-30",
      "specialnodetype": "31-32"
    }
  }
}
```
Yeah, I knew the clusters didn't use JGF, but it sounds like something is fundamentally funny with the graph initialization. @milroy, any thoughts?
`resource_reader_rv1exec_t::add_vertex()` initializes the vertex status to `UP`. Is that a problem?

https://github.com/flux-framework/flux-sched/blob/fe872c8dc056934e4073b5fb2932335bb69ca73a/resource/readers/resource_reader_rv1exec.cpp#L124
The initial `resource.acquire` response assumes all resources are `DOWN` unless they are in the `up` idset of the response...
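As a rough illustration of that rule, here is a minimal sketch, not flux-sched code; the names `target_status_t` and `apply_initial_acquire` are made up. On the first response, anything absent from the `up` idset starts out DOWN.

```cpp
// Hypothetical sketch of RFC 28 semantics for the *first* resource.acquire
// response: only execution targets listed in the "up" idset are UP; every
// other known target defaults to DOWN.
#include <cassert>
#include <cstdint>
#include <map>
#include <set>

enum class target_status_t { DOWN, UP };

std::map<uint32_t, target_status_t>
apply_initial_acquire (const std::set<uint32_t> &all_ranks,
                       const std::set<uint32_t> &up_ranks)
{
    std::map<uint32_t, target_status_t> status;
    for (uint32_t rank : all_ranks)
        // Absence from "up" means DOWN on the initial response.
        status[rank] = up_ranks.count (rank) ? target_status_t::UP
                                             : target_status_t::DOWN;
    return status;
}

int main ()
{
    // e.g. ranks 0-3 exist in R, but only rank 0 is in the initial "up" idset
    auto status = apply_initial_acquire ({0, 1, 2, 3}, {0});
    assert (status[0] == target_status_t::UP);
    assert (status[3] == target_status_t::DOWN);
    return 0;
}
```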
> `resource_reader_rv1exec_t::add_vertex()` initializes the vertex status to `UP`. Is that a problem?
The JGF and hwloc readers also initialize the vertices to `UP`. I'll change the initialization to `DOWN` and run tests.
I've traced the problem to the `resource_pool_t` copy constructor: https://github.com/flux-framework/flux-sched/blob/fe872c8dc056934e4073b5fb2932335bb69ca73a/resource/schema/resource_data.cpp#L44

The ctor is missing the `status` field, which means the `resource_pool_t` default value is used whenever the copy constructor is called: https://github.com/flux-framework/flux-sched/blob/fe872c8dc056934e4073b5fb2932335bb69ca73a/resource/schema/resource_data.hpp#L54

I think this explains the varying manifestation of the error, as the copy constructor is called during graph construction when the `vecS` boost container is resized.
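For illustration, here is a minimal standalone example of that class of bug, using a made-up `pool_t` rather than the real `resource_pool_t`: a copy constructor that omits one member silently resets it to the in-class default on every copy, which is what a container resize would trigger.

```cpp
// Standalone demo of the bug pattern described above (hypothetical types):
// a copy constructor that omits a member, so copies fall back to that
// member's default value.
#include <cassert>

enum class status_t { UP, DOWN };

struct pool_t {
    int id = 0;
    status_t status = status_t::UP;       // in-class default

    pool_t () = default;
    // Buggy copy ctor: copies id but not status, so every copy comes out UP
    // regardless of the source object's state.
    pool_t (const pool_t &o) : id (o.id) {}
};

int main ()
{
    pool_t p;
    p.status = status_t::DOWN;            // vertex explicitly marked down

    pool_t copy (p);                      // e.g. a container resize copies vertices
    assert (copy.id == p.id);
    assert (copy.status == status_t::UP); // DOWN status silently lost
    return 0;
}
```

Built with assertions enabled, the second assert documents exactly the observed symptom: an object that was explicitly DOWN appears UP after a copy.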
@jameshcorbett or @grondo, can you give the fix in PR #1125 a try and report what happens?
After a bit more investigation, it appears there are several different causes for the reported behavior.

- Loading `rv1exec` (with or without the scheduling key) with `resource-query`: the resources are always `UP`.
  - This behavior is due to lack of support for reading status with the `rv1exec` reader. The default status in the `rv1exec` reader is set to `UP`.
- Loading `rv1exec` via `resource.acquire`: as @grondo noted in the first comment, the only resources that should be marked `up` initially are those with the `up` key.
  - This suggests that the default value of `status` in all readers should be `DOWN`. I think this will require the current CI tests to be updated, as it will change `resource-query` expected behavior (which assumes all resources are `UP`).
- The `up` key is set upon initial resource acquisition here: https://github.com/flux-framework/flux-sched/blob/fe872c8dc056934e4073b5fb2932335bb69ca73a/resource/modules/resource_match.cpp#L1227 which then executes `mark_lazy`.
- There's ambiguity about the behavior of the three-way interaction when JGF with `status` fields in the vertices is passed to `grow_resource_db`, then `mark_lazy` as part of `resource.acquire`. And then of course the default JGF vertex status is also `UP`.
- The `resource_pool_t` ctor didn't include the `status` member, as I indicated in the PR and the previous comment, which caused some of the weirdness @jameshcorbett reported in graph initialization.
So there's a lot going on.
> The `resource_pool_t` ctor didn't include the `status` member, as I indicated in the PR and the previous comment, which caused some of the weirdness @jameshcorbett reported in graph initialization.

I think this also caused some of the behavior observed by the sysadmins.
Nice work @milroy and @jameshcorbett!
> This suggests that the default value of `status` in all readers should be `DOWN`. I think this will require the current CI tests to be updated, as it will change `resource-query` expected behavior (which assumes all resources are `UP`).
If changing the default status to `DOWN` would be too invasive, would another solution be to mark all resources `DOWN` in the `sched-fluxion-resource` module's callback for the first `resource.acquire` response, then process the `up` idset and mark only those resources `UP`? Initializing all resources `DOWN` is strictly required only when used with the RFC 28 acquisition protocol.
e.g. something like:
```diff
diff --git a/resource/modules/resource_match.cpp b/resource/modules/resource_match.cpp
index 0a3f5095..50459b91 100644
--- a/resource/modules/resource_match.cpp
+++ b/resource/modules/resource_match.cpp
@@ -1220,9 +1220,17 @@ static int update_resource_db (std::shared_ptr<resource_ctx_t> &ctx,
     int rc = 0;
     // Will need to get duration update and set graph metadata when
     // resource.acquire duration update is supported in the future.
-    if (resources && (rc = grow_resource_db (ctx, resources)) < 0) {
-        flux_log_error (ctx->h, "%s: grow_resource_db", __FUNCTION__);
-        goto done;
+    if (resources) {
+        if ((rc = grow_resource_db (ctx, resources)) < 0) {
+            flux_log_error (ctx->h, "%s: grow_resource_db", __FUNCTION__);
+            goto done;
+        }
+        /* initially mark all resources DOWN
+         */
+        if ((rc = mark_now (ctx, "all", resource_pool_t::status_t::DOWN)) < 0) {
+            flux_log_error (ctx->h, "%s: mark all down", __FUNCTION__);
+            goto done;
+        }
     }
     if (up && (rc = mark (ctx, up, resource_pool_t::status_t::UP)) < 0) {
         flux_log_error (ctx->h, "%s: mark (up)", __FUNCTION__);
```
However, the above patch doesn't work -- it seems to cause segfaults in the `sched-fluxion-resource` module, so maybe this approach won't work.
I think the reason `mark_now` causes segfaults is that it requires the resource graph to be initialized: https://github.com/flux-framework/flux-sched/blob/fe872c8dc056934e4073b5fb2932335bb69ca73a/resource/modules/resource_match.cpp#L1212

This seems like the right track, though. I'm exploring options along similar lines.
I think this line is the problem: https://github.com/flux-framework/flux-sched/blob/fe872c8dc056934e4073b5fb2932335bb69ca73a/resource/modules/resource_match.cpp#L1178

It looks like the early assumption was that nothing would be marked down via the `down` key in `resource.acquire`. That's why the `down` key isn't respected after graph initialization: https://github.com/flux-framework/flux-sched/blob/fe872c8dc056934e4073b5fb2932335bb69ca73a/resource/modules/resource_match.cpp#L1464
Correct, the first `resource.acquire` response should only contain an `up` idset. If an id is not in that set, then it should be considered down. Subsequent `resource.acquire` responses can and do have a `down` idset, though (and that does seem to work in practice).
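To complement the earlier sketch of the initial response, here is a hedged sketch (the name `apply_subsequent_acquire` is made up; this is not the flux-sched implementation) of how a later response's explicit `up`/`down` idsets would be applied on top of existing state:

```cpp
// Hypothetical sketch: subsequent resource.acquire responses carry explicit
// "up" and/or "down" idsets, and only the listed targets change state;
// everything not mentioned keeps its previous status.
#include <cassert>
#include <cstdint>
#include <map>
#include <set>

enum class target_status_t { DOWN, UP };

void apply_subsequent_acquire (std::map<uint32_t, target_status_t> &status,
                               const std::set<uint32_t> &up_ranks,
                               const std::set<uint32_t> &down_ranks)
{
    for (uint32_t rank : up_ranks)
        status[rank] = target_status_t::UP;
    for (uint32_t rank : down_ranks)
        status[rank] = target_status_t::DOWN;
}

int main ()
{
    std::map<uint32_t, target_status_t> status = {
        {0, target_status_t::UP}, {1, target_status_t::DOWN}};
    apply_subsequent_acquire (status, /*up=*/{1}, /*down=*/{0});
    assert (status[0] == target_status_t::DOWN);
    assert (status[1] == target_status_t::UP);
    return 0;
}
```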
> I think this line is the problem:
BTW, that is why I tried to use `mark_now()` instead of `mark_lazy()` (clearly didn't work though).
> Correct, the first `resource.acquire` response should only contain an `up` idset.
Ok, then I think the resource module is behaving according to the RFC.
- The two `mark` calls in `update_resource_db` will both execute `mark_lazy` as desired, since the resource graph isn't initialized yet.
- Logic then flows from `populate_resource_db` in `init_resource_graph` down to initialize.
- Then `mark` sets all resources down via `mark_now`.
- The next block is executed since `is_ups_set ()` is `true`, and then the `mark` with the desired `up` vertices is executed (a simplified sketch of this ordering follows the list).
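Here is the simplified sketch promised above. All names and types are illustrative stand-ins, not the actual `resource_match.cpp` code: `mark` records marks lazily while the graph is uninitialized, graph initialization populates vertices at the reader default, everything is then marked DOWN, and finally the lazily recorded `up` ranks are applied.

```cpp
// Simplified, hypothetical sketch of the initialization ordering described
// in the list above; not the flux-sched implementation.
#include <iostream>
#include <map>
#include <set>

enum class status_t { DOWN, UP };

struct ctx_t {
    bool graph_initialized = false;
    std::map<int, status_t> vertex_status;   // stands in for the resource graph
    std::map<int, status_t> pending;         // marks recorded by the lazy path
};

// Immediate marking: requires the graph (vertex map) to already be populated.
void mark_now (ctx_t &ctx, const std::set<int> &ranks, status_t s)
{
    for (int r : ranks)
        ctx.vertex_status.at (r) = s;        // fails if the graph isn't built yet
}

// Lazy marking: just remember the marks until the graph is initialized.
void mark_lazy (ctx_t &ctx, const std::set<int> &ranks, status_t s)
{
    for (int r : ranks)
        ctx.pending[r] = s;
}

// Dispatch in the spirit of the flow described above.
void mark (ctx_t &ctx, const std::set<int> &ranks, status_t s)
{
    if (!ctx.graph_initialized)
        mark_lazy (ctx, ranks, s);
    else
        mark_now (ctx, ranks, s);
}

void init_resource_graph (ctx_t &ctx, const std::set<int> &all_ranks)
{
    // populate the graph; vertices come up with the reader default (UP)
    for (int r : all_ranks)
        ctx.vertex_status[r] = status_t::UP;
    ctx.graph_initialized = true;

    // start from all-DOWN, then apply the lazily recorded marks (the up idset)
    mark_now (ctx, all_ranks, status_t::DOWN);
    if (!ctx.pending.empty ())
        for (const auto &p : ctx.pending)
            ctx.vertex_status.at (p.first) = p.second;
}

int main ()
{
    ctx_t ctx;
    mark (ctx, {0, 1}, status_t::UP);        // up idset from the first resource.acquire
    init_resource_graph (ctx, {0, 1, 2, 3});
    for (const auto &p : ctx.vertex_status)
        std::cout << "rank " << p.first << ": "
                  << (p.second == status_t::UP ? "UP" : "DOWN") << "\n";
    return 0;
}
```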
My hypothesis is that the production issues were caused by the missing member data in the `resource_pool_t` ctor rather than by the resource module. It's possible the JGF reader bug contributed, too, if Flux was using JGF in the scheduling R key.
> It's possible the JGF reader bug contributed, too, if Flux was using JGF in the scheduling R key.
For the record, the scheduling key was not populated.