James Corbett comments


Tolerate some node-specific failures in job prolog

> After typing that up, I realized one issue with a special exit code is that exiting early from the prolog due to a potentially nonfatal error is probably not...

DOWN vertices not reconsidered after coming back UP

I'll check the match policy, but I'll also see if I can reproduce locally.

DOWN vertices not reconsidered after coming back UP

@behlendorf has been seeing this issue repeatedly on Hetchy. He submits a rabbit job, i.e. one that has `ssd` entries in the jobspec. However, all `ssd` vertices are marked down,...

Partial cancel not releasing rabbit resources (?)

I reloaded the resource and fluxion modules and scheduling went back to working as expected at first, but then, as I ran jobs, they eventually became stuck in SCHED. ...

Partial cancel not releasing rabbit resources (?)

The issue seems to have been introduced between 0.36.1 and 0.37.0.

Partial cancel not releasing rabbit resources (?)

I can reproduce in the flux-coral2 environment locally or on LC clusters, but there are a bunch of plugins loaded. The simplest thing I have is the following I think....

Partial cancel not releasing rabbit resources (?)

@milroy I have a branch in my fork that repros the issue: https://github.com/jameshcorbett/flux-sched/tree/issue-1284

Interestingly, while fooling around with it I noticed that the issue only comes up if the jobspec...

Partial cancel not releasing rabbit resources (?)

Ok @milroy I think I have an improved reproducer at https://github.com/jameshcorbett/flux-sched/tree/issue-1284

Partial cancel not releasing rabbit resources (?)

With this patch that @trws and I talked about

```
diff --git a/qmanager/policies/base/queue_policy_base.hpp b/qmanager/policies/base/queue_policy_base.hpp
index 6fa2e44d..e9fd1166 100644
--- a/qmanager/policies/base/queue_policy_base.hpp
+++ b/qmanager/policies/base/queue_policy_base.hpp
@@ -666,7 +666,7 @@ class queue_policy_base_t : public resource_model::queue_adapter_base_t...
```

Partial cancel not releasing rabbit resources (?)

OK, interesting. I will have to try it out. I just realized that we're switching the EAS clusters to the `rv1` match format, from `rv1_nosched`. Is partial cancel going to...