Mark Grondona
> I might be trivializing, but I thought partial release from the housekeeping exec to the scheduler could be implemented as it is here, except we'd support resource-update to show...
> I think that using `flux logger` and `flux resource drain` in the housekeeping jobs would be acceptable. My worry is that unanticipated errors from a script or set or...
If these housekeeping scripts are emulated like jobs, could we just log the errors in an output eventlog? Or are they only jobs as far as job-list is concerned, so...
> It seems like whenever we talk about the current prolog / epilog it's described as something that you're not quite happy with. Would this be further cementing that implementation,...
> FWIW I added a default 30m timeout to the systemd unit file and also code to drain the node when the unit start fails.

Great! Just FYI the timeout...
> I thought I had tested earlier that the IMP hung around and forwarded signals, and that it would accept signals from the flux user, but the IMP does not seem...
Hm, let me remind myself how the IMP persistence works real quick.
Oh, it is `flux-imp exec` that lingers, not `flux-imp run`. You could try `flux-imp kill` in this situation if the target PID is in a cgroup owned by the flux...
This issue occurred again today on elcap. Again ~1000 nodes affected.
More info on Slurm's Multi Category Security [here](https://slurm.schedmd.com/mcs.html). Use of MCS for sharing nodes implies that node exclusive scheduling is not being used. We haven't enabled that in Flux on...