Mark Grondona comments

Results: 605 comments of Mark Grondona

WIP: job-manager: add support for housekeeping scripts with partial release of resources

> I might be trivializing, but I thought partial release from the housekeeping exec to the scheduler could be implemented as it is here, except we'd support resource-update to show...

WIP: job-manager: add support for housekeeping scripts with partial release of resources

> I think that using flux logger and flux resource drain in the housekeeping jobs would be acceptable

My worry is that unanticipated errors from a script or set or...

WIP: job-manager: add support for housekeeping scripts with partial release of resources

If these housekeeping scripts are emulated like jobs, could we just log the errors in an output eventlog? Or are they only jobs as far as job-list is concerned, so...

WIP: job-manager: add support for housekeeping scripts with partial release of resources

> It seems like whenever we talk about the current prolog / epilog it's described as something that you're not quite happy with. Would this be further cementing that implementation,...

WIP: job-manager: add support for housekeeping scripts with partial release of resources

> FWIW I added a default 30m timeout to the systemd unit file and also code to drain the node when the unit start fails.

Great! Just FYI the timeout...

WIP: job-manager: add support for housekeeping scripts with partial release of resources

> thought I had tested earlier that the imp hung around and forwarded signals, and that it would accept signals from the flux user but the imp does not seem...

WIP: job-manager: add support for housekeeping scripts with partial release of resources

Hm, let me remind myself how the IMP persistence works real quick.

WIP: job-manager: add support for housekeeping scripts with partial release of resources

Oh, it is `flux-imp exec` that lingers, not `flux-imp run`. You could try `flux-imp kill` in this situation if the target PID is in a cgroup owned by the flux...

1000 elcap nodes are shown as "lost connection" on rank 0 but compute node thinks it is still connected

This issue occurred again today on elcap. Again ~1000 nodes affected.

use case: core scheduled system where users cannot share nodes

More info on Slurm's Multi Category Security [here](https://slurm.schedmd.com/mcs.html). Use of MCS for sharing nodes implies that node exclusive scheduling is not being used. We haven't enabled that in Flux on...