flux-core
use case: core scheduled system where users cannot share nodes
Problem: some CI jobs benefit from being core scheduled for efficiency, but for security reasons, they need to avoid sharing nodes with other users.
I'm not sure I recall ever talking about this use case before but it seems straightforward. Would it be the scheduler's job to enforce the constraint?
Maybe there are other fun ways to accomplish the same thing with existing tools. :shrug:
Building on this for context: some other institutions have recently started using the MCSPlugin in Slurm to get the benefits of a CPU scheduled system (more tightly packing small jobs from users -- mostly for CI/CD and development purposes) without compromising on the security of node scheduled systems.
It would be really great if we could implement this at LLNL to improve the throughput of small jobs on dev clusters.
More info on Slurm's Multi Category Security here.
Use of MCS for sharing nodes implies that node-exclusive scheduling is not being used. We haven't enabled that in Flux on any of our clusters yet. I don't think Fluxion supports a different policy per queue, but I may be wrong, and it may not be that difficult to overcome that limitation (e.g. using the frobnicator).
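For illustration, the per-queue rewrite a frobnicator could apply might be as simple as forcing a scheduler constraint onto anything submitted to the ci queue. This is a plain-Python sketch only -- it is not the actual frobnicator plugin API, and the injected constraint here is just a placeholder:

```python
# Hypothetical sketch: a standalone function showing the kind of per-queue
# jobspec rewrite a frobnicator-style plugin could perform.  The dict layout
# follows the jobspec "attributes.system" convention; the constraint payload
# is a placeholder, not an existing Fluxion feature.
import json

# Assumed admin-provided mapping: queue name -> constraint to force onto jobs.
QUEUE_CONSTRAINTS = {
    "ci": {"properties": ["ci"]},  # placeholder constraint for illustration
}

def frob(jobspec: dict) -> dict:
    """Inject a queue-specific constraint into attributes.system.constraints."""
    system = jobspec.setdefault("attributes", {}).setdefault("system", {})
    constraint = QUEUE_CONSTRAINTS.get(system.get("queue"))
    if constraint is not None:
        existing = system.get("constraints")
        # AND any user-supplied constraint with the queue-mandated one
        system["constraints"] = (
            constraint if existing is None else {"and": [existing, constraint]}
        )
    return jobspec

if __name__ == "__main__":
    js = {"attributes": {"system": {"queue": "ci"}}}
    print(json.dumps(frob(js), indent=2))
```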
This does seem like something that would have to be supported by the scheduler. One idea would be to have a new constraint operator for this purpose, i.e. some kind of security label check that could check the user, account, or an arbitrary label of any jobs already assigned to the given node. We might have to be careful how that one's defined, since you'd want to make it easy for the user (or a frobnicator) to add a constraint that matches "empty, or label/user/account matches X".
If used for CI purposes, I'm guessing what we'd want is to be able to allow non-exclusive scheduling in a ci queue, with some kind of automatic constraint that avoids scheduling different users, groups, accounts, etc. on the same nodes?
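For example, an RFC 31-style constraint object for this might look roughly like the sketch below. The `or` structure is real RFC 31 syntax, but `empty` and `allocated-user` are made-up operator names standing in for the hypothetical security-label check:

```python
import json

# Hypothetical constraint: match nodes that either have no jobs assigned, or
# whose assigned jobs all belong to user "alice".  The "or" nesting follows
# RFC 31; "empty" and "allocated-user" are made-up operators that a
# scheduler-side security-label check would have to implement.
constraint = {
    "or": [
        {"empty": []},                   # no jobs currently on the node
        {"allocated-user": ["alice"]},   # or all jobs on it belong to alice
    ]
}
print(json.dumps({"attributes": {"system": {"constraints": constraint}}}, indent=2))
```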
I do know @trws has some valid concerns about the current implementation of constraints, so it might be best to get his opinion here. Maybe something that sets dynamic properties on nodes would be a bit simpler at first.
Ah, my apologies @alecbcs, I might have misled you yesterday -- I thought we supported different match policies per queue. Although, as @grondo said, it might be possible to overcome that limitation if the only policy difference between the ci queue and batch or debug is node exclusivity.
It depends on what you mean by policy in this context. We don't currently support different match policies, but we do support different queueing policies. We've given some thought to how to do something like this in the past, because it would certainly be an efficiency improvement. It ended up on the back burner largely because we get a somewhat similar effect by running a user-level flux instance that isn't node-exclusive under the node-exclusive system scheduler. This would let us be much more flexible though.
As a first cut, as long as the resource module receives a "user" value somehow, we could set the user as a property on a node as part of allocating any part of the node. I kinda like the idea of doing that just in general; it would mean we could ask questions like "list all nodes user X is running on". The trick would be adding something to mean "only worry about node exclusivity if this property is not set", and I'm not sure how we'd go about that yet.
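As an aside, the "which nodes is user X running on" question can already be approximated from the job list with the Python bindings; a quick sketch (the exact `JobList` arguments may differ between flux-core versions):

```python
# Sketch: approximate "list all nodes user X is running on" from the job list.
import sys
import flux
from flux.job import JobList
from flux.hostlist import Hostlist

user = sys.argv[1] if len(sys.argv) > 1 else "alice"
h = flux.Flux()
nodes = set()
for job in JobList(h, user=user, filters=["running"]).jobs():
    if job.nodelist:                      # expand RFC 29 hostlist, e.g. "node[1-3]"
        nodes.update(Hostlist(job.nodelist))
print("\n".join(sorted(nodes)))
```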
Thanks @trws. I'll just note here that the Slurm implementation allows matching on user, group, account, or a generic "security label" (e.g. maybe project), so there may need to be some way to collect a set of job metadata and install it as something that can be part of a constraint or some other match parameter.
That's really good to know. If we go with the property approach, that should be pretty trivial (as would arbitrary other ones) if we plan it as a more general mechanism. Our label implementation is a generic string=>string map, and we can do matches on labels and their values, so that would be a pretty good way to start (I could see reasons not to do it that way longer term but I think it would get us pretty far). Maybe we could add something to jobspec or in the generic attributes for "ephemeral labels" that are associated with a job and apply to a resource or the node? Then it's just two additions to fluxion:
- Read those and apply them, probably an hour's work with adding tests
- Figure out how to do "allow nodes with these labels non-exclusively in exclusive mode", which also includes tweaking how we do exclusive scheduling so we don't mark the node exclusive for real but check exclusivity instead (pretty sure we have everything we need for this, it just doesn't work that way right now)
I could see the first of these being useful for other things actually, like a user who prefers certain workloads to share nodes and others not to, even in user-mode. They could just set a label like `running_huge_mem` and a requirement of `not label:running_huge_mem` to avoid co-scheduling things that each need more than half the RAM, maybe.
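To make that concrete, the jobspec fragments might look something like the sketch below. The `attributes.system.constraints` location is where constraints live today, but the `labels` key and the `label` operator are hypothetical:

```python
import json

# Hypothetical jobspec fragments: "labels" under attributes.system and the
# "label" constraint operator do not exist today; only the overall
# attributes.system.constraints location matches the current jobspec layout.
memory_hog = {
    "attributes": {
        "system": {
            "labels": {"running_huge_mem": "1"},   # ephemeral label on the job
        }
    }
}
small_job = {
    "attributes": {
        "system": {
            # avoid nodes carrying the running_huge_mem label
            "constraints": {"not": [{"label": ["running_huge_mem"]}]},
        }
    }
}
print(json.dumps(memory_hog, indent=2))
print(json.dumps(small_job, indent=2))
```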
From the user perspective, I think any of these proposed solutions work for me as long as we can enforce exclusivity between users at the admin level to meet the security requirements to turn this on. (E.g. so users can't disable the policy and get on a node with another user.)
This could have a big impact for some of the devops workloads in WSC, since those typically run under a service user per project, so they could share nodes between a bunch of jobs from the same team/project and greatly improve throughput.
Ok, it wasn't clear if the "security label" we were wanting to use initially was just "same user". That may be a problem we could solve "in a fun way with existing tools" as @garlick put it.
@alecbcs, can you give extra detail on the specifics of this particular use case? Is it the case that CI jobs are being allocated nodes exclusively, but they aren't using all resources, or do the CI users want to be able to oversubscribe resources by some factor on allocated nodes to allow overloading N CI jobs onto the nodes?
In either case, we might be able to come up with a tool Jacamar could use to find an allocation that matches the existing user and isn't already oversubscribed, and submit the CI job to that instance immediately instead of queueing up a job in the system instance.
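A very rough sketch of what that tool could look like with the Python bindings plus `flux proxy`, assuming a ci queue and a placeholder `ci-task.sh` command (a real tool would also need to check that the target instance isn't already oversubscribed, and the `JobList` arguments may differ by flux-core version):

```python
# Sketch: find a running allocation owned by the current user in the "ci"
# queue and run the CI job inside that instance via flux proxy, instead of
# queueing a new job in the system instance.
import getpass
import subprocess
import sys

import flux
from flux.job import JobList

h = flux.Flux()
user = getpass.getuser()
running = JobList(h, user=user, filters=["running"]).jobs()
candidates = [j for j in running if getattr(j, "queue", "") == "ci"]
if not candidates:
    sys.exit("no running allocation to reuse; fall back to a normal submit")

target = candidates[0].id.f58   # jobid in f58 form, accepted by flux proxy
# Placeholder CI command; a real tool would pass through the Jacamar job.
subprocess.run(
    ["flux", "proxy", target, "flux", "run", "-n1", "ci-task.sh"],
    check=True,
)
```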