
Generalizing to other job schedulers

Open jakirkham opened this issue 6 years ago • 6 comments

As there are many different job schedulers used in HPC, it would be interesting to know to what extent the work here could be generalized to other job schedulers to cover more use cases. For instance, what would it take to get this to work on SGE, LSF, or some other arbitrary job scheduler? Would it be possible to parameterize things a bit? To what extent is this tied to SLURM specifically? Thanks in advance for your thoughts. 🙂

jakirkham avatar Jun 20 '19 05:06 jakirkham

Hello @jakirkham,

The only parts tied to SLURM specifically are red-box and, to a lesser degree, the virtual-kubelet provider. The core logic lives in red-box, which implements the WorkloadManager interface, and the remaining components use that interface to communicate. So if anyone wants to extend this, a new WorkloadManager implementation (a new red-box) is the way to go :)

sashayakovtseva avatar Jun 20 '19 05:06 sashayakovtseva

Potentially, the operator can work with any WLM. The only thing you need to do is implement a gRPC server that conforms to our workload.proto spec, and use your implementation instead of red-box (which is really just the workload.proto implementation for SLURM).
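To illustrate the idea of swapping the backend behind one interface, here is a minimal Go sketch. It is not the project's actual API: the `WorkloadManager` method names, `JobState` values, and the in-memory `memoryWLM` backend are all assumptions made up for illustration, and a real implementation would be a gRPC server generated from workload.proto that shells out to SLURM, SGE, LSF, and so on.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// JobState is a scheduler-neutral job status (hypothetical, not taken from workload.proto).
type JobState string

const (
	StatePending   JobState = "PENDING"
	StateCancelled JobState = "CANCELLED"
)

// ErrUnknownJob is returned when a job ID was never submitted.
var ErrUnknownJob = errors.New("unknown job")

// WorkloadManager is an illustrative interface; the method set is an assumption,
// not the real service definition.
type WorkloadManager interface {
	Submit(script, queue string) (int64, error)
	State(jobID int64) (JobState, error)
	Cancel(jobID int64) error
}

// memoryWLM is an in-memory stand-in for a real backend such as SLURM, SGE, or LSF.
type memoryWLM struct {
	mu     sync.Mutex
	nextID int64
	jobs   map[int64]JobState
}

func newMemoryWLM() *memoryWLM {
	return &memoryWLM{jobs: make(map[int64]JobState)}
}

// Submit records the job as pending; this fake backend ignores the queue name.
func (m *memoryWLM) Submit(script, queue string) (int64, error) {
	if script == "" {
		return 0, errors.New("empty job script")
	}
	m.mu.Lock()
	defer m.mu.Unlock()
	m.nextID++
	m.jobs[m.nextID] = StatePending
	return m.nextID, nil
}

// State returns the last recorded state of a job.
func (m *memoryWLM) State(jobID int64) (JobState, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	s, ok := m.jobs[jobID]
	if !ok {
		return "", ErrUnknownJob
	}
	return s, nil
}

// Cancel marks a known job as cancelled.
func (m *memoryWLM) Cancel(jobID int64) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if _, ok := m.jobs[jobID]; !ok {
		return ErrUnknownJob
	}
	m.jobs[jobID] = StateCancelled
	return nil
}

// submitAndCancel depends only on the interface, so any backend can be swapped in.
func submitAndCancel(w WorkloadManager, script, queue string) (JobState, error) {
	id, err := w.Submit(script, queue)
	if err != nil {
		return "", err
	}
	if err := w.Cancel(id); err != nil {
		return "", err
	}
	return w.State(id)
}

func main() {
	state, err := submitAndCancel(newMemoryWLM(), "#!/bin/sh\nhostname\n", "debug")
	fmt.Println(state, err)
}
```

The caller (`submitAndCancel`) never names a concrete scheduler, which is the property that would let a different backend replace the SLURM-specific one.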

pisarukv avatar Jun 20 '19 08:06 pisarukv

@jakirkham Thanks for stopping by! I just want to say that we'd be more than happy to work with the community and accept contributions that bring other WLMs into this architecture.

bauerm97 avatar Jun 20 '19 13:06 bauerm97

Great discussion. Just wondering whether a generic implementation on top of an open standard like DRMAA would be useful for that -> https://github.com/dgruber/drmaa

dgruber avatar Jun 20 '19 14:06 dgruber

Actually, we have taken a look at DRMAA. The second version (DRMAA2) looks perfect for us, but it does not seem to be widely used. DRMAA v1 seems to be missing some features that are important to us. For example, it is very important for us to be able to get information about WLM partitions (queues) and the resources they have. At the moment I'm not sure whether that is possible with the first version.
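As a rough illustration of why partition and resource information matters, the Go sketch below picks a partition that can satisfy a job's resource request. All types and fields here (`Partition`, `Request`, `pickPartition`) are hypothetical and chosen for the example; they are not taken from workload.proto or from either DRMAA version.

```go
package main

import "fmt"

// Partition describes one WLM partition (queue). The fields are hypothetical.
type Partition struct {
	Name         string
	Nodes        int
	CPUsPerNode  int
	MemPerNodeMB int64
}

// Request is the amount of resources a job asks for (hypothetical shape).
type Request struct {
	Nodes        int
	CPUsPerNode  int
	MemPerNodeMB int64
}

// fits reports whether the partition has at least the requested resources.
func (p Partition) fits(r Request) bool {
	return p.Nodes >= r.Nodes &&
		p.CPUsPerNode >= r.CPUsPerNode &&
		p.MemPerNodeMB >= r.MemPerNodeMB
}

// pickPartition returns the first partition that can satisfy the request.
// The boolean is false when no partition fits.
func pickPartition(parts []Partition, r Request) (Partition, bool) {
	for _, p := range parts {
		if p.fits(r) {
			return p, true
		}
	}
	return Partition{}, false
}

func main() {
	// Example data; in practice this would come from querying the WLM.
	parts := []Partition{
		{Name: "debug", Nodes: 2, CPUsPerNode: 4, MemPerNodeMB: 8192},
		{Name: "batch", Nodes: 64, CPUsPerNode: 32, MemPerNodeMB: 131072},
	}
	req := Request{Nodes: 8, CPUsPerNode: 16, MemPerNodeMB: 65536}

	if p, ok := pickPartition(parts, req); ok {
		fmt.Println("job can run on partition:", p.Name)
	} else {
		fmt.Println("no partition satisfies the request")
	}
}
```

Without a way to list partitions and their resources, this kind of placement decision cannot be made from outside the scheduler.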

pisarukv avatar Jun 20 '19 15:06 pisarukv

Yeah, agreed. Adoption could be better. I started a generic implementation of DRMAA2 in Go (https://github.com/dgruber/drmaa2os). An initial CLI wrapper for SLURM exists (https://github.com/dgruber/drmaa2os/tree/master/pkg/jobtracker/slurmcli). It could serve as a starting point, and it certainly deserves more attention. Contributions welcome!

dgruber avatar Jun 20 '19 18:06 dgruber