
Running jobs in a Slurm cluster

Open ThomasA opened this issue 1 year ago • 6 comments

The feature It would be interesting if Runhouse could also interface with an existing Slurm cluster.

Motivation I am part of a team managing a Slurm (GPU) cluster. At the same time, I have users who are interested in running large language models via Runhouse (https://langchain.readthedocs.io/en/latest/modules/llms/integrations/self_hosted_examples.html). It would be excellent if I could bridge this gap between supply and demand with Runhouse. From what I have read in the documentation so far, Runhouse does not seem to come with an interface to Slurm.

What the ideal solution looks like I am completely new to Runhouse, so this may not be the ideal model, but I imagine Slurm could be supported as a bring-your-own (BYO) cluster with a little extra interaction between Runhouse and Slurm: Runhouse (maybe from the Cluster factory method) would request the necessary resources as one or more jobs in Slurm, probably through the Slurm REST API. Once the jobs are running, Runhouse could contact the allocated nodes as a BYO cluster, roughly as sketched below.
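To make that a bit more concrete, here is a rough sketch of the flow I have in mind, assuming slurmrestd is exposed on the head node. The endpoint version, job fields, hostnames, credentials, and the `rh.cluster(ips=..., ssh_creds=...)` arguments are all assumptions on my part, not an existing integration:

```python
# Hypothetical sketch: ask Slurm for a GPU node via slurmrestd, then hand the
# allocated node to Runhouse as a bring-your-own (BYO) cluster.
# All names, paths, and tokens below are placeholders for illustration only.
import time
import requests
import runhouse as rh

SLURMRESTD = "http://slurm-head:6820"        # assumed slurmrestd address
HEADERS = {
    "X-SLURM-USER-NAME": "thomas",           # assumed Slurm user
    "X-SLURM-USER-TOKEN": "<jwt-token>",     # e.g. from `scontrol token`
}

# 1. Submit a job that simply holds one GPU node (field names are
#    version/site dependent; "tres_per_node" is an assumption here).
job_spec = {
    "job": {
        "name": "runhouse-byo",
        "partition": "gpu",
        "tres_per_node": "gres/gpu:1",
        "time_limit": 120,                   # minutes
        "current_working_directory": "/home/thomas",
        "environment": {"PATH": "/usr/bin:/bin"},
    },
    "script": "#!/bin/bash\nsleep infinity\n",
}
resp = requests.post(
    f"{SLURMRESTD}/slurm/v0.0.38/job/submit", json=job_spec, headers=HEADERS
)
resp.raise_for_status()
job_id = resp.json()["job_id"]

# 2. Poll until the job is RUNNING, then read the allocated node's hostname
#    (assuming that hostname is directly SSH-reachable).
while True:
    job = requests.get(
        f"{SLURMRESTD}/slurm/v0.0.38/job/{job_id}", headers=HEADERS
    ).json()["jobs"][0]
    if job["job_state"] == "RUNNING":
        node = job["nodes"]                  # e.g. "gpu-node-01"
        break
    time.sleep(5)

# 3. Treat the allocated node as a BYO cluster in Runhouse.
gpu = rh.cluster(
    name="slurm-gpu",
    ips=[node],
    ssh_creds={"ssh_user": "thomas", "ssh_private_key": "~/.ssh/id_rsa"},
)
gpu.run(["nvidia-smi"])                      # sanity check
```

The `sleep infinity` script is just a placeholder to hold the allocation; presumably the integration could instead launch whatever server process Runhouse needs inside the Slurm job.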

ThomasA avatar Mar 17 '23 19:03 ThomasA

This is very interesting, and we've actually been digging into it for a few weeks. It seems doable, pending a few questions about how the cluster is set up. Would you be willing to jump on a quick call to answer those questions and talk through our approach?

dongreenberg avatar Mar 21 '23 16:03 dongreenberg

I would like to help out and can probably also help test it in our cluster. I can be available for a call at US-friendly times tomorrow and maybe Friday. Can you email me to coordinate?

ThomasA avatar Mar 22 '23 14:03 ThomasA

I am also interested in getting Runhouse to interface with a Slurm cluster.

Has there been any progress recently on this issue?

andre15silva avatar Apr 17 '23 13:04 andre15silva

Hey Andre, we're still in the POC stage - we'd be happy to speak with you to hear more about your requirements and how the integration could work for your setup. Just sent you an email to coordinate.

jlewitt1 avatar Apr 17 '23 14:04 jlewitt1

Hello,

I have a similar setup to those above and would like to try out Runhouse. Are there any updates on this issue? I have experience interfacing with Slurm clusters, so I would be happy to contribute if that would help get this past POC.

eugene-prout avatar Jul 09 '23 10:07 eugene-prout

Hi Eugene, thanks for reaching out! We'd love to support Slurm; it's on our roadmap along with other compute providers (e.g. Kubernetes), and we hope to get Slurm support into the next release or the one after. In the meantime we'd be happy to hear your thoughts and welcome a possible contribution! Sent you an email to discuss further.

jlewitt1 avatar Jul 10 '23 21:07 jlewitt1