dask-jobqueue
What to do with kwargs common to every JobQueueCluster impl (walltime, queue ...)
In a recent PR #200, I've become aware that every JobQueueCluster implementation defines similar kwargs:
- queue
- project
- walltime
- job_extra (except SGE)
What should we do with those? Shouldn't they be declared in JobQueueCluster, and just used by the implementations?
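For concreteness, here is a rough sketch of what hoisting these shared kwargs into the base class could look like. Class and method names are simplified for illustration and do not match dask-jobqueue's actual internals:

```python
# Hypothetical sketch: declare the shared kwargs once on the base class;
# each implementation only translates them into its own directive syntax.
# Names are simplified and are not dask-jobqueue's real internals.
class JobQueueCluster:
    def __init__(self, queue=None, project=None, walltime=None,
                 job_extra=None, **kwargs):
        self.queue = queue
        self.project = project
        self.walltime = walltime
        self.job_extra = job_extra or []


class PBSCluster(JobQueueCluster):
    def job_header_lines(self):
        # Only the mapping from shared attribute to scheduler directive
        # is implementation-specific.
        lines = []
        if self.queue:
            lines.append("#PBS -q %s" % self.queue)
        if self.project:
            lines.append("#PBS -A %s" % self.project)
        if self.walltime:
            lines.append("#PBS -l walltime=%s" % self.walltime)
        lines.extend("#PBS %s" % extra for extra in self.job_extra)
        return lines
```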
Are there any job schedulers that we might support in the future that would not have these keyword arguments?
Currently we have only identified HTCondor as a potential new job scheduler to support. I don't know the tool well enough to have an opinion here; it looks quite different from what we already have. Maybe @szs8, @jrbourbeau or @mivade have some opinion?
Another scheduler we might support is Torque, which is quite close to PBS. We've also talked about Cobalt and Jetstream in another issue, but they seem quite rare to me (though I may be totally wrong).
But is this really a problem if the HTCondor implementation does not use them?
A bit old, but relevant. Argonne's HPC systems use the Cobalt job scheduler, and I've recently run across a Python project that I'd like to scale using Dask and dask-jobqueue. It appears simple enough to extend the library to support Cobalt schedulers, since it's mostly a matter of changing the headers and switching the submit/cancel/run calls (roughly along the lines of the sketch below). I'll see if I can write an extension that works with my project and open a PR here.
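For what it's worth, a minimal sketch of that idea, following the subclass pattern of the existing implementations. The submit_command/cancel_command attributes mirror dask-jobqueue's conventions, while the -q/-A/-t flags are assumptions about Cobalt's qsub and should be verified:

```python
# Rough sketch of a Cobalt cluster class following the pattern of the
# existing implementations. The #COBALT flags are assumptions about
# Cobalt's qsub, not verified against its documentation.
from dask_jobqueue import JobQueueCluster


class CobaltCluster(JobQueueCluster):
    submit_command = "qsub"  # Cobalt also uses qsub/qdel, like PBS
    cancel_command = "qdel"

    def __init__(self, queue=None, project=None, walltime=None, **kwargs):
        header_lines = []
        if queue is not None:
            header_lines.append("#COBALT -q %s" % queue)
        if project is not None:
            header_lines.append("#COBALT -A %s" % project)
        if walltime is not None:
            header_lines.append("#COBALT -t %s" % walltime)
        # The base class consumes job_header when building the job script.
        self.job_header = "\n".join(header_lines)
        super().__init__(**kwargs)
```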
@Phantom139, yeah, yet another job scheduler I had never heard of before :-)!
I think https://github.com/dask/dask-jobqueue/issues/4#issuecomment-399970152 might still be useful to implement a Cluster class for a new job scheduler. Let us know if you get stuck!
Note that if Cobalt is reasonably close to another job scheduler that we already support in dask-jobqueue, it may be a good idea to inherit from that cluster class.
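To make that concrete, a toy sketch of the inheritance approach, assuming (purely hypothetically) that Cobalt is close enough to PBS that only the directive prefix differs:

```python
# Hypothetical: if Cobalt behaves closely enough to PBS, inherit from
# PBSCluster and swap only what differs. Rewriting the directive prefix
# this way is illustrative, not a real dask-jobqueue mechanism.
from dask_jobqueue import PBSCluster


class CobaltCluster(PBSCluster):
    # Cobalt reportedly uses the same qsub/qdel commands as PBS, so in
    # this assumed scenario only the header prefix needs changing.
    def job_script(self):
        return super().job_script().replace("#PBS", "#COBALT")
```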
@lesteve Thanks for the link, that does contain some useful information that may help with my work.
At the moment, I've got the work I've done so far here. I didn't include the docstring in my code, but that's something that can be done later. I'm getting fairly close, but I just need to work out some specific errors raised by the cluster when trying to run a job.
I may end up dropping what I have and inheriting from another cluster class, but I think writing its own class gives a bit more control over the specifics, so I'll continue unless I run into a roadblock I can't get past.
Great to hear you are getting fairly close! If you run into errors that are hard to make sense of, don't hesitate to create a separate issue about Cobalt support. The documentation on "How to debug" could also be useful, since it is not specific to the particular job scheduler you use: https://jobqueue.dask.org/en/latest/debug.html
Here are some observations I have made so far in my work. I am using Argonne's Theta cluster for testing purposes.
- Submitting a job to Theta (Cobalt) writes the submission information to stderr for some silly reason, which in turn yields a return code of 1 even though the job successfully reaches the queue (I checked via https://status.alcf.anl.gov/theta/activity).
- I'm not sure if this is caused by the newline between the shebang and the headers, but the #COBALT headers set in the file aren't being recognized, so instead I pass these arguments as command-line parameters (a rough sketch of this workaround follows below).
I'm not sure if there's an easy way around this (I'll ask around and see if I can find out why output is going to stderr), so I'm probably going to have a class that overrides many of the abstract class's methods. But I suppose if the end result is a working product, that's all that really matters.
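For illustration, a hedged sketch of that workaround. The _submit_job/_call names mirror dask-jobqueue's internal helpers, and the -q/-A/-t flags are assumptions about Cobalt's qsub; all of it should be double-checked:

```python
# Hypothetical workaround sketch: pass Cobalt options as command-line
# arguments to qsub instead of relying on in-script #COBALT headers.
from dask_jobqueue import JobQueueCluster


class CobaltCluster(JobQueueCluster):
    submit_command = "qsub"
    cancel_command = "qdel"

    def __init__(self, queue=None, project=None, walltime=None, **kwargs):
        self.queue = queue
        self.project = project
        self.walltime = walltime
        super().__init__(**kwargs)

    def _submit_job(self, script_filename):
        # Build on the command line the flags that would normally live
        # in the job header, since Theta ignores the in-script headers.
        cmd = [self.submit_command]
        if self.queue:
            cmd += ["-q", self.queue]
        if self.project:
            cmd += ["-A", self.project]
        if self.walltime:
            cmd += ["-t", self.walltime]
        return self._call(cmd + [script_filename])
```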
Interesting ... at this point, could you create a separate issue specifically about Cobalt support so that we can have a focused discussion there?
Please also copy and paste the relevant information on the Cobalt issue so that it is reasonably self-contained.
We just declared job_extra_directives (see #577) in the JobQueueCluster class. I think we should do the same with the other arguments listed above.
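As a hedged illustration of where this leads (job_extra_directives is the real parameter from #577; the resource and queue values below are example placeholders):

```python
# With job_extra_directives declared on JobQueueCluster, every
# implementation accepts it uniformly; queue, project, and walltime
# could follow the same pattern. Values below are placeholders.
from dask_jobqueue import PBSCluster

cluster = PBSCluster(
    cores=24,
    memory="100GB",
    queue="regular",
    walltime="01:00:00",
    job_extra_directives=["-l select=1:ncpus=24"],
)
print(cluster.job_script())  # inspect the generated submission script
```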