
Template job scripts with Jinja

Open wtbarnes opened this issue 5 years ago • 31 comments

This PR addresses #133 and is a first pass at using Jinja2 to template job submission scripts. See also #338. In general, the approach I've used here is to store the templates as separate files in templates/ such that there is a separate template for the worker command, the top level job script, and then a separate file for the job header for each scheduler. Rather than appending strings to a list, all of the formatting logic is now contained in the templates using the Jinja syntax.
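
To give a flavour of the layout (a simplified sketch only; the actual template names and variables in templates/ may differ), the top-level job script template can pull in the scheduler-specific header and the worker command via Jinja includes:

# Simplified sketch of how the pieces could compose; the real templates in
# templates/ may use different names and variables.
from jinja2 import DictLoader, Environment

env = Environment(loader=DictLoader({
    "job.sh.j2": (
        "#!/usr/bin/env bash\n\n"
        "{% include header_template %}\n"
        "{% include 'worker_command.j2' %}\n"
    ),
    "pbs_header.j2": "#PBS -N {{ job_name }}\n#PBS -q {{ queue }}",
    "worker_command.j2": "{{ python }} -m distributed.cli.dask_worker {{ scheduler_address }}",
}))

print(env.get_template("job.sh.j2").render(
    header_template="pbs_header.j2",
    job_name="dask-worker", queue="myqueue",
    python="/usr/bin/python", scheduler_address="tcp://127.0.0.1:54321",
))

Rendering is then a single get_template(...).render(...) call rather than string concatenation.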

Thus far, I've modified the top level job script and the worker command to use Jinja2. The following job scheduler headers have been retemplated with Jinja:

  • [x] PBS
  • [x] HTCondor
  • [x] LSF
  • [x] Moab
  • [x] OAR
  • [x] SGE
  • [x] SLURM

Closes #133, closes #321.

wtbarnes avatar Nov 24 '19 06:11 wtbarnes

I don't have the time to check this right now, but I just wanted to say thanks for working on this!!

guillaumeeb avatar Nov 27 '19 15:11 guillaumeeb

@guillaumeeb no problem at all. I understand that PRs of this size require a lot of effort on the part of maintainers!

It looks like the failing SGE job is due to a problem with installing some packages from apt. Could someone restart that job to see if that fixes things? I don't think the failure is related to anything in this PR.

wtbarnes avatar Nov 27 '19 16:11 wtbarnes

Thanks a lot! Like @guillaumeeb, I may not have a lot of time to look at this in detail for some time.

What I would like in an ideal world:

  • fix a few things and release a 0.7.1 version, hopefully in a matter of weeks.
  • review this PR, merge it and maybe release a 0.8, hopefully in January

lesteve avatar Nov 28 '19 15:11 lesteve

Hmmm I have restarted the build and the SGE error is similar. Probably something to look into ...

lesteve avatar Nov 28 '19 15:11 lesteve

OK restarting the SGE job a few more times fixed the issue ...

Note this intermittent issue has been seen on master from time to time, e.g. https://travis-ci.org/dask/dask-jobqueue/jobs/617519102.

lesteve avatar Dec 01 '19 15:12 lesteve

Thanks @lesteve! I know this is a big PR and will take some time to review.

I currently have access to a PBS cluster and will test this out "in the wild." It'd be great if anyone with access to systems running Condor, LSF, or OAR could give this a go as well. I'm sure there are some hangups related to job script formatting that need to be worked out.

wtbarnes avatar Dec 01 '19 15:12 wtbarnes

Just wanted to check back in and see if anyone has had time to test this out or look it over 🙂 Happy to incorporate any feedback!

If accepted, is the plan still to include this in the 0.8 release and is there a rough timeline for that release?

wtbarnes avatar Feb 20 '20 04:02 wtbarnes

Hey @wtbarnes, we're not forgetting you and are again really happy you took a stab at this.

I guess the first plan described by @lesteve above got a bit delayed; we're probably too busy at the moment to make the small progress we wanted to achieve. It is thus difficult to give you a timeline as we're already late... We definitely plan to take a closer look at this one for the 0.8 release.

In the meantime, if you feel like it, could you take a look at the few issues identified by @lesteve for a 0.7.1 release and try to fix them? I think they are easy enough to fix (but I may be wrong).

guillaumeeb avatar Feb 21 '20 13:02 guillaumeeb

I guess the first plan described by @lesteve above got a bit delayed; we're probably too busy at the moment to make the small progress we wanted to achieve. It is thus difficult to give you a timeline as we're already late... We definitely plan to take a closer look at this one for the 0.8 release.

Thanks for the quick response @guillaumeeb. I'm not really in any hurry here, just wanted to check in. I know that release timelines are always a bit optimistic :smile:, especially when all the work is done by volunteers!

In the meantime, if you feel like it, could you take a look at the few issues identified by @lesteve for a 0.7.1 release and try to fix them? I think they are easy enough to fix (but I may be wrong).

Sure, I can try to look these over. These would be just those issues/PRs milestoned for 0.7.1?

wtbarnes avatar Feb 21 '20 15:02 wtbarnes

  • Shouldn't we add jinja2 to the requirements file?

Yes! I meant to do this.

  • If a user wants to customize the Jinja templates for any reason (which is the final goal here: if a cluster job queuing system has some specific quirk, we want an elegant solution to work around it), how is that done? Maybe this is already feasible by dropping a new template somewhere, or we need to add a way to override the default templates. In any case this should be documented somewhere in the docs. Can you work on this?

Jinja allows you to specify multiple possible template loaders. In my current implementation, the templates are only being loaded from the package:

https://github.com/wtbarnes/dask-jobqueue/blob/934e403bb81b929458c240e7f1a0aa3a02108fda/dask_jobqueue/core.py#L244

This can easily be modified to accept custom templates that override the defaults:

from jinja2 import ChoiceLoader, DictLoader, Environment, PackageLoader

loader = ChoiceLoader([
    DictLoader(custom_templates),                 # user-supplied overrides, tried first
    PackageLoader("dask_jobqueue", "templates"),  # packaged defaults as the fallback
])
env = Environment(loader=loader)

where custom_templates is a dictionary containing any templates the user wants to override. If the dictionary is empty, each template will just fall back to those specified in the package.

The bigger question is what is the best entry point for the user to specify these templates? The Jinja environment lives on the Job base class so we would need to work out how best to propagate this information from cluster instantiation to the instantiation of each job.
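
For example (purely hypothetical: the template_overrides keyword below does not exist in this PR; it is just one possible shape for the plumbing), the Job base class could build its environment from whatever the cluster hands it:

# Hypothetical sketch only: "template_overrides" is not an existing
# dask-jobqueue keyword; this stand-in class just shows one way user
# templates could be threaded down to each job's Jinja environment.
from jinja2 import ChoiceLoader, DictLoader, Environment, PackageLoader

class JobSketch:
    def __init__(self, template_overrides=None, **kwargs):
        loader = ChoiceLoader([
            DictLoader(template_overrides or {}),         # user templates win
            PackageLoader("dask_jobqueue", "templates"),  # packaged defaults
        ])
        self.environment = Environment(loader=loader)

# The cluster would collect the dict from the user and forward it when it
# constructs each job; the exact keyword and plumbing are what's open here.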

My personal opinion is that this could probably be its own PR as it is complicated enough and really tackles a separate (though certainly related) issue. However, if people would rather see these changes all merged in at once, I'm happy to give that a go as well.

wtbarnes avatar Feb 21 '20 16:02 wtbarnes

My personal opinion is that this could probably be its own PR as it is complicated enough and really tackles a separate (though certainly related) issue. However, if people would rather see these changes all merged in at once, I'm happy to give that a go as well.

Okay, that sounds fair!

guillaumeeb avatar Feb 21 '20 19:02 guillaumeeb

@wtbarnes sorry I haven't had time to look at this one indeed ...

This is the kind of PR that is quite hard to review because the diff is quite sizeable. One thing that would make me a lot more relaxed about this PR would be if you could generate some reference submission scripts (with cluster.job_script()) from master, exploring a few different parameters, and compare them with the submission scripts you get in this PR.

If you manage to get something going along these lines, feel free to post a snippet here!

lesteve avatar Mar 04 '20 18:03 lesteve

Also just for completeness, my current dask-jobqueue priority is to try to make some progress towards 0.7.1 in the next few weeks!

lesteve avatar Mar 04 '20 18:03 lesteve

I think we should take another look at this one. There have been some small changes to the code on the master branch. Would you be willing to update your PR @wtbarnes?

guillaumeeb avatar May 22 '20 08:05 guillaumeeb

Sorry, I've not had the time to come back to this in a while!

I see there are quite a few conflicts now with several of the different scheduler files. I'm happy to rebase this against the master branch and try to fix those if that is the preferred workflow.

wtbarnes avatar May 22 '20 18:05 wtbarnes

@wtbarnes yes please! :)

jacobtomlinson avatar May 27 '20 12:05 jacobtomlinson

Ok I've rebased this against master and added Jinja2 to the requirements file, but now it seems all of the tests are failing so I've a bit more work to do to get this into good shape.

wtbarnes avatar May 27 '20 19:05 wtbarnes

@wtbarnes sorry I introduced a conflict by merging a PR. Here is my current thinking:

  • release a 0.7.2 with the few improvements that went in since 0.7.1. Maybe in a week or two :crossed_fingers:
  • look at this PR more seriously for 0.8

Sorry this is taking such a long time, but I am afraid this PR is not trivial to review at all and I don't have that much time for Dask-Jobqueue maintenance these days ...

Something that would make this PR easier to review, as I mentioned a few months ago (already? :sweat:):

One thing that would make me a lot more relaxed about this PR would be if you could generate some reference submission scripts (with cluster.job_script()) from master, exploring a few different parameters, and compare them with the submission scripts you get in this PR.

lesteve avatar Jun 01 '20 06:06 lesteve

@wtbarnes if you're still around, this would be a good time to finish this one!

guillaumeeb avatar Jan 23 '21 22:01 guillaumeeb

@wtbarnes if you're still around, this would be a good time to finish this one!

@guillaumeeb I am still around! I'm currently involved in a project that heavily uses jobqueue so I can probably justify spending some time finally getting this fixed up.

wtbarnes avatar Jan 27 '21 17:01 wtbarnes

Hi all, I'd like to see this implemented. To help out I made some tests to compare job script output with the current main.

The script is here: https://github.com/sdtaylor/dask-jobqueue/blob/jinja-templates2/dask_jobqueue/tests/test_script_generation.py

Note I had to pull a recent change to fix a get_ip_interface error, but it looks like a rebase will deal with that.
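
For anyone skimming, the comparison boils down to something like the following (a simplified sketch, not the actual contents of that test file; parameters and the reference path are illustrative):

# Simplified sketch of the comparison idea (the linked test file differs in
# details): render a job script from this branch and diff it against a
# reference script generated from main with the same parameters.
import difflib
from dask_jobqueue import PBSCluster

cluster = PBSCluster(
    queue="myqueue", project="myproject", cores=1, memory="2GB",
    walltime="00:02", local_directory="/foo",
)
current = cluster.job_script().splitlines()

with open("reference_scripts/pbs1.sh") as f:  # hypothetical path to a script saved from main
    expected = f.read().splitlines()

print("\n".join(difflib.unified_diff(expected, current, lineterm="")))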

Results from the current jinja-templates commit are:

######################
Script mismatch for PBS1:
######################
--- 

+++ 

@@ -1,13 +1,14 @@

-***EXPECTED***
+***CURRENT***
 #!/usr/bin/env bash
 
 #PBS -N dask-worker
 #PBS -q myqueue
 #PBS -A myproject
 #PBS -l select=1:ncpus=1:mem=1908MB
 #PBS -l walltime=00:02
+
 #PBS -x extra-option1
 #PBS -z extraoption2
+JOB_ID=${PBS_JOBID%%.*}
 
-/usr/bin/python -m distributed.cli.dask_worker tcp://127.0.0.1:54321 --nthreads 1 --memory-limit 1.86GiB --name dummy-name --nanny --death-timeout 60 --local-directory /foo --interface lo --protocol tcp://
-
+/usr/bin/python -m distributed.cli.dask_worker tcp://127.0.0.1:54321 --nthreads 1 --memory-limit 1.86GiB --name name --nanny --death-timeout 60 --local-directory /foo --interface lo --interface lo
######################
Script mismatch for MOAB1:
######################
--- 

+++ 

@@ -1,13 +1,14 @@

-***EXPECTED***
+***CURRENT***
 #!/usr/bin/env bash
 
 #PBS -N dask-worker
 #PBS -q myqueue
 #PBS -A myproject
 #PBS -l select=1:ncpus=1:mem=1908MB
 #PBS -l walltime=00:02
+
 #PBS -x extra-option1
 #PBS -z extraoption2
+JOB_ID=${PBS_JOBID%%.*}
 
-/usr/bin/python -m distributed.cli.dask_worker tcp://127.0.0.1:54321 --nthreads 1 --memory-limit 1.86GiB --name dummy-name --nanny --death-timeout 60 --local-directory /foo --interface lo --protocol tcp://
-
+/usr/bin/python -m distributed.cli.dask_worker tcp://127.0.0.1:54321 --nthreads 1 --memory-limit 1.86GiB --name name --nanny --death-timeout 60 --local-directory /foo --interface lo --interface lo
######################
Script mismatch for SGE1:
######################
--- 

+++ 

@@ -1,14 +1,15 @@

-***EXPECTED***
+***CURRENT***
 #!/usr/bin/env bash
 
 #$ -N dask-worker
 #$ -q myqueue
 #$ -P myproject
+
 #$ -l h_rt=00:02
+
 #$ -cwd
 #$ -j y
 #$ -x extra-option1
 #$ -z extraoption2
 
-/usr/bin/python -m distributed.cli.dask_worker tcp://127.0.0.1:54321 --nthreads 1 --memory-limit 1.86GiB --name dummy-name --nanny --death-timeout 60 --local-directory /foo --protocol tcp://
-
+/usr/bin/python -m distributed.cli.dask_worker tcp://127.0.0.1:54321 --nthreads 1 --memory-limit 1.86GiB --name name --nanny --death-timeout 60 --local-directory /foo  
######################
Script mismatch for HTCondor:
######################
--- 

+++ 

@@ -1,18 +1,18 @@

-***EXPECTED***
+***CURRENT***
 #!/usr/bin/env bash
 
 MY.DaskWorkerName = "htcondor--$F(MY.JobId)--"
 RequestCpus = MY.DaskWorkerCores
 RequestMemory = floor(MY.DaskWorkerMemory / 1048576)
 RequestDisk = floor(MY.DaskWorkerDisk / 1024)
 MY.JobId = "$(ClusterId).$(ProcId)"
 MY.DaskWorkerCores = 1
 MY.DaskWorkerMemory = 2000000000
 MY.DaskWorkerDisk = 1000000000
 
-Environment = ""
-Arguments = "-c '/usr/bin/python -m distributed.cli.dask_worker tcp://127.0.0.1:54321 --nthreads 1 --memory-limit 1.86GiB --name dummy-name --nanny --death-timeout 60 --local-directory /foo --protocol tcp://'"
+
+Environment = " JOB_ID=$F(MY.JobId)"
+Arguments = "-c '/usr/bin/python -m distributed.cli.dask_worker tcp://127.0.0.1:54321 --nthreads 1 --memory-limit 1.86GiB --name name --nanny --death-timeout 60 --local-directory /foo  '"
 Executable = /bin/sh
 
 Queue
-
######################
Script mismatch for OAR1:
######################
--- 

+++ 

@@ -1,12 +1,12 @@

-***EXPECTED***
+***CURRENT***
 #!/usr/bin/env bash
 
 #OAR -n dask-worker
 #OAR -q myqueue
 #OAR --project myproject
 #OAR -l /nodes=1/core=1,walltime=00:02
 #OAR -x extra-option1
 #OAR -z extraoption2
+JOB_ID=${OAR_JOB_ID}
 
-/usr/bin/python -m distributed.cli.dask_worker tcp://127.0.0.1:54321 --nthreads 1 --memory-limit 1.86GiB --name dummy-name --nanny --death-timeout 60 --local-directory /foo --protocol tcp://
-
+/usr/bin/python -m distributed.cli.dask_worker tcp://127.0.0.1:54321 --nthreads 1 --memory-limit 1.86GiB --name name --nanny --death-timeout 60 --local-directory /foo  
######################
Script mismatch for LSF1:
######################
--- 

+++ 

@@ -1,14 +1,14 @@

-***EXPECTED***
+***CURRENT***
 #!/usr/bin/env bash
 
 #BSUB -J dask-worker
+
 #BSUB -q myqueue
-#BSUB -P "myproject"
-#BSUB -n 1
+#BSUB -P myproject#BSUB -n 1
 #BSUB -M 1
 #BSUB -W 00:02
 #BSUB -x extra-option1
 #BSUB -z extraoption2
+JOB_ID=${LSB_JOBID%.*}
 
-/usr/bin/python -m distributed.cli.dask_worker tcp://127.0.0.1:54321 --nthreads 1 --memory-limit 1.86GiB --name dummy-name --nanny --death-timeout 60 --local-directory /foo --protocol tcp://
-
+/usr/bin/python -m distributed.cli.dask_worker tcp://127.0.0.1:54321 --nthreads 1 --memory-limit 1.86GiB --name name --nanny --death-timeout 60 --local-directory /foo  
######################
Script mismatch for SLURM1:
######################
--- 

+++ 

@@ -1,15 +1,17 @@

-***EXPECTED***
+***CURRENT***
 #!/usr/bin/env bash
 
 #SBATCH -J dask-worker
+
 #SBATCH -p myqueue
 #SBATCH -A myproject
 #SBATCH -n 1
 #SBATCH --cpus-per-task=1
 #SBATCH --mem=2G
 #SBATCH -t 00:02
 #SBATCH -x extra-option1
 #SBATCH -z extraoption2
 
-/usr/bin/python -m distributed.cli.dask_worker tcp://127.0.0.1:54321 --nthreads 1 --memory-limit 1.86GiB --name dummy-name --nanny --death-timeout 60 --local-directory /foo --protocol tcp://
+JOB_ID=${SLURM_JOB_ID%;*}
 
+/usr/bin/python -m distributed.cli.dask_worker tcp://127.0.0.1:54321 --nthreads 1 --memory-limit 1.86GiB --name name --nanny --death-timeout 60 --local-directory /foo  

sdtaylor avatar Jul 03 '21 18:07 sdtaylor

@sdtaylor Thanks for taking a look into this! I don't know that I'll have the time to wrap this up anytime soon, so feel free to push to this PR and resolve any conflicts as needed. If it's easier, I can also try to rebase on top of main again.

wtbarnes avatar Jul 03 '21 18:07 wtbarnes

Can you do a rebase? I think that'll solve some issues. After that I have a few changes to the templates which will clean things up.

Also, I'm not sure what JOB_ID is doing in the new script output; as far as I can tell, it was never there before.

sdtaylor avatar Jul 03 '21 20:07 sdtaylor

@sdtaylor thanks a lot for the comparison scripts; they look quite useful for reassuring me (with my maintainer hat on) that these changes do not introduce regressions.

I think the next tricky bit is the rebase now; from what I remember there are plenty of conflicts to fix ...

lesteve avatar Jul 05 '21 15:07 lesteve

I've just done a rebase against main. The conflicts were not as bad as I suspected and I've tried to make sure all of my changes are still consistent with any changes that have happened on the particular job schedulers since I opened this PR.

I'm not sure what the merge policy is on this repo, but this should probably be squash-merged.

EDIT: It looks like there are quite a few test failures here, likely because of things that have not been included in the new jinja templates. @sdtaylor I wonder if your proposed changes will take care of those?

Re: the seemingly extraneous JOB_ID, I'm not sure where that came from or why exactly it seems to show up in so many of the templates.

wtbarnes avatar Jul 05 '21 21:07 wtbarnes

For the JOB_ID, it looks like those lines were in use when the Jinja templates were first made but have since been removed, so they can be taken out of the templates. See https://github.com/dask/dask-jobqueue/pull/396

sdtaylor avatar Jul 06 '21 14:07 sdtaylor

I made some small adjustments here to drop JOB_ID and fix the duplicated --interface entry: https://github.com/sdtaylor/dask-jobqueue/tree/jinja-templates. I don't have write permissions to this repo, so someone with access will need to merge them.

With those changes the current comparison looks pretty good. The only minor thing is the extra newlines where entries in the Jinja template are skipped:

######################
Script mismatch for PBS1:
######################
--- 

+++ 

@@ -1,13 +1,13 @@

-***EXPECTED***
+***CURRENT***
 #!/usr/bin/env bash
 
 #PBS -N dask-worker
 #PBS -q myqueue
 #PBS -A myproject
 #PBS -l select=1:ncpus=1:mem=1908MB
 #PBS -l walltime=00:02
+
 #PBS -x extra-option1
 #PBS -z extraoption2
 
 /usr/bin/python -m distributed.cli.dask_worker tcp://127.0.0.1:54321 --nthreads 1 --memory-limit 1.86GiB --name dummy-name --nanny --death-timeout 60 --local-directory /foo --interface lo --protocol tcp://
-
######################
Script mismatch for MOAB1:
######################
--- 

+++ 

@@ -1,13 +1,13 @@

-***EXPECTED***
+***CURRENT***
 #!/usr/bin/env bash
 
 #PBS -N dask-worker
 #PBS -q myqueue
 #PBS -A myproject
 #PBS -l select=1:ncpus=1:mem=1908MB
 #PBS -l walltime=00:02
+
 #PBS -x extra-option1
 #PBS -z extraoption2
 
 /usr/bin/python -m distributed.cli.dask_worker tcp://127.0.0.1:54321 --nthreads 1 --memory-limit 1.86GiB --name dummy-name --nanny --death-timeout 60 --local-directory /foo --interface lo --protocol tcp://
-
######################
Script mismatch for SGE1:
######################
--- 

+++ 

@@ -1,14 +1,15 @@

-***EXPECTED***
+***CURRENT***
 #!/usr/bin/env bash
 
 #$ -N dask-worker
 #$ -q myqueue
 #$ -P myproject
+
 #$ -l h_rt=00:02
+
 #$ -cwd
 #$ -j y
 #$ -x extra-option1
 #$ -z extraoption2
 
 /usr/bin/python -m distributed.cli.dask_worker tcp://127.0.0.1:54321 --nthreads 1 --memory-limit 1.86GiB --name dummy-name --nanny --death-timeout 60 --local-directory /foo --protocol tcp://
-
######################
Script mismatch for HTCondor:
######################
--- 

+++ 

@@ -1,18 +1,18 @@

-***EXPECTED***
+***CURRENT***
 #!/usr/bin/env bash
 
 MY.DaskWorkerName = "htcondor--$F(MY.JobId)--"
 RequestCpus = MY.DaskWorkerCores
 RequestMemory = floor(MY.DaskWorkerMemory / 1048576)
 RequestDisk = floor(MY.DaskWorkerDisk / 1024)
 MY.JobId = "$(ClusterId).$(ProcId)"
 MY.DaskWorkerCores = 1
 MY.DaskWorkerMemory = 2000000000
 MY.DaskWorkerDisk = 1000000000
 
+
 Environment = ""
 Arguments = "-c '/usr/bin/python -m distributed.cli.dask_worker tcp://127.0.0.1:54321 --nthreads 1 --memory-limit 1.86GiB --name dummy-name --nanny --death-timeout 60 --local-directory /foo --protocol tcp://'"
 Executable = /bin/sh
 
 Queue
-
######################
Script mismatch for OAR1:
######################
--- 

+++ 

@@ -1,12 +1,11 @@

-***EXPECTED***
+***CURRENT***
 #!/usr/bin/env bash
 
 #OAR -n dask-worker
 #OAR -q myqueue
 #OAR --project myproject
 #OAR -l /nodes=1/core=1,walltime=00:02
 #OAR -x extra-option1
 #OAR -z extraoption2
 
 /usr/bin/python -m distributed.cli.dask_worker tcp://127.0.0.1:54321 --nthreads 1 --memory-limit 1.86GiB --name dummy-name --nanny --death-timeout 60 --local-directory /foo --protocol tcp://
-
######################
Script mismatch for LSF1:
######################
--- 

+++ 

@@ -1,14 +1,15 @@

-***EXPECTED***
+***CURRENT***
 #!/usr/bin/env bash
 
 #BSUB -J dask-worker
+
 #BSUB -q myqueue
 #BSUB -P "myproject"
 #BSUB -n 1
+
 #BSUB -M 1
 #BSUB -W 00:02
 #BSUB -x extra-option1
 #BSUB -z extraoption2
 
 /usr/bin/python -m distributed.cli.dask_worker tcp://127.0.0.1:54321 --nthreads 1 --memory-limit 1.86GiB --name dummy-name --nanny --death-timeout 60 --local-directory /foo --protocol tcp://
-
######################
Script mismatch for SLURM1:
######################
--- 

+++ 

@@ -1,15 +1,15 @@

-***EXPECTED***
+***CURRENT***
 #!/usr/bin/env bash
 
 #SBATCH -J dask-worker
+
 #SBATCH -p myqueue
 #SBATCH -A myproject
 #SBATCH -n 1
 #SBATCH --cpus-per-task=1
 #SBATCH --mem=2G
 #SBATCH -t 00:02
 #SBATCH -x extra-option1
 #SBATCH -z extraoption2
 
 /usr/bin/python -m distributed.cli.dask_worker tcp://127.0.0.1:54321 --nthreads 1 --memory-limit 1.86GiB --name dummy-name --nanny --death-timeout 60 --local-directory /foo --protocol tcp://
-

sdtaylor avatar Jul 06 '21 14:07 sdtaylor

I've removed the JOB_ID from all of the templates. Are the newline differences important in any of these submission scripts? My experience is limited to SLURM and PBS, and as far as I know neither of them cares much about extraneous newlines.
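
If they do turn out to matter, Jinja's whitespace controls should be able to remove them; a quick sketch of the mechanism (not wired into this PR, just an illustration):

# Sketch only: trim_blocks/lstrip_blocks drop the newlines left behind by
# skipped {% if %} blocks, which is where the extra blank lines in the
# comparison above come from.
from jinja2 import Environment

env = Environment(trim_blocks=True, lstrip_blocks=True)
template = env.from_string(
    """#PBS -N {{ job_name }}
{% if queue %}
#PBS -q {{ queue }}
{% endif %}
{% if project %}
#PBS -A {{ project }}
{% endif %}
"""
)
# With queue unset, no blank line is emitted for the skipped block:
print(template.render(job_name="dask-worker", project="myproject"))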

wtbarnes avatar Jul 08 '21 15:07 wtbarnes

Thanks everyone for reviving this one!!

I've got two questions:

  • Is it in a state to review?
  • One of the goals of this is to make it easy to customize job_script() to adapt to a specific job scheduler setup. How can one change a Jinja template for a given scheduler?

guillaumeeb avatar Jul 12 '21 19:07 guillaumeeb

Is it in a state to review?

I believe so though I would defer to @sdtaylor or @lesteve as well

One of the goals of this is to make it easy to customize job_script() to adapt to a specific job scheduler setup. How can one change a Jinja template for a given scheduler?

I think this old comment is still relevant: https://github.com/dask/dask-jobqueue/pull/370#issuecomment-589720558. It is very easy to modify the Jinja Environment to accept custom templates (which can then override the default templates). However, I'm still not sure what the ideal entry point would be for these custom templates. My original thinking was that this PR would simply reproduce the current functionality and then a separate PR would add the ability to supply custom templates.

wtbarnes avatar Jul 13 '21 01:07 wtbarnes