bsub
template many jobs in single command.
We often have a list of files and we want to do the same thing to all of them. If we know the length, we can use an array, but what if it could be like:
inputs = "/path/to/*.sam"
command = "samtools view -bS {input} > results/{basename}.bam"
where we automatically define variables such as basename, dirname, date, wd,
and input for each item in inputs.
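A minimal sketch of how those per-input variables could be built and substituted. The helper name `template_vars` is hypothetical, and stripping the extension from `basename` is an assumption made so that `results/{basename}.bam` comes out as expected:

```python
import datetime
import os


def template_vars(path):
    """Build the per-input template variables (hypothetical helper)."""
    filename = os.path.basename(path)
    return {
        "input": path,
        "dirname": os.path.dirname(path),
        # Assumption: {basename} is the file name without its extension.
        "basename": os.path.splitext(filename)[0],
        "date": datetime.date.today().isoformat(),
        "wd": os.getcwd(),
    }


command = "samtools view -bS {input} > results/{basename}.bam"
# str.format ignores unused keys, so every variable can be passed each time.
print(command.format(**template_vars("/path/to/a.sam")))
```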
Then one could specify a name that is either a function that is called on each input and returns a name (the default is os.path.basename):
def name(input):
    return input.split("/")[-1].split(".")[1]
or a string holding a regular expression that is applied to the input and captures the name:
name = r"info\.(sample_\d+)\.sam"
Where the capture from (sample_\d+) would become available as name.
Then they could use "{name}" in their template. They could also
use named captures in the regexp, which could then be used in their format
string (http://docs.python.org/2/library/re.html#re.MatchObject.groupdict).
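A rough sketch of how the three kinds of name getter (none, function, regexp string) could be resolved into template variables. `resolve_name` is a hypothetical helper for illustration, not the function in the template branch:

```python
import os
import re


def resolve_name(name_getter, path):
    """Return the template variables contributed by name_getter for one input."""
    if name_getter is None:
        # Default described above: the name is os.path.basename of the input.
        return {"name": os.path.basename(path)}
    if callable(name_getter):
        result = name_getter(path)
        # A function may return a plain name or a dict of variables.
        return result if isinstance(result, dict) else {"name": result}
    match = re.search(name_getter, path)
    if match is None:
        raise ValueError("pattern %r did not match %r" % (name_getter, path))
    named = match.groupdict()
    if named:
        # Named captures become template variables directly.
        return named
    # Otherwise use the first capture, or the whole match if there is none.
    return {"name": match.group(1) if match.groups() else match.group(0)}
```

The resulting dict can be merged with the other per-input variables before calling `str.format` on the command.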
I think this would clean up a lot of my pipelines. So then it would be like:
from bsub import bsub
job_ids, names = bsub.template(command, inputs, name_getter=name)
If command is actually a list:
commands = ["samtools view -bS {input} > results/{basename}.bam",
"samtools index results/{basename}.bam"]
bsub.template(commands, inputs)
and for each input, it runs job(commands[0]).then(commands[1]) ..., chaining as many commands as needed.
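To make the list case concrete, here is a sketch that only expands the commands per input, without submitting anything. `expand` is a hypothetical helper, and `basename` is again assumed to be the file name without its extension:

```python
import os


def expand(commands, inputs):
    """Return {input: [formatted commands, in the order they should run]}."""
    if isinstance(commands, str):
        # A single command is treated as a one-element chain.
        commands = [commands]
    plan = {}
    for path in inputs:
        basename = os.path.splitext(os.path.basename(path))[0]
        plan[path] = [c.format(input=path, basename=basename) for c in commands]
    return plan


commands = ["samtools view -bS {input} > results/{basename}.bam",
            "samtools index results/{basename}.bam"]
plan = expand(commands, ["/path/to/a.sam", "/path/to/b.sam"])
```

Each list in `plan` would then be submitted as one chain: the first command as the initial job and each later command via .then().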
This would be nicer if we had better support for checking successful completion of a job inside of .then() by polling and checking the log file for successful termination instead of relying on LSF's wait().
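One way the log polling could look, as a sketch only. The marker strings are assumptions about the text LSF writes into a job's output file and should be checked against real logs; `wait_for_log` is a hypothetical name:

```python
import time

# Assumed markers from the LSF job report; verify against your own log files.
SUCCESS = "Successfully completed."
FAILURE = "Exited with exit code"


def wait_for_log(log_path, interval=5.0, timeout=3600.0):
    """Poll log_path; return True on success, False on failure, raise on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with open(log_path) as fh:
                text = fh.read()
        except FileNotFoundError:
            # The job may not have written its log yet.
            text = ""
        if SUCCESS in text:
            return True
        if FAILURE in text:
            return False
        if time.monotonic() >= deadline:
            raise TimeoutError("no completion marker in %s" % log_path)
        time.sleep(interval)
```

With something like this, .then() would only submit the next command when the previous log reports success.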
As written, template would be a classmethod. It could also be called "multisub".
A complete example (working in the template branch) would look like:
import re
from bsub import bsub
commands = 'bwameth.py --reference {ref} --read-group {sample} -p {results}/{name} {r1} {r2}'
inputs = "/proj/brentp/2014/ken-target-bis-seq/*/*_R1_*.fastq.gz"
def namer(fq):
    info = re.match(r".+/(?P<name>(?P<sample>.+)_[ATCG]+_L\d+)_R1_[01]{3}", fq).groupdict()
    info['r1'] = fq
    info['r2'] = info['r1'].replace('_R1_', '_R2_')
    return info
bsub.template(commands, inputs, verbose=666, name_getter=namer,
              info_dict=dict(ref="/path/to/ref.fa", results="/results/"), n=12, R="rusage[mem=100]")
Here, I have to use name_getter as a function because I need to get R2.
The extra kwargs like n and R get sent straight to bsub.
As of e438378e8b7b3ac0c78783f1ece0f73ecd the above can be shortened to:
import re
from bsub import bsub
commands = 'bwameth.py --reference {ref} --read-group {sample} -p {results}/{name} {r1} {r2}'
inputs = "/proj/brentp/2014/ken-target-bis-seq/*/*_R1_*.fastq.gz"
namer = r".+/(?P<name>(?P<sample>.+)_[ATCG]+_L\d+)_R1_[01]{3}"
bsub.template(commands, inputs, verbose=666, name_getter=namer,
              info_dict=dict(ref="/path/to/ref.fa", results="/results/"), n=12, R="rusage[mem=100]")
This works by seeing if the first file is a fastq and if s/R1/R2/ of that exists. If so, r1, r2, fq1, and fq2 are autofilled and available as template variables.
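A sketch of that autofill check, not the code from the commit. `paired_end_vars` is a hypothetical helper; the list of fastq extensions and the use of `_R1_`/`_R2_` (rather than a bare R1/R2) are assumptions taken from the example paths above:

```python
import os


def paired_end_vars(path, exists=os.path.exists):
    """Return r1/r2/fq1/fq2 if path is an R1 fastq whose R2 mate exists."""
    # Assumption: these suffixes are what counts as "a fastq".
    is_fastq = path.endswith((".fastq", ".fq", ".fastq.gz", ".fq.gz"))
    dirname, filename = os.path.split(path)
    if not is_fastq or "_R1_" not in filename:
        return {}
    # Only rewrite the file name so directory names are left alone.
    mate = os.path.join(dirname, filename.replace("_R1_", "_R2_"))
    if not exists(mate):
        return {}
    return {"r1": path, "r2": mate, "fq1": path, "fq2": mate}
```

The `exists` parameter is there only so the check can be exercised without real files on disk.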