looper
Running a looper pipeline ad hoc
I wonder whether there is either 1) an existing solution for this or 2) an easy way to add the ability to run a looper pipeline ad hoc. What I mean is this: occasionally, the overhead of setting up a traditional workflow can be a bit daunting, but I really enjoy the ease of dispatching jobs through slurm+looper.
I would love to replace traditional bash for loops with looper calls.
An example
I have a folder with hundreds of mixed-type files. Some of these might be bedGraph files. I want to convert these to .bw format. I can use bigtools bedGraphToBigWig. Traditionally, I might just use a for loop:
for file in *.bdg; do
  bigtools bedGraphToBigWig "$file" "$file.bw"
done
But this takes a while since it goes one by one, and there are hundreds. I'd love to fire them all off at once using looper and slurm:
ls *.bdg | looper run "bigtools bedGraphToBigWig {$1} {$1}.bw"
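For illustration only, something close to the one-liner above can be approximated without looper by wrapping `sbatch --wrap`. This is a minimal bash sketch, not a looper feature: the function name `adhoc_submit`, the `{}` placeholder, and the `DRY_RUN` variable are all assumptions of this example, and the bigtools command line is copied from the thread as-is.

```shell
#!/usr/bin/env bash
# Hypothetical helper (NOT part of looper): read one item per line from stdin,
# fill a command template, and submit each resulting command as a Slurm job.
# The "{}" placeholder and DRY_RUN are conventions invented for this sketch.
adhoc_submit() {
  local template="$1" item quoted cmd
  while IFS= read -r item; do
    [ -n "$item" ] || continue
    # Shell-quote the item so file names with spaces survive inside --wrap.
    quoted=$(printf '%q' "$item")
    # Replace every {} in the template with the quoted item.
    cmd=${template//'{}'/"$quoted"}
    if [ "${DRY_RUN:-0}" = "1" ]; then
      # Print what would be submitted instead of calling sbatch.
      printf 'sbatch --wrap=%s\n' "$cmd"
    else
      sbatch --job-name="adhoc" --wrap="$cmd"
    fi
  done
}
```

Usage would look like `ls *.bdg | DRY_RUN=1 adhoc_submit 'bigtools bedGraphToBigWig {} {}.bw'`; dropping `DRY_RUN=1` submits one job per file. Unlike looper, this does no job tracking or resubmission.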
I suppose I am trying to identify or nail down a potential gap between traditional workflows and the flexibility researchers often need for quick, ad hoc job submission.
I guess the conditions for this to be useful would be:
- Extremely small PEP (one sample attribute)
- Extremely simple pipeline (bash or python one liner)
- Benefits from parallelization
@nleroy917 this is a good idea. IIRC, way back, @nsheff had an example or two like this that "pushed the limits" of looper or "thought outside the box" (if I'm permitted some clichés). Maybe he already has a working example, or something close to it, that would be a good starting point?
From infrastructure on December 3rd, 2024:
There are two things to solve:
- What to do with the command template?
  Maybe using -y to give it a command template (command-extra-override) is a way to provide a command template when there was none to begin with?
- Can we make a PEP on the fly given some kind of info?
  Sure... we could make it accept stdin, and then what I wrote would work...?
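As a rough idea of what "a PEP on the fly" could look like, here is a minimal bash sketch that turns file names read from stdin into a one-attribute sample table plus a project config. The function name `make_pep_from_stdin`, the output file names, and the `file` column are assumptions of this example; it is not how looper does, or would, implement this.

```shell
#!/usr/bin/env bash
# Hypothetical helper (NOT a looper feature): build a minimal PEP from file
# names read on stdin, one per line.
# Writes OUTDIR/samples.csv and OUTDIR/project_config.yaml.
make_pep_from_stdin() {
  local outdir="${1:-adhoc_pep}" path name
  mkdir -p "$outdir"
  # A PEP sample table needs a sample_name column; "file" is the single
  # extra attribute chosen for this sketch.
  printf 'sample_name,file\n' > "$outdir/samples.csv"
  while IFS= read -r path; do
    [ -n "$path" ] || continue
    name=$(basename "$path")
    name=${name%.*}   # strip the last extension to get a sample name
    # No CSV escaping is done here: paths containing commas would break this.
    printf '%s,%s\n' "$name" "$path" >> "$outdir/samples.csv"
  done
  # Minimal PEP project config pointing at the sample table.
  printf 'pep_version: 2.1.0\nsample_table: samples.csv\n' \
    > "$outdir/project_config.yaml"
}
```

For example, `ls *.bdg | make_pep_from_stdin adhoc_pep` would produce a PEP with one sample per file. Running it through looper would still require a pipeline interface that supplies the command template, which is the first open question above.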
Just putting this here for reference: I went down the rabbit hole a bit further, and it is possible to parallelize natively in bash using xargs:
ls *.bdg | xargs -P "$(nproc)" -I {} bash -c 'bigtools bedGraphToBigWig "{}" "{}.bw"'
Of course, this only runs in parallel when $(nproc) returns a value greater than one, so you would still need to allocate some cores for yourself. It's an interesting stopgap, but I still think the looper version proposed above would be way better.
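A slightly more defensive variant of the xargs approach is sketched below. It passes NUL-delimited names so that spaces in file names are safe, and hands each name to the command as a positional argument instead of splicing it into the `bash -c` string. The function name `run_parallel` and the `JOBS` variable are assumptions of this sketch; `-0` and `-P` are supported by GNU and BSD xargs but are not POSIX, and `nproc` comes from GNU coreutils.

```shell
#!/usr/bin/env bash
# run_parallel CMD [ARGS...]
# Reads NUL-delimited items on stdin and runs "CMD ARGS... ITEM" once per
# item, with up to JOBS processes at a time (default: the output of nproc).
run_parallel() {
  xargs -0 -n 1 -P "${JOBS:-$(nproc)}" "$@"
}

# Example invocation (left commented out so that sourcing this file runs nothing):
#   printf '%s\0' *.bdg |
#     run_parallel sh -c 'bigtools bedGraphToBigWig "$1" "$1.bw"' _
```

In the example invocation, the trailing `_` fills `$0` of the inner shell, so each file name arrives as `$1`.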