New Operators based on htsjdk.
This PR is just for discussion. Please don't commit (of course :laughing: )
This PR references:
- https://twitter.com/yokofakun/status/1159072968158457856
- https://gitter.im/nextflow-io/nextflow?at=5d4bf24a90bba62a123b78c8
- https://twitter.com/yokofakun/status/1159468857934929922
I've implemented a draft of operators extracting the samples and dictionaries from a VCF/BAM/CRAM file in:
- https://github.com/lindenb/nextflow/blob/pl_htsjdk_op/modules/nextflow/src/main/groovy/nextflow/splitter/SamplesSplitter.groovy
- https://github.com/lindenb/nextflow/blob/pl_htsjdk_op/modules/nextflow/src/main/groovy/nextflow/splitter/DictionarySplitter.groovy
The package was '/groovy/nextflow/splitter' because, as far as I understand, the splitterFactory requires it.
I'm not a Groovy guy, so I kept a basic Java syntax.
The classes extend AbstractSplitter, which I think is a bad idea because instances of these classes treat the input as a stream of lines, whereas VCF, BAM and CRAM files should be treated as a simple Path. So I'm cheating in newReader: I read the data into an ugly StringBuilder and return a BufferedReader that reads this string.
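To make the workaround concrete, here is a minimal Java sketch of the "serialise into a StringBuilder, then hand back a BufferedReader" trick. It is not the code from the PR: the class and method names are hypothetical, and the list of records stands in for whatever htsjdk would extract from the file header.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration only; not the PR's SamplesSplitter.
class LineReaderShim {
    /**
     * Writes already-extracted records (assumed to come from the file header)
     * one per line into a StringBuilder and exposes the text as a reader,
     * so that a line-oriented consumer can iterate over them.
     */
    static BufferedReader asLineReader(List<String> records) {
        StringBuilder sb = new StringBuilder();
        for (String record : records) {
            sb.append(record).append('\n');
        }
        return new BufferedReader(new StringReader(sb.toString()));
    }

    public static void main(String[] args) {
        BufferedReader reader = asLineReader(List.of("S1", "S2", "S3"));
        // BufferedReader.lines() yields one element per line.
        System.out.println(reader.lines().collect(Collectors.toList()));
    }
}
```

The cost of this approach is that everything is buffered in memory and re-parsed as text, which is why it feels like a cheat for inputs that are really just a Path.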
Maybe it's worthless: as I said, it's always possible to extract this information with a bash script. Anyway, it is a good way to see a few internals of NF :-)
P.
Thanks for this PR, I think it's very valuable. I saw you were also proposing
Channel.fromMap("my.bam").flatMap(Htsjdk::extractSamples)
That's neat, but the current Groovy runtime doesn't support modern Java method references (I think Groovy 3 will). In any case it would require some extra code to adapt to NF internals, so I'm not sure there would be a concrete advantage from an implementation point of view.
Regarding the PR
The classes extend AbstractSplitter which I think is a bad idea because the instances of this classes treat the input as a stream of lines whereas the VCF, BAM, CRAM files should be treated as a simple Path
Well, the API is designed to handle a file or just a stream of bytes/chars. If htsjdk cannot handle streams, I think it's fair to stick to files and throw an UnsupportedOperationException in all other cases. Maybe it could also make sense to extend AbstractSplitter directly, though I'm just guessing; I'm not sure at this time about the pros and cons of that.
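As a rough sketch of that suggestion, the check could look like the following in plain Java. The class and method names are hypothetical and this is not Nextflow's splitter API; it only illustrates accepting file-like targets and rejecting everything else.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical illustration only; not part of Nextflow or the PR.
class FileOnlyTarget {
    /** Returns the target as a Path, or rejects non-file inputs such as streams. */
    static Path requirePath(Object target) {
        if (target instanceof Path) {
            return (Path) target;
        }
        if (target instanceof File) {
            return ((File) target).toPath();
        }
        String type = (target == null) ? "null" : target.getClass().getName();
        throw new UnsupportedOperationException(
                "Only file inputs are supported, got: " + type);
    }

    public static void main(String[] args) {
        System.out.println(requirePath(Paths.get("toy.bam")));
        try {
            requirePath(new ByteArrayInputStream(new byte[0]));
        } catch (UnsupportedOperationException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```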
Surely it would be important to add a few unit tests, at least one for each supported file type.
One thing: please do not include changes to Const.groovy, conf.py and nextflow in the commit, since they are automatically generated by the build script and just mess up the merge process.
Thanks!
@pditommaso I'm currently working on a much better solution using `flatMap`, give me 10'...
@pditommaso pushed... aaaand I'm back. I deleted my classes and moved everything into a new package: nextflow.htsjdk.
It contains only one class, HtsjdkUtils ( https://github.com/lindenb/nextflow/blob/pl_htsjdk_op/modules/nextflow/src/main/groovy/nextflow/htsjdk/HtsjdkUtils.groovy ), with two static methods that each return a Closure: one for extracting the dictionaries, the other for extracting the samples.
Both closures return an array with two items:
- attributes (sample, species...)
- the original path
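The shape of what each closure returns can be sketched in plain Java as a function from a path to a list of [attributes, path] pairs, which `flatMap` then emits one by one. This is only an illustration: the class name is hypothetical, and the sample names are passed in by hand instead of being read from a file header with htsjdk.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Stream;

// Hypothetical illustration only; not the PR's HtsjdkUtils.
class PairShape {
    /** Builds a function that emits one [attributes, path] item per sample name. */
    static Function<Path, List<List<Object>>> samples(List<String> sampleNames) {
        return path -> {
            List<List<Object>> items = new ArrayList<>();
            for (String name : sampleNames) {
                Map<String, String> attributes = new LinkedHashMap<>();
                attributes.put("SM", name);
                // First item: the attributes; second item: the original path.
                items.add(Arrays.<Object>asList(attributes, path));
            }
            return items;
        };
    }

    public static void main(String[] args) {
        Function<Path, List<List<Object>>> extract = samples(List.of("S1", "S2"));
        // flatMap turns the per-file list into a flat sequence of items.
        Stream.of(Paths.get("toy.vcf.gz"))
              .flatMap(p -> extract.apply(p).stream())
              .forEach(System.out::println);
    }
}
```

Keeping the original path as the second item means a downstream step can still pair each sample or contig with the file it came from.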
example:
import nextflow.htsjdk.HtsjdkUtils
Channel.fromPath("/home/lindenb/src/jvarkit-git/src/test/resources/rotavirus_rf.vcf.gz").
flatMap(HtsjdkUtils.dictionary()).
println()
Channel.fromPath("/home/lindenb/src/jvarkit-git/src/test/resources/toy.bam").
flatMap(HtsjdkUtils.dictionary()).
println()
Channel.fromPath("/home/lindenb/src/jvarkit-git/src/test/resources/toy.fa").
flatMap(HtsjdkUtils.dictionary()).
println()
Channel.fromPath("/home/lindenb/src/jvarkit-git/src/test/resources/rotavirus_rf.vcf.gz").
flatMap(HtsjdkUtils.samples()).
println()
Channel.fromPath("/home/lindenb/src/jvarkit-git/src/test/resources/toy.bam").
flatMap(HtsjdkUtils.samples()).
println()
execute
$ ./launch.sh run test.nf
N E X T F L O W ~ version 19.08.0-SNAPSHOT
Launching `test.nf` [big_murdock] - revision: eccf5bb5b0
[[SM:S1], /home/lindenb/src/jvarkit-git/src/test/resources/rotavirus_rf.vcf.gz]
[[SM:S2], /home/lindenb/src/jvarkit-git/src/test/resources/rotavirus_rf.vcf.gz]
[[SM:S3], /home/lindenb/src/jvarkit-git/src/test/resources/rotavirus_rf.vcf.gz]
[[SM:S4], /home/lindenb/src/jvarkit-git/src/test/resources/rotavirus_rf.vcf.gz]
[[SM:S5], /home/lindenb/src/jvarkit-git/src/test/resources/rotavirus_rf.vcf.gz]
[[index:0, LN:45, SN:ref], /home/lindenb/src/jvarkit-git/src/test/resources/toy.bam]
[[index:1, LN:40, SN:ref2], /home/lindenb/src/jvarkit-git/src/test/resources/toy.bam]
[[LB=S1, PL=illumina, SM=S1, PU=run1, CN=Nantes, DS=S1], /home/lindenb/src/jvarkit-git/src/test/resources/toy.bam]
[[index:0, LN:3302, SN:RF01], /home/lindenb/src/jvarkit-git/src/test/resources/rotavirus_rf.vcf.gz]
[[index:1, LN:2687, SN:RF02], /home/lindenb/src/jvarkit-git/src/test/resources/rotavirus_rf.vcf.gz]
[[index:2, LN:2592, SN:RF03], /home/lindenb/src/jvarkit-git/src/test/resources/rotavirus_rf.vcf.gz]
[[index:3, LN:2362, SN:RF04], /home/lindenb/src/jvarkit-git/src/test/resources/rotavirus_rf.vcf.gz]
[[index:4, LN:1579, SN:RF05], /home/lindenb/src/jvarkit-git/src/test/resources/rotavirus_rf.vcf.gz]
[[index:5, LN:1356, SN:RF06], /home/lindenb/src/jvarkit-git/src/test/resources/rotavirus_rf.vcf.gz]
[[index:6, LN:1074, SN:RF07], /home/lindenb/src/jvarkit-git/src/test/resources/rotavirus_rf.vcf.gz]
[[index:7, LN:1059, SN:RF08], /home/lindenb/src/jvarkit-git/src/test/resources/rotavirus_rf.vcf.gz]
[[index:8, LN:1062, SN:RF09], /home/lindenb/src/jvarkit-git/src/test/resources/rotavirus_rf.vcf.gz]
[[index:9, LN:751, SN:RF10], /home/lindenb/src/jvarkit-git/src/test/resources/rotavirus_rf.vcf.gz]
[[index:10, LN:666, SN:RF11], /home/lindenb/src/jvarkit-git/src/test/resources/rotavirus_rf.vcf.gz]
[[index:0, LN:45, SN:ref], /home/lindenb/src/jvarkit-git/src/test/resources/toy.fa]
[[index:1, LN:40, SN:ref2], /home/lindenb/src/jvarkit-git/src/test/resources/toy.fa]
I have to say, neat! At this point, I need to find a nice way to include it as a plugin.
@pditommaso I'm glad you like it. This solution is much more flexible.
Plugin? I can't find any 'plugin' in the NF doc :-)
I can't find any 'plugin' in the NF doc :-)
This is the problem :)
I'm dreaming of a mechanism to import it as an external module https://www.nextflow.io/docs/latest/dsl2.html#modules
Now that we have support for custom operators, it could be possible to integrate this into a separate plugin. Tagging @bentsherman in case he wants to give it a try.
Thanks for exploring this possibility. This could be an interesting plugin if there's interest at some point.