Implement parallelization for BQSR
Cf. http://gatkforums.broadinstitute.org/wdl/discussion/1988/a-primer-on-parallelism-with-the-gatk
and http://gatkforums.broadinstitute.org/gatk/discussion/1919/parallelizing-base-quality-score-recalibration for specifics wrt BQSR
This can be scatter-gathered according to the above links. Will be a huge win for us.
Can also be done with Spark
@smondet Assigning you as you mentioned being mostly there on this at some pt. Unassign if you can't get to this, no worries.
The links above imply something different from what is implemented in #286. While they gather the individual statistics in parallel, there is a reduce step that creates a single covariate table.
Second, the gathering will be more complicated than just concatenating text files. To combine the GATKReports, you need to fundamentally understand the GATKReport format. Reports have to be combined statistically, by adding the observations of each covariate and recalculating the Estimated Q value of the combined report.
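To make the "statistical combination" concrete, here is a rough sketch of the reduce step. It is only an illustration, not GATK's gatherer and not what #286 does. It assumes each per-shard table has already been parsed into a dict mapping a covariate key to `(observations, errors, estimated_q)`; that layout and the names `expected_errors` and `combine_tables` are hypothetical. It also assumes the combined Estimated Q can be derived from the summed expected errors over the summed observations, using the standard Phred relation Q = -10 * log10(p).

```python
import math


def expected_errors(observations, estimated_q):
    """Expected error count implied by a Phred-scaled quality (assumed model)."""
    return observations * 10 ** (-estimated_q / 10.0)


def combine_tables(tables):
    """Merge per-shard tables of key -> (observations, errors, estimated_q).

    Hypothetical layout: parsing real GATKReport files is not shown here.
    """
    totals = {}
    for table in tables:
        for key, (obs, errs, est_q) in table.items():
            t_obs, t_errs, t_exp = totals.get(key, (0, 0, 0.0))
            totals[key] = (
                t_obs + obs,
                t_errs + errs,
                t_exp + expected_errors(obs, est_q),
            )

    combined = {}
    for key, (obs, errs, exp) in totals.items():
        # Recompute the Estimated Q from the pooled expected error rate.
        if obs > 0 and exp > 0:
            est_q = -10.0 * math.log10(exp / obs)
        else:
            est_q = 0.0
        combined[key] = (obs, errs, est_q)
    return combined


shard_a = {("rg1", 20): (1000, 10, 20.0)}
shard_b = {("rg1", 20): (1000, 1, 30.0)}
merged = combine_tables([shard_a, shard_b])
```

The point is that counts are summed per covariate and the quality is recomputed from the pooled counts. Concatenating the text files would instead leave duplicate rows with per-shard qualities.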
Nice catch; seems pretty important. @smondet is that something we could support in #286?
@ihodes I don't think I understand the data well enough to do that "statistical combination". If we have a tool that does it, then yes, it's doable (but harder than the current version).