
RFC: New subworkflow bam_postprocessing

Open matthdsm opened this issue 2 years ago • 7 comments

Hi, I'd like to suggest a new subworkflow, or a suite of subworkflows, for BAM postprocessing. It would cover the common steps taken after alignment:

  • sort
  • markdup
  • optional BQSR

Since multiple toolchains offer these steps, I suggest adding a "suite" of subworkflows, so each one can be added on demand.
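To make the proposed interface concrete, here is a minimal Python sketch of how a caller might pick a suite and toggle the optional BQSR step. It is only an illustration of the idea, not nf-core code: the function name `plan_steps`, the `run_bqsr` flag, and the suite keys are hypothetical names chosen for this example.

```python
# Hypothetical sketch: choose a postprocessing suite and build the step list.
# None of these names come from nf-core; they only illustrate the proposal.

SUITES = ("biobambam", "elprep", "gatk")  # the toolchains named in this issue


def plan_steps(suite: str, run_bqsr: bool = False) -> list[str]:
    """Return the ordered postprocessing steps for one suite."""
    if suite not in SUITES:
        raise ValueError(f"unknown suite: {suite!r}")
    steps = ["sort", "markdup"]  # always run after alignment
    if run_bqsr:
        steps.append("bqsr")  # BQSR is optional in the proposal
    return steps


if __name__ == "__main__":
    print(plan_steps("elprep"))
    print(plan_steps("gatk", run_bqsr=True))
```

In an actual subworkflow the same choice would be made through pipeline parameters; the sketch only shows that every suite exposes the same ordered steps.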

Examples:

  • Biobambam suite

flowchart LR
        BIOBAMBAM_BAM([BAM])                --> SPLIT[Split tool TBD]
        SPLIT                               --split by chromosome--> BAMSORMADUP[BamSorMaDup]
        BAMSORMADUP                         --sort/mark duplicates-->  BIOBAMBAM_SORTBAM([Postprocessed BAM])

  • Elprep suite

flowchart LR
        ELPREP_BAM([BAM])                   --> ELPREP_SPLIT[Elprep split]
        ELPREP_SPLIT                        --split by chromosome--> ELPREP_FILTER[Elprep filter]
        ELPREP_FILTER                       --sort/mark duplicates--> ELPREP_MERGE[Elprep merge]
        ELPREP_FILTER                       --BQSR/variant calling--> ELPREP_MERGE[Elprep merge]
        ELPREP_MERGE                        --> ELPREP_SORTBAM([Postprocessed BAM])
        ELPREP_MERGE                        --> ELPREP_GVCF(["Optional gVCF"])
  • GATK suite

    • no example
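Both flowcharts above follow the same scatter/gather pattern: split by chromosome, process each piece independently, then merge. The Python sketch below shows that pattern on toy data. It makes strong simplifying assumptions: reads are plain tuples rather than BAM records, and a "duplicate" is any later read at the same position on the same chromosome, which is far cruder than what BamSorMaDup or elPrep actually do. All names (`Read`, `split_by_chromosome`, `sort_and_mark`, `postprocess`) are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    """Toy stand-in for an aligned read (not a real BAM record)."""
    name: str
    chrom: str
    pos: int
    duplicate: bool = False


def split_by_chromosome(reads):
    """Scatter: group reads by chromosome so groups can run independently."""
    groups = defaultdict(list)
    for read in reads:
        groups[read.chrom].append(read)
    return dict(groups)


def sort_and_mark(reads):
    """Sort one group by position and flag repeated positions as duplicates.

    Assumption: a shared start position is enough to call a duplicate.
    """
    seen = set()
    out = []
    for read in sorted(reads, key=lambda r: r.pos):  # stable sort
        out.append(read._replace(duplicate=read.pos in seen))
        seen.add(read.pos)
    return out


def postprocess(reads, chrom_order):
    """Gather: merge the processed groups back in reference order."""
    groups = split_by_chromosome(reads)
    merged = []
    for chrom in chrom_order:
        merged.extend(sort_and_mark(groups.get(chrom, [])))
    return merged


if __name__ == "__main__":
    reads = [
        Read("a", "chr2", 5),
        Read("b", "chr1", 10),
        Read("c", "chr1", 3),
        Read("d", "chr1", 10),
    ]
    for read in postprocess(reads, ["chr1", "chr2"]):
        print(read)
```

Because each chromosome group is processed without reference to the others, the per-group step is what a workflow engine could run in parallel.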
  • [x] This module does not exist yet with the nf-core modules list command

  • [x] There is no open pull request for this module

  • [x] There is no open issue for this module

  • [x] If I'm planning to work on this module, I added myself to the Assignees to facilitate tracking who is working on the module

matthdsm avatar Apr 12 '22 06:04 matthdsm

Sarek has some bam_postprocessing subworkflows based on GATK Best Practices that @maxulysse and I have been planning to add to subworkflows at some point. I think they would fit here: Merge or MarkDup -> QC (Qualimap, EstimateLibraryComplexity, samtools) -> BQSR (split by GATK-recommended intervals; nothing needs to be done, the tool takes care of it) -> GatherTables -> ApplyBQSR -> QC (Qualimap, EstimateLibraryComplexity, samtools)

FriederikeHanssen avatar Apr 12 '22 08:04 FriederikeHanssen

Is there a reason you perform the QC steps twice? We generally don't bother with BQSR, but if you've got a solid reason for it, we might have to reconsider.

matthdsm avatar Apr 12 '22 08:04 matthdsm

Mainly user requests to get metrics after both steps. Oops, EstimateLibraryComplexity is only run after MarkDuplicates; that was a small copy-and-paste error.

FriederikeHanssen avatar Apr 12 '22 08:04 FriederikeHanssen

Are those steps something that would be useful in the general bam_qc subworkflow? I've never used qualimap. What would be the added benefit over other tools (speed, metrics)?

matthdsm avatar Apr 12 '22 08:04 matthdsm

I was not involved in the original decision to use qualimap. @maxulysse, do you remember the reasoning? I looked into replacing it at some point (since we now use CRAM in sarek and qualimap does not natively support it at the moment), but I couldn't find a good replacement. In sarek it is mainly used to retrieve coverage plots and the GC content distribution. Which QC tools do you generally use?

FriederikeHanssen avatar Apr 13 '22 07:04 FriederikeHanssen

I took a look at qualimap and found it to be quite slow. Personally, I prefer spreading an analysis horizontally, so I'd use many small processes instead of one monolith. I found that mosdepth and goleft provide excellent coverage statistics. As for the other metrics, I'm seeing a lot of duplication with picard and fastqc.

Since we have a great tool to aggregate all of those (multiqc), I don't see much use in a slow beast like qualimap.

matthdsm avatar Apr 13 '22 07:04 matthdsm

If there's still interest in this, there's a POC here. It's only missing a small script to merge the metrics before emitting them.

matthdsm avatar Jun 20 '22 07:06 matthdsm

Closing this. I'll revisit when I get to it.

matthdsm avatar Oct 20 '22 08:10 matthdsm