oncoanalyser

Eliminate bottlenecking of markdups

Open · SPPearce opened this issue 1 year ago · 1 comment

Description of feature

The pipeline currently seems to have a bottleneck at the alignment -> markdups step, where all alignment has to complete before any markdups processes begin. The pipeline already uses groupKey to declare how many files to expect from the splitting process, but this happens after the bwamem2 mapping step.
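
For illustration, a minimal sketch of the pattern being described (channel, process, and meta field names here are hypothetical, not oncoanalyser's actual identifiers): attaching groupKey before the expensive mapping step lets groupTuple emit each sample's tuple as soon as that sample's chunks arrive, rather than waiting for the whole channel to close.

```nextflow
// Hypothetical sketch, not oncoanalyser's actual code.
// groupKey(key, n) records at split time how many items to expect,
// so the downstream groupTuple can emit a sample as soon as its
// n chunks are aligned, instead of blocking on all alignments.
ch_fastq_split
    .map { meta, fastqs ->
        [groupKey(meta.sample_id, meta.n_chunks), meta, fastqs]
    }
    .set { ch_to_align }

BWAMEM2_ALIGN(ch_to_align)

BWAMEM2_ALIGN.out.bam
    .groupTuple()            // emits per sample once n_chunks BAMs arrive
    .set { ch_markdups_in }

MARKDUPS(ch_markdups_in)
```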

SPPearce avatar Aug 02 '24 09:08 SPPearce

I haven't been able to replicate the bottleneck as I understand it from your description.

For some additional context, each MarkDups task must receive all BAMs for a given sample before starting to process and merge into a single output BAM. So blocking in that sense on a per-sample basis is intended and required. However, there should not be blocking/bottlenecking where all alignments must complete before any MarkDups process begins.
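
To make the intended semantics concrete, a minimal sketch (hypothetical channel names, not oncoanalyser's actual code):

```nextflow
// Hypothetical sketch: per-sample gathering before MarkDups.
// Each emitted tuple contains every BAM for one sample, so MarkDups
// blocks per sample (required for the merge), not globally.
BWAMEM2_ALIGN.out.bam          // [sample_id, bam] per lane/chunk
    .groupTuple()              // gather all BAMs for one sample
    .set { ch_markdups_in }    // [sample_id, [bam1, bam2, ...]]

MARKDUPS(ch_markdups_in)       // starts as soon as its sample is complete
```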

I've run oncoanalyser in stub mode and added an artificial 60 second delay to one sample in the bwa-mem2 process to evaluate flow through the NF channels. As expected, each MarkDups task runs as soon as its set of sample BAMs becomes available (see attached timeline and below expandable to replicate).

If you're seeing different behaviour, could you please provide some additional details of your observations and how you're running oncoanalyser?


Attachment: execution_timeline_2024-08-05_12-36-17.html.gz

oncoanalyser bwa-mem2/MarkDups data flow check

Get and patch oncoanalyser with an artificial 60 second delay in bwa-mem2 for the 'sa.tumor' sample

git clone https://github.com/nf-core/oncoanalyser
(cd oncoanalyser/ && git checkout 41010dd)

cat <<EOF > alignment-delay.patch
--- a/oncoanalyser/modules/local/bwa-mem2/mem/main.nf
+++ b/oncoanalyser/modules/local/bwa-mem2/mem/main.nf
@@ -64,6 +64,10 @@ process BWAMEM2_ALIGN {

     """
+    if [[ \${meta.sample_id} == 'sa.tumor' ]]; then
+      sleep 60;
+    fi
+
     touch \${output_fn}
     touch \${output_fn}.bai

EOF

patch -lp1 < alignment-delay.patch

Create samplesheet

cat <<EOF > samplesheet.csv
group_id,subject_id,sample_id,sample_type,sequence_type,filetype,info,filepath
sa_debug,sa,sa.normal,normal,dna,fastq,library_id:sa.normal.lb;lane:1,$(pwd)/temp/sa.normal.R1.fastq.gz;$(pwd)/temp/sa.normal.R2.fastq.gz
sa_debug,sa,sa.tumor,tumor,dna,fastq,library_id:sa.tumor.lb;lane:1,$(pwd)/temp/sa.tumor.R1.fastq.gz;$(pwd)/temp/sa.tumor.R2.fastq.gz

sb_debug,sb,sb.normal,normal,dna,fastq,library_id:sb.normal.lb;lane:1,$(pwd)/temp/sb.normal.R1.fastq.gz;$(pwd)/temp/sb.normal.R2.fastq.gz
sb_debug,sb,sb.tumor,tumor,dna,fastq,library_id:sb.tumor.lb;lane:1,$(pwd)/temp/sb.tumor.R1.fastq.gz;$(pwd)/temp/sb.tumor.R2.fastq.gz
EOF

Create local configuration

cat <<EOF > stub.config
params {
    genomes {
        'GRCh38_hmf' {
            fasta         = "$(pwd)/temp/GRCh38.fasta"
            fai           = "$(pwd)/temp/GRCh38.fai"
            dict          = "$(pwd)/temp/GRCh38.dict"
            bwamem2_index = "$(pwd)/temp/GRCh38_bwa-mem2_index/"
            gridss_index  = "$(pwd)/temp/GRCh38_gridss_index/"
            star_index    = "$(pwd)/temp/GRCh38_star_index/"
        }
    }
    ref_data_virusbreakenddb_path = '$(pwd)/temp/virusbreakenddb_20210401/'
    ref_data_hmf_data_path        = '$(pwd)/temp/hmf_bundle_38/'
    ref_data_panel_data_path      = '$(pwd)/temp/panel_bundle/tso500_38/'
}
EOF

Run oncoanalyser

nextflow run -config stub.config oncoanalyser/main.nf \
  \
  -stub \
  --create_stub_placeholders \
  \
  --max_cpus 1 \
  --max_memory 1.GB \
  \
  --mode wgts \
  --genome GRCh38_hmf \
  --input samplesheet.csv \
  --outdir output_stub/

scwatts avatar Aug 05 '24 02:08 scwatts

Closing the issue but please re-open if you'd like to discuss further!

scwatts avatar Sep 11 '24 23:09 scwatts

Ah, completely forgot about this one, I've been busy with other bits ATM.

SPPearce avatar Sep 12 '24 06:09 SPPearce