MarkDuplicates step fails with UMIs from read names and 4 lanes
Description of the bug
I am using Sarek with a configuration that reads UMIs from the FASTQ read names.
The workflow fails at the "MarkDuplicates" step with the error message:
htsjdk.samtools.SAMException: Value was put into PairInfoMap more than once. 1: RGHD789:806050
According to the documentation of this error, it can result from reads having the same read name in the BAM file.
I have four lanes for one sample and realized that UMI consensus calling is performed for each lane separately. During that process, the read names are changed to a scheme containing the sample name HD789 and a continuous number. When BAMs for all four lanes are merged, this results into different reads having the same name (4x readpairs since I have 4 lanes) that results into this error in the MarkDuplicates step.
In general. I was wondering if it is intended to call UMI consensus for each lane separately or whether it should be called on a merged file, as the same UMI could be distributed across multiple lanes.
Command used and terminal output
nextflow run nf-core-sarek-3.0.2/workflow/ --input run1_samplesheet.csv --outdir run1_full --genome GATK.GRCh38 --igenomes_base igenomes/references -profile singularity -c TSO500.config --tools freebayes,mpileup,mutect2,strelka,manta,tiddit,merge -resume --umi_read_structure NA --group_by_umi_strategy paired --snpeff_cache snpeff/ --vep_cache vep/
Error executing process > 'NFCORE_SAREK:SAREK:MARKDUPLICATES:GATK4_MARKDUPLICATES (HD789)'
Caused by:
Process `NFCORE_SAREK:SAREK:MARKDUPLICATES:GATK4_MARKDUPLICATES (HD789)` terminated with an error exit status (3)
Command executed:
gatk --java-options "-Xmx30g" MarkDuplicates \
--INPUT HD789-L002.0006.bam --INPUT HD789-L002.0002.bam --INPUT HD789-L002.0003.bam --INPUT HD789-L002.0001.bam --INPUT HD789-L002.0005.bam --INPUT HD789-L002.0004.bam --INPUT HD789-L004.0001.bam --INPUT HD789-L004.0003.bam --INPUT HD789-L004.0006.bam --INPUT HD789-L004.0004.bam --INPUT HD789-L004.0002.bam --INPUT HD789-L004.0005.bam --INPUT HD789-L001.0006.bam --INPUT HD789-L001.0003.bam --INPUT HD789-L001.0002.bam --INPUT HD789-L001.0005.bam --INPUT HD789-L001.0004.bam --INPUT HD789-L001.0001.bam --INPUT HD789-L003.0001.bam --INPUT HD789-L003.0002.bam --INPUT HD789-L003.0004.bam --INPUT HD789-L003.0003.bam --INPUT HD789-L003.0006.bam --INPUT HD789-L003.0005.bam \
--OUTPUT HD789.md.bam \
--METRICS_FILE HD789.md.metrics \
--TMP_DIR . \
-REMOVE_DUPLICATES false -VALIDATION_STRINGENCY LENIENT --CREATE_INDEX true
cat <<-END_VERSIONS > versions.yml
"NFCORE_SAREK:SAREK:MARKDUPLICATES:GATK4_MARKDUPLICATES":
gatk4: $(echo $(gatk --version 2>&1) | sed 's/^.*(GATK) v//; s/ .*$//')
END_VERSIONS
Command exit status:
3
Command output:
(empty)
Command error:
INFO 2022-10-18 10:53:55 MarkDuplicates Tracking 9051 as yet unmatched pairs. 546 records in RAM.
INFO 2022-10-18 10:53:59 MarkDuplicates Read 3,000,000 records. Elapsed time: 00:00:12s. Time for last 1,000,000: 3s. Last read position: chr1:113,104,534
INFO 2022-10-18 10:53:59 MarkDuplicates Tracking 16635 as yet unmatched pairs. 708 records in RAM.
INFO 2022-10-18 10:54:02 MarkDuplicates Read 4,000,000 records. Elapsed time: 00:00:16s. Time for last 1,000,000: 3s. Last read position: chr1:156,880,088
INFO 2022-10-18 10:54:02 MarkDuplicates Tracking 20851 as yet unmatched pairs. 895 records in RAM.
INFO 2022-10-18 10:54:07 MarkDuplicates Read 5,000,000 records. Elapsed time: 00:00:21s. Time for last 1,000,000: 4s. Last read position: chr1:193,212,181
INFO 2022-10-18 10:54:07 MarkDup Elapsed time: 00:00:16s. Time for last 1,000,000: 3s. Last read position: chr1:156,880,088
INFO 2022-10-18 10:54:02 MarkDuplicates Tracking 20851 as yet unmatched pairs. 895 records in RAM.
INFO 2022-10-18 10:54:07 MarkDuplicates Read 5,000,000 records. Elapsed time: 00:00:21s. Time for last 1,000,000: 4s. Last read position: chr1:193,212,181
INFO 2022-10-18 10:54:07 MarkDuplicates Tracking 25749 as yet unmatched pairs. 558 records in RAM.
INFO 2022-10-18 10:54:11 MarkDuplicates Read 6,000,000 records. Elapsed time: 00:00:25s. Time for last 1,000,000: 4s. Last read position: chr1:231,436,969
INFO 2022-10-18 10:54:11 MarkDuplicates Tracking 30831 as yet unmatched pairs. 193 records in RAM.
INFO 2022-10-18 10:54:17 MarkDuplicates Read 7,000,000 records. Elapsed time: 00:00:31s. Time for last 1,000,000: 5s. Last read position: chr2:15,946,332
INFO 2022-10-18 10:54:17 MarkDuplicates Tracking 39249 as yet unmatched pairs. 6622 records in RAM.
INFO 2022-10-18 10:54:23 MarkDuplicates Read 8,000,000 records. Elapsed time: 00:00:37s. Time for last 1,000,000: 5s. Last read position: chr2:29,511,003
INFO 2022-10-18 10:54:23 MarkDuplicates Tracking 43958 as yet unmatched pairs. 8035 records in RAM.
INFO 2022-10-18 10:54:28 MarkDuplicates Read 9,000,000 records. Elapsed time: 00:00:42s. Time for last 1,000,000: 5s. Last read position: chr2:74,090,879
INFO 2022-10-18 10:54:28 MarkDuplicates Tracking 49514 as yet unmatched pairs. 2513 records in RAM.
INFO 2022-10-18 10:54:32 MarkDuplicates Read 10,000,000 records. Elapsed time: 00:00:46s. Time for last 1,000,000: 3s. Last read position: chr2:113,223,748
INFO 2022-10-18 10:54:32 MarkDuplicates Tracking 54431 as yet unmatched pairs. 3001 records in RAM.
INFO 2022-10-18 10:54:37 MarkDuplicates Read 11,000,000 records. Elapsed time: 00:00:51s. Time for last 1,000,000: 4s. Last read position: chr2:141,169,200
INFO 2022-10-18 10:54:37 MarkDuplicates Tracking 57016 as yet unmatched pairs. 1670 records in RAM.
INFO 2022-10-18 10:54:40 MarkDuplicates Read 12,000,000 records. Elapsed time: 00:00:54s. Time for last 1,000,000: 3s. Last read position: chr2:211,630,484
INFO 2022-10-18 10:54:40 MarkDuplicates Tracking 63059 as yet unmatched pairs. 1454 records in RAM.
INFO 2022-10-18 10:54:45 MarkDuplicates Read 13,000,000 records. Elapsed time: 00:00:58s. Time for last 1,000,000: 4s. Last read position: chr3:10,039,325
INFO 2022-10-18 10:54:45 MarkDuplicates Tracking 66844 as yet unmatched pairs. 5403 records in RAM.
INFO 2022-10-18 10:54:49 MarkDuplicates Read 14,000,000 records. Elapsed time: 00:01:03s. Time for last 1,000,000: 4s. Last read position: chr3:24,097,025
INFO 2022-10-18 10:54:49 MarkDuplicates Tracking 69255 as yet unmatched pairs. 4605 records in RAM.
INFO 2022-10-18 10:54:54 MarkDuplicates Read 15,000,000 records. Elapsed time: 00:01:07s. Time for last 1,000,000: 4s. Last read position: chr3:52,561,677
INFO 2022-10-18 10:54:54 MarkDuplicates Tracking 72825 as yet unmatched pairs. 4569 records in RAM.
INFO 2022-10-18 10:54:58 MarkDuplicates Read 16,000,000 records. Elapsed time: 00:01:12s. Time for last 1,000,000: 4s. Last read position: chr3:122,076,905
INFO 2022-10-18 10:54:58 MarkDuplicates Tracking 76914 as yet unmatched pairs. 2482 records in RAM.
INFO 2022-10-18 10:55:03 MarkDuplicates Read 17,000,000 records. Elapsed time: 00:01:16s. Time for last 1,000,000: 4s. Last read position: chr3:169,764,978
INFO 2022-10-18 10:55:03 MarkDuplicates Tracking 80808 as yet unmatched pairs. 1357 records in RAM.
INFO 2022-10-18 10:55:09 MarkDuplicates Read 18,000,000 records. Elapsed time: 00:01:23s. Time for last 1,000,000: 6s. Last read position: chr3:195,674,026
INFO 2022-10-18 10:55:09 MarkDuplicates Tracking 83573 as yet unmatched pairs. 336 records in RAM.
[Tue Oct 18 10:55:11 GMT 2022] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 1.44 minutes.
Runtime.totalMemory()=5460983808
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
htsjdk.samtools.SAMException: Value was put into PairInfoMap more than once. 3: RGHD789:1546605/A
at htsjdk.samtools.CoordinateSortedPairInfoMap.ensureSequenceLoaded(CoordinateSortedPairInfoMap.java:133)
at htsjdk.samtools.CoordinateSortedPairInfoMap.remove(CoordinateSortedPairInfoMap.java:86)
at picard.sam.markduplicates.util.DiskBasedReadEndsForMarkDuplicatesMap.remove(DiskBasedReadEndsForMarkDuplicatesMap.java:61)
at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:558)
at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:258)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:308)
at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:37)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
Work dir:
/home/kai/ukd/projects/TSO500/project-dev-TSO500-extwf-test/nxf_sarek/run1_full/work/0a/63585815a3434e8db1c0ed8675d2d8
Relevant files
My config file looks like this:
params {
max_memory = 44.GB
max_cpus = 6
max_time = 72.h
// usage of targeted panel
wes = true
intervals = "TSO500_manifest_GRCh38.bed"
//keep some intermediate files for testing
save_bam_mapped = true
save_trimmed = true
save_split = true
save_reference = true
}
singularity {
enabled = true
cacheDir = "/home/kai/software/nextflow_singularity_cache"
}
// extract UMIS from read names, also see https://github.com/nf-core/sarek/issues/746
process {
withName: 'FASTQTOBAM' {
ext.args = '--extract-umis-from-read-names'
}
}
And my samplesheet for this one sample X like this:
patient,sample,lane,status,fastq_1,fastq_2
HD789,HD789,L001,1, HD789-DNA_S2_L001_R1_001.fastq.gz,HD789-DNA_S2_L001_R2_001.fastq.gz
HD789,HD789,L002,1, HD789-DNA_S2_L002_R1_001.fastq.gz,HD789-DNA_S2_L002_R2_001.fastq.gz
HD789,HD789,L003,1,HD789-DNA_S2_L003_R1_001.fastq.gz,HD789-DNA_S2_L003_R2_001.fastq.gz
HD789,HD789,L004,1,HD789-DNA_S2_L004_R1_001.fastq.gz,HD789-DNA_S2_L004_R2_001.fastq.gz
System information
Nextflow: 22.04.5 Hardware: Desktop Executor: local Container engine: Singularity OS: Ubuntu 22.04 Sarek: 3.0.2
Hi @sci-kai ! It might not be necessary to mark duplicates depending on your setup https://nf-co.re/sarek/3.0.2/usage#how-to-handle-umis . As to if the fastq files should be merged beforehand maybe @lescai can comment, please also note that the whole umi subworkflow is currently being overhauled to match the latest recommendations by fgbio
Thanks Friederike for the hint! I tried running it with "skip_tools = 'baserecalibrator,markduplicates'" but it reports an error at the samblaster step within the CREATE_UMI_CONSENSUS module: Can you help me with that error?
Error executing process > 'NFCORE_SAREK:SAREK:CREATE_UMI_CONSENSUS:SAMBLASTER (HD789-L004)'
Caused by:
Process `NFCORE_SAREK:SAREK:CREATE_UMI_CONSENSUS:SAMBLASTER (HD789-L004)` terminated with an error exit status (1)
Command executed:
samtools view -h HD789-L004.bam | \
samblaster -M --addMateTags | \
samtools view -Sb - >HD789-L004_unsorted_tagged.bam
cat <<-END_VERSIONS > versions.yml
"NFCORE_SAREK:SAREK:CREATE_UMI_CONSENSUS:SAMBLASTER":
samblaster: $( samblaster -h 2>&1 | head -n 1 | sed 's/^samblaster: Version //' )
samtools: $(echo $(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*$//')
END_VERSIONS
Command exit status:
1
Command output:
(empty)
Command error:
samblaster: Version 0.1.26
samblaster: Inputting from stdin
samblaster: Outputting to stdout
samblaster: Loaded 3366 header sequence entries.
samblaster: Can't find first and/or second of pair in sam block of length 2 for id: NB552271:97:HWFCVBGXL:4:11401:6367:1054:NCTAGAT+TCGGAGA
samblaster: At location: *:0
samblaster: Are you sure the input is sorted by read ids?samblaster: Exiting early, the following stats are for processing preceeding the error
samblaster: Marked 0 of 51 (0.000%) total read ids as duplicates using 3288k memory in 0.000S CPU seconds and 0S wall time.
samblaster: Premature exit (return code 1).
apologies I'll try to have a look as soon as possible
Hi @sci-kai ! Apologies for the delayed response, I was on vacation. This issue is a bit older but looks like your read files might be unmated: https://github.com/GregoryFaust/samblaster/issues/37 . Could you try name sorting the read files as suggested in the linked issue?
Hi @FriederikeHanssen.
I tried to sort the input FASTQ files by read name using
zcat reads.fq.gz \ | paste - - - - \ | sort -k1,1 -S 3G \ | tr '\t' '\n' \ | gzip > sorted/reads.fq.gz
But still gives the same error message.
In the local work directory the BAM file is also not sorted:
samtools stats HD200-L002.bam | grep "is sorted:": SN is sorted: 0
However, when I run the command within the local work directory with samblaster v0.1.24 and samtools v1.6 (locally installed on our HPC) it successfully finishes:
> samtools view -h HD200-L002.bam | \
> samblaster -M --addMateTags | \
> samtools view -Sb - >HD200-L002_unsorted_tagged.bam ```
samblaster: Version 0.1.24
samblaster: Inputting from stdin
samblaster: Outputting to stdout
samblaster: Loaded 3366 header sequence entries.
samblaster: Marked 5957935 of 10699537 (55.68%) read ids as duplicates using 85164k memory in 19.209S CPU seconds and 4M40S(280S) wall time.
Reading over the issue again I am confused now. So in the original run (where MD failed), samblaster ran fine? But now it fails?
Hi @FriederikeHanssen,
sorry for the confusion and delay. Between the two runs I switched parameters from the command line into a separate configuration file. When using the configuration as described in the first post with "--skip_tools baserecalibrator, markduplicates" the UMI pipelines now runs successfully. However, if I move most parameters from the command line into a separate config file, I got the error in the samblaster step. I do not know which parameter causes this error, but it shouldn't make a difference when parameters are specified in config file or command line, right?
command line:
nextflow run ../nf-core-sarek-3.0.2/workflow/ \
--igenomes_base ../db/igenomes/references \
-profile singularity \
-c config/TSO500-hilbert-UMI.config \
-resume
config file:
params {
//Workflow arguments
input = 'config/samplesheet.csv'
outdir = 'results/'
tools = 'freebayes,mpileup,mutect2,strelka,manta,tiddit,merge'
skip_tools = 'baserecalibrator,markduplicates'
step = 'mapping'
//References and cache
genome = 'GATK.GRCh38'
vep_cache = '../db/vep'
snpeff_cache = '../db/snpeff'
// UMI handling
umi_read_structure = 'NA'
group_by_umi_strategy = 'adjacency'
// usage of targeted panel
wes = true
intervals = "../db/TSO500_manifest_GRCh38.bed"
// Resources
max_memory = 200.GB
max_cpus = 32
max_time = 72.h
}
process {
executor = 'pbspro'
module = 'Singularity/3.7.1'
queuesize = 100
clusterOptions = '-A "ZPM"'
// extract UMIS from read names, also see https://github.com/nf-core/sarek/issues/746
withName: 'FASTQTOBAM' {
ext.args = '--extract-umis-from-read-names'
}
}
error
Error executing process > 'NFCORE_SAREK:SAREK:CREATE_UMI_CONSENSUS:SAMBLASTER (HD789-L004)'
Caused by:
Process `NFCORE_SAREK:SAREK:CREATE_UMI_CONSENSUS:SAMBLASTER (HD789-L004)` terminated with an error exit status (1)
Command executed:
samtools view -h HD789-L004.bam | \
samblaster -M --addMateTags | \
samtools view -Sb - >HD789-L004_unsorted_tagged.bam
cat <<-END_VERSIONS > versions.yml
"NFCORE_SAREK:SAREK:CREATE_UMI_CONSENSUS:SAMBLASTER":
samblaster: $( samblaster -h 2>&1 | head -n 1 | sed 's/^samblaster: Version //' )
samtools: $(echo $(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*$//')
END_VERSIONS
Command exit status:
1
Command output:
(empty)
Command error:
INFO: Converting SIF file to temporary sandbox...
samblaster: Version 0.1.26
samblaster: Inputting from stdin
samblaster: Outputting to stdout
samblaster: Loaded 3366 header sequence entries.
samblaster: Can't find first and/or second of pair in sam block of length 2 for id: NB552271:97:HWFCVBGXL:4:11401:6367:1054:NCTAGAT+TCGGAGA
samblaster: At location: *:0
samblaster: Are you sure the input is sorted by read ids?samblaster: Exiting early, the following stats are for processing preceeding the error
samblaster: Marked 0 of 51 (0.000%) total read ids as duplicates using 1932k memory in 0.002S CPU seconds and 0S wall time.
samblaster: Premature exit (return code 1).
INFO: Cleaning up image...
Hi! Just looking at this issue again. Actually might be completely unrelated to UMI steps, but params should never be provided with a config file but with a params file instead. There is some prioritization magic, where params provided in a config are not overwritten.
Hi, sorry for the late answer on this topic.
My error messages are now completely solved, as I skipped the markduplicates step and also use a proper params file for starting the workflow (thanks for the hint).
So the only open question left here is about the UMI consensus calling across the four files for @lescai :
I have four lanes for one sample and realized that UMI consensus calling is performed for each lane separately. [...] In general. I was wondering if it is intended to call UMI consensus for each lane separately or whether it should be called on a merged file, as the same UMI could be distributed across multiple lanes.
Hi, I am having the same issue. I have FASTQ files spread on multiple lanes, but, I believe, originating from the same library prep. I think it is ok to combine the FASTQ before processing in this instance. In general, would it not be better to add another variable to the sample sheet, like library preparation, and add a merge step for FASTQ files from the same sample, same library preparation and different lanes ? Best
@sci-kai Hi! Did you arrive at a solution? For a sample I was thinking of merging the umi-consensus bam files from e.g. 4 lanes and re-doing the workflow by using FGBIO_GROUPREADSBYUMI and FGBIO_CALLMOLECULARCONSENSUSREADS.
Ugh, this is bad, the consensus should definitely be being made in the merged lanes. I'd highly recommend using the nf-core/fastquorum pipeline for consensus read generation instead.