sga
Error: Duplicate read ID - sga-bam2de.pl
Hi,
I'm trying to build scaffolds from three matepair libraries with 3kb, 5kb and 8kb inserts (currently in BAM format). The person who generated these libraries followed all the steps in your example scripts up to the scaffolding stage.
When I run sga-bam2de.pl on the libraries with the same settings as the "Scaffolding multiple libraries" wiki page, each of the three files produces an error message like the one below; only the duplicate read ID differs.
abyss-fixmate -h KLS0691b.matepair.3kb.sorted.tmp.hist /localscratch/path/to/data/pe/KLS0691b.matepair.5kb.sorted.bam | samtools view -Sb - > KLS0691b.matepair.3kb.sorted.diffcontigs.bam
error: duplicate read ID `HWI-ST1408:124:CA3J7ANXX:4:1309:4959:8862/1'
[samopen] SAM header is present: 2455895 sequences.
[sam_read1] reference 'ID:bwa PN:bwa VN:0.7.13-r1126 CL:bwa mem -t 8 bwa_contigs1_index/index ../1_trimmed_AdapterRemoval/KLS0691b_5KB_GCCAAT_R1_t.fastq.gz ../1_trimmed_AdapterRemoval/KLS0691b_5KB_GCCAAT_R2_t.fastq.gz
contig-1172471 LN:289
@SQ SN:contig-1223316 LN:242
@SQ SN:contig-9458!' is recognized as '*'.
[main_samview] truncated file.
awk '$2 >= 3' KLS0691b.matepair.3kb.sorted.tmp.hist > KLS0691b.matepair.3kb.sorted.hist
awk: cmd. line:1: fatal: cannot open file `KLS0691b.matepair.3kb.sorted.tmp.hist' for reading (No such file or directory)
samtools sort KLS0691b.matepair.3kb.sorted.diffcontigs.bam KLS0691b.matepair.3kb.sorted.diffcontigs.sorted
DistanceEst -s 200 --mind -99 -n 5 -k 99 -j 1 -o KLS0691b.matepair.3kb.sorted.de KLS0691b.matepair.3kb.sorted.hist -l 100 KLS0691b.matepair.3kb.sorted.diffcontigs.sorted.bam
error: the histogram `KLS0691b.matepair.3kb.sorted.hist' is empty
It seems the duplicate read ID is what's triggering the error, but I'm unsure how to solve it. Any help or insight would be appreciated.
Cheers
I had the same problem. I fixed it by removing all secondary or supplementary alignments from the BAM files, e.g. samtools view -b -h -F 0x800 -o filtered.bam input.bam (or -F 0x100 if you used the -M flag when aligning with BWA, which marks shorter split hits as secondary rather than supplementary).
I didn't dig too deeply, but I think the problem is this: when only one read of a pair has a secondary or supplementary alignment, abyss-fixmate treats that record as a primary alignment and reports it as a duplicate read ID.
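One way to check this on your own data is to count, for each read name and mate, how many alignment records exist and how many of them are secondary (0x100) or supplementary (0x800). The snippet below is only a rough sketch: the function name list_multi_records is made up for illustration, it expects SAM text on stdin (for example from samtools view), and it uses plain integer arithmetic on the FLAG column so it does not rely on gawk-only bitwise functions.

```shell
# Hypothetical helper (not part of sga, abyss or samtools).
# Reads SAM records on stdin and prints one line for every read/mate that
# has more than one record: name/mate, total records, and how many of those
# records are secondary or supplementary.
list_multi_records() {
    awk -F'\t' '
        /^@/ { next }                              # skip SAM header lines
        {
            flag = $2
            mate = (int(flag / 128) % 2) ? 2 : 1   # 0x80 set: second in pair
            key = $1 "/" mate
            total[key]++
            # 0x100 = secondary alignment, 0x800 = supplementary alignment
            if (int(flag / 256) % 2 || int(flag / 2048) % 2)
                extra[key]++
        }
        END {
            for (key in total)
                if (total[key] > 1)
                    printf "%s\t%d\t%d\n", key, total[key], extra[key] + 0
        }
    ' | sort
}

# Example usage (input.bam is a placeholder file name):
#   samtools view input.bam | list_multi_records | head
```

If the explanation above is right, the read IDs named in the abyss-fixmate errors should show up in this list with a non-zero third column. After filtering with both bits excluded at once (samtools view -b -F 0x900, i.e. 0x100 and 0x800 combined), the same check should print nothing.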
You can try reporting it at https://github.com/bcgsc/abyss
I came across a similar solution in this Google Groups thread. There, too, the secondary/supplementary alignments were causing the problem.