Error: Duplicate read ID - sga-bam2de.pl

Open a-lud opened this issue 7 years ago • 2 comments

Hi,

I'm trying to build scaffolds from three matepair libraries with 3kb, 5kb and 8kb inserts (currently in BAM format). The person who generated these libraries has followed all steps involved from the example scripts you have provided up to the scaffolding stage.

When I run the sga-bam2de.pl script on the libraries, using the same settings as the "Scaffolding multiple libraries" wiki page, an error like the one below is generated for each of the three files; only the duplicate read ID differs.

abyss-fixmate -h KLS0691b.matepair.3kb.sorted.tmp.hist /localscratch/path/to/data/pe/KLS0691b.matepair.5kb.sorted.bam | samtools view -Sb - > KLS0691b.matepair.3kb.sorted.diffcontigs.bam
error: duplicate read ID `HWI-ST1408:124:CA3J7ANXX:4:1309:4959:8862/1'
[samopen] SAM header is present: 2455895 sequences.
[sam_read1] reference 'ID:bwa   PN:bwa  VN:0.7.13-r1126 CL:bwa mem -t 8 bwa_contigs1_index/index ../1_trimmed_AdapterRemoval/KLS0691b_5KB_GCCAAT_R1_t.fastq.gz ../1_trimmed_AdapterRemoval/KLS0691b_5KB_GCCAAT_R2_t.fastq.gz
contig-1172471  LN:289
@SQ     SN:contig-1223316       LN:242
@SQ     SN:contig-9458!' is recognized as '*'.
[main_samview] truncated file.
awk '$2 >= 3' KLS0691b.matepair.3kb.sorted.tmp.hist > KLS0691b.matepair.3kb.sorted.hist
awk: cmd. line:1: fatal: cannot open file `KLS0691b.matepair.3kb.sorted.tmp.hist' for reading (No such file or directory)
samtools sort KLS0691b.matepair.3kb.sorted.diffcontigs.bam KLS0691b.matepair.3kb.sorted.diffcontigs.sorted
DistanceEst -s 200 --mind -99 -n 5 -k 99 -j 1 -o KLS0691b.matepair.3kb.sorted.de KLS0691b.matepair.3kb.sorted.hist -l 100 KLS0691b.matepair.3kb.sorted.diffcontigs.sorted.bam
error: the histogram `KLS0691b.matepair.3kb.sorted.hist' is empty

It seems the duplicate read ID is what's triggering the error, but I am unsure how to solve it. Any help or insight would be appreciated.

Cheers

a-lud avatar Apr 06 '17 00:04 a-lud

I had the same problem. I fixed it by removing all secondary or supplementary alignments from the BAM files, e.g. samtools view -b -h -F 0x800 -o filtered.bam input.bam (or -F 0x100 if you used the -M flag when aligning with BWA, which marks shorter split hits as secondary instead of supplementary).
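For anyone unsure what the two -F values mean: they are bit masks over the SAM FLAG field. Here is a minimal Python sketch of the filtering rule, assuming the standard SAM flag bits (0x100 for secondary, 0x800 for supplementary). The function name `passes_filter` and the example flag values are made up for illustration; this is not samtools code. Combining both bits as 0x900 should drop either kind of record in one pass.

```python
SECONDARY = 0x100      # SAM FLAG bit: secondary alignment
SUPPLEMENTARY = 0x800  # SAM FLAG bit: supplementary alignment
BOTH = SECONDARY | SUPPLEMENTARY  # 0x900, excludes either kind


def passes_filter(flag: int, exclude_mask: int) -> bool:
    """Mimic `samtools view -F exclude_mask`:
    keep a record only if none of the masked bits are set."""
    return (flag & exclude_mask) == 0


if __name__ == "__main__":
    # Illustrative flags: 99 is a primary first-in-pair record,
    # 355 = 99 + 0x100 (secondary), 2147 = 99 + 0x800 (supplementary).
    for flag in (99, 355, 2147):
        print(flag, hex(flag), passes_filter(flag, BOTH))
```

Note that -F 0x800 alone leaves secondary records in place, and -F 0x100 alone leaves supplementary ones, so which mask you need depends on how the aligner was run.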

I didn't dig too deeply, but I think the problem is this: when only one of the reads in a pair has a secondary or supplementary alignment, abyss-fixmate treats that extra record as a primary alignment and reports it as a duplicate read ID.
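To illustrate that hypothesis, here is a small Python sketch. It is not abyss-fixmate's actual logic, only a toy model that assumes a tool keys each record on read name plus mate number and ignores the secondary/supplementary bits. The read name `read_A`, the flag values, and the helper names are hypothetical.

```python
from collections import Counter

FIRST_IN_PAIR = 0x40         # SAM FLAG bit: first read of the pair
NON_PRIMARY = 0x100 | 0x800  # secondary or supplementary


def mate_id(name: str, flag: int) -> str:
    """Build an ID such as 'read_A/1' from the read name and the pair bit."""
    return name + ("/1" if flag & FIRST_IN_PAIR else "/2")


def duplicate_ids(records, drop_non_primary=False):
    """Return the mate IDs that occur more than once in (name, flag) records."""
    counts = Counter()
    for name, flag in records:
        if drop_non_primary and flag & NON_PRIMARY:
            continue  # same effect as filtering with -F 0x900 beforehand
        counts[mate_id(name, flag)] += 1
    return sorted(rid for rid, n in counts.items() if n > 1)


# One pair plus a supplementary record for the first mate (99 + 0x800 = 2147).
records = [("read_A", 99), ("read_A", 147), ("read_A", 2147)]

if __name__ == "__main__":
    print(duplicate_ids(records))                         # unfiltered input
    print(duplicate_ids(records, drop_non_primary=True))  # filtered input
```

In this toy model the supplementary record collides with the primary first mate, and dropping non-primary records removes the collision, which matches the behaviour seen after filtering the BAM files.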

You can try reporting it at https://github.com/bcgsc/abyss

brisk022 avatar Jun 01 '17 06:06 brisk022

I came across a similar solution in this Google Groups thread. The secondary/supplementary alignments were causing the problem.

a-lud avatar Jun 01 '17 06:06 a-lud