bowtie2
bowtie2 copied to clipboard
`--sam-append-comment` adds raw header as sam comment
Hi,
I noticed a part of the fastq header is interpreted as a fastq comment on alignment.
version: 2.4.5
command: bowtie2 --local --fast-local -p 10 -x /Shared/references/Hsapiens/hg38-noalt/bowtie2/hg38-noalt.fa --sam-append-comment -1 fq_1.fastq -2 fq_2.fastq
fastq
==> fq_1.fastq <==
@A00785:324:H5GN2DMXY:1:1101:9670:1047:AAAGATATT 1:N:0:CGGTTGTT+GTGGTATG
GCCAAGCCAGGCACCTGCAGCCTCTTTTTTTTTTTCTGGTTGGGACACAT
+
:FFFFFFF:FFFFFFFFFFFF:FFFFFFFFFFF:F::FF:FFFFFF:FF,
==> fq_2.fastq <==
@A00785:324:H5GN2DMXY:1:1101:9670:1047:AAAGATATT 2:N:0:CGGTTGTT+GTGGTATG
CCAGGCCTTTGTGGTTCAGCAGTTGTATGGTTTCCACCCTCAGAAACTTAC
+
FFFF:FFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
sam output
A00785:324:H5GN2DMXY:1:1101:9670:1047:AAAGATATT 99 chr17 34991162 44 50M = 34991294 182 GCCAAGCCAGGCACCTGCAGCCTCTTTTTTTTTTTCTGGTTGGGACACAT :FFFFFFF:FFFFFFFFFFFF:FFFFFFFFFFF:F::FF:FFFFFF:FF, AS:i:100 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:50 YS:i:100 YT:Z:CP 1:N:0:CGGTTGTT+GTGGTATG
A00785:324:H5GN2DMXY:1:1101:9670:1047:AAAGATATT 147 chr17 34991294 44 1S50M = 34991162 -182 GTAAGTTTCTGAGGGTGGAAACCATACAACTGCTGAACCACAAAGGCCTGG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF:FFFF AS:i:100 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:50 YS:i:100 YT:Z:CP 2:N:0:CGGTTGTT+GTGGTATG
As you can see, the index part of the fastq header 1:N:0:CGGTTGTT+GTGGTATG is interpreted as a sam comment instead of being added under the BC:Z: tag
Illumina defines their fastq header as
@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos>:<UMI> <read>:<is filtered>:<control number>:<index>
so I suppose when parsing the header for comments, the first whitespace should be ignored.
Matthias
I must note bwa has the same issue
I pushed a change to bug_fixes that seeks to address this.
./bowtie2-align-s -x example/index/lambda_virus test.fq --sam-append-comment
Read is: 1
1 reads; of these:
1 (100.00%) were unpaired; of these:
1 (100.00%) aligned 0 times
0 (0.00%) aligned exactly 1 time
0 (0.00%) aligned >1 times
0.00% overall alignment rate
@HD VN:1.5 SO:unsorted GO:query
@SQ SN:gi|9626243|ref|NC_001416.1| LN:48502
@PG ID:bowtie2 PN:bowtie2 VN:2.4.5 CL:"/bowtie2/bowtie2-align-s -x example/index/lambda_virus test.fq --sam-append-comment"
A00785:324:H5GN2DMXY:1:1101:9670:1047:AAAGATATT 4 * 0 0 * * 0 0 GCCAAGCCAGGCACCTGCAGCCTCTTTTTTTTTTTCTGGTTGGGACACAT :FFFFFFF:FFFFFFFFFFFF:FFFFFFFFFFF:F::FF:FFFFFF:FF, YT:Z:UU BC:Z:1:N:0:CGGTTGTT+GTGGTATG
Parsing assumes that the "index part" of the FASTQ header appears immediately after the read name and uses the following guidelines from Illumina.
<read> Numerical Read number. 1 can be single
read or Read 2 of paired-end.
<is filtered> Y or N Y if the read is filtered (did
not pass), N otherwise.
<control number> Numerical 0 when none of the control bits
are on, otherwise it is an even
number.
On HiSeq X and NextSeq systems,
control specification is not
performed and this number is
always 0.
<index> Restricted characters: Index of the read.
A/T/G/C/N
Hi,
Thanks for the quick reply and fix! Looks great 😄 Now just to nitpick a bit, but do you suppose it makes sense to keep the read/filtered/control number info in the barcode tag instead of just the sequence?
This has already been released to production.