bowtie2 icon indicating copy to clipboard operation
bowtie2 copied to clipboard

`--sam-append-comment` adds raw header as sam comment

Open matthdsm opened this issue 3 years ago • 3 comments

Hi,

I noticed a part of the fastq header is interpreted as a fastq comment on alignment. version: 2.4.5 command: bowtie2 --local --fast-local -p 10 -x /Shared/references/Hsapiens/hg38-noalt/bowtie2/hg38-noalt.fa --sam-append-comment -1 fq_1.fastq -2 fq_2.fastq fastq

==> fq_1.fastq <==
@A00785:324:H5GN2DMXY:1:1101:9670:1047:AAAGATATT 1:N:0:CGGTTGTT+GTGGTATG
GCCAAGCCAGGCACCTGCAGCCTCTTTTTTTTTTTCTGGTTGGGACACAT
+
:FFFFFFF:FFFFFFFFFFFF:FFFFFFFFFFF:F::FF:FFFFFF:FF,

==> fq_2.fastq <==
@A00785:324:H5GN2DMXY:1:1101:9670:1047:AAAGATATT 2:N:0:CGGTTGTT+GTGGTATG
CCAGGCCTTTGTGGTTCAGCAGTTGTATGGTTTCCACCCTCAGAAACTTAC
+
FFFF:FFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

sam output

A00785:324:H5GN2DMXY:1:1101:9670:1047:AAAGATATT 99      chr17   34991162        44      50M     =       34991294        182     GCCAAGCCAGGCACCTGCAGCCTCTTTTTTTTTTTCTGGTTGGGACACAT      :FFFFFFF:FFFFFFFFFFFF:FFFFFFFFFFF:F::FF:FFFFFF:FF,      AS:i:100        XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:50 YS:i:100        YT:Z:CP 1:N:0:CGGTTGTT+GTGGTATG
A00785:324:H5GN2DMXY:1:1101:9670:1047:AAAGATATT 147     chr17   34991294        44      1S50M   =       34991162        -182    GTAAGTTTCTGAGGGTGGAAACCATACAACTGCTGAACCACAAAGGCCTGG     FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF:FFFF     AS:i:100        XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:50 YS:i:100        YT:Z:CP 2:N:0:CGGTTGTT+GTGGTATG

As you can see, the index part of the fastq header 1:N:0:CGGTTGTT+GTGGTATG is interpreted as a sam comment instead of being added under the BC:Z: tag

Illumina defines their fastq header as

@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos>:<UMI> <read>:<is filtered>:<control number>:<index>

so I suppose when parsing the header for comments, the first whitespace should be ignored.

Matthias

matthdsm avatar Feb 11 '22 14:02 matthdsm

I must note bwa has the same issue

matthdsm avatar Feb 11 '22 14:02 matthdsm

I pushed a change to bug_fixes that seeks to address this.

 ./bowtie2-align-s -x example/index/lambda_virus test.fq --sam-append-comment
Read is: 1
1 reads; of these:
  1 (100.00%) were unpaired; of these:
    1 (100.00%) aligned 0 times
    0 (0.00%) aligned exactly 1 time
    0 (0.00%) aligned >1 times
0.00% overall alignment rate
@HD	VN:1.5	SO:unsorted	GO:query
@SQ	SN:gi|9626243|ref|NC_001416.1|	LN:48502
@PG	ID:bowtie2	PN:bowtie2	VN:2.4.5	CL:"/bowtie2/bowtie2-align-s -x example/index/lambda_virus test.fq --sam-append-comment"
A00785:324:H5GN2DMXY:1:1101:9670:1047:AAAGATATT	4	*	0	0	*	*	0	0	GCCAAGCCAGGCACCTGCAGCCTCTTTTTTTTTTTCTGGTTGGGACACAT	:FFFFFFF:FFFFFFFFFFFF:FFFFFFFFFFF:F::FF:FFFFFF:FF,	YT:Z:UU	BC:Z:1:N:0:CGGTTGTT+GTGGTATG

Parsing assumes that the "index part" of the FASTQ header appears immediately after the read name and uses the following guidelines from Illumina.

  <read>   Numerical  Read number. 1 can be single
      read or Read 2 of paired-end.
  <is filtered>   Y or N  Y if the read is filtered (did
      not pass), N otherwise.
  <control number>   Numerical  0 when none of the control bits
      are on, otherwise it is an even
      number.

      On HiSeq X and NextSeq systems,
      control specification is not
      performed and this number is
      always 0.
  <index>   Restricted characters:  Index of the read.
    A/T/G/C/N

ch4rr0 avatar Feb 11 '22 22:02 ch4rr0

Hi,

Thanks for the quick reply and fix! Looks great 😄 Now just to nitpick a bit, but do you suppose it makes sense to keep the read/filtered/control number info in the barcode tag instead of just the sequence?

matthdsm avatar Feb 14 '22 12:02 matthdsm

This has already been released to production.

ch4rr0 avatar Nov 09 '23 18:11 ch4rr0