samtools
samtools copied to clipboard
Should qname affect mpileup reults?
Are you using the latest version of samtools and HTSlib? If not, please specify.
(run samtools --version
)
root@595261ff27d6:/opt/notebooks# samtools --version
samtools 1.19.2
Using htslib 1.19.1
Copyright (C) 2024 Genome Research Ltd.
Samtools compilation details:
Features: build=configure curses=yes
CC: gcc
CPPFLAGS:
CFLAGS: -Wall -g -O2
LDFLAGS:
HTSDIR: htslib-1.19.1
LIBS:
CURSES_LIB: -lncursesw
HTSlib compilation details:
Features: build=configure libcurl=yes S3=yes GCS=yes libdeflate=no lzma=yes bzip2=yes plugins=no htscodecs=1.6.0
CC: gcc
CPPFLAGS:
CFLAGS: -Wall -g -O2 -fvisibility=hidden
LDFLAGS: -fvisibility=hidden
HTSlib URL scheme handlers present:
built-in: preload, data, file
S3 Multipart Upload: s3w, s3w+https, s3w+http
Amazon S3: s3+https, s3+http, s3
Google Cloud Storage: gs+http, gs+https, gs
libcurl: imaps, pop3, http, smb, gopher, sftp, ftps, imap, smtp, smtps, rtsp, scp, ftp, telnet, rtmp, ldap, https, ldaps, smbs, tftp, pop3s, dict
crypt4gh-needed: crypt4gh
mem: mem
Please describe your environment.
- OS (run
uname -sr
on Linux/Mac OS orwmic os get Caption, Version
on Windows) - machine architecture (run
uname -m
on Linux/Mac OS orwmic os get OSArchitecture
on Windows) - compiler (run
gcc --version
orclang --version
)
root@595261ff27d6:/opt/notebooks# uname -sr
Linux 5.15.0-1060-aws
root@595261ff27d6:/opt/notebooks# uname -m
x86_64
root@595261ff27d6:/opt/notebooks# gcc --version
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Please specify the steps taken to generate the issue, the command you are running and the relevant output.
I have some Illumina NovaSeq 6000 sequencing data, and I demultiplex the data two times, the first time with BCL Convert and the other time with bcl2fastq. For one of the samples, I put the FASTQs from demultiplexing through the same version of bwa and samblaster, resulting in two BAM files for analysis. I compare every single record in the BAM files using samtools view
and every record is identical except for read names, which is not surprising. I run the BAM files through both samtools stats
and samtools flagstats
, and in both cases, the outputs are identical.
However, this is not the case with samtools mpileup
. In other words, samtools mpileup
produces different results. I zoom in on a position in the genome and look at the reads from each BAM file that were used to produce the outputs with --output-QNAME
. With a mapping between the read names from BCL Convert and bcl2fastq, I notice that reads used in one case were not used in the other. It seems to me that read names play a role in what reads are used to produce mpileup results. Is this an expected behavior of samtools? Because I was not expecting the differences in read names to affect how mpileup processes reads.
The query names are used for read-pairing, but the names could be anything at all. It's just important that they match. Note SAM doesn't attach any meaning to specific suffixes used by some tooling, so foo/1 and foo/2 are two different names in SAM and won't get paired.
I'm not familier with the differences between bclconvert and bcl2fastq. Do you have examples of the types of query name differences they produce?
Here are first five lines of the mapping between query names from bcl2fastq and those from bclconvert, in the order they show up in each BAM. Note that BAMs are coordinate-sorted.
bcl2fastq,bclconvert
A01360:22:H3KJ7DRXY:2:2211:0:3933957,A01360:22:H3KJ7DRXY:2:2211:667:4807
A01360:22:H3KJ7DRXY:2:2178:0:2525347,A01360:22:H3KJ7DRXY:2:2178:430:2651
A01360:22:H3KJ7DRXY:1:2257:0:1851146,A01360:22:H3KJ7DRXY:1:2257:264:3901
A01360:22:H3KJ7DRXY:1:2257:0:1877914,A01360:22:H3KJ7DRXY:1:2257:269:1319
A01360:22:H3KJ7DRXY:2:2211:0:3933957,A01360:22:H3KJ7DRXY:2:2211:667:4807
Here are some examples of the mpileup output. Both the BAM from bclconvert and the BAM from bcl2fastq have the same number of reads covering this position. However, the read bases are not exactly the same. Neither are the base quality strings. I thought this would be due to the order differences in the query names, but that does not seem to be the case because the number of each symbol is not the same in both cases. Another thing I checked is whether the reads used for the position are the same, and the answer is yes after mapping the read names from one to the other. However, it does look like sometimes the forward strand of a read pair is used and other times the reverse strand. This is what I really meant when I said in my original post that I notice reads used in one case were not used in the other.
Is this expected?
Example 1: Position 909309 of Chromosome 1
bclconvert: counter = {'.': 60, ',': 50, '$': 2}
.$.$,,...,........,,,,.,...,.,,,...,.........,....,,,,..,,,,,,.,,,,,.,.,,,...,,.,,,.,,,.,.......,...,,...,,.,.,,
Counter = {'.': 60, ',': 50, '$': 2}
bcl2fastq: counter = {'.': 64, ',': 46, '$': 2}
.$.$..,....,.........,,,,..,..,.,.,....,........,...,,,,..,,,,,.,,,.,.,...,,,,.,,,.,,,.,,,.......,...,,...,.,.,,
Example 2: Position 949608 of Chromosome 1
bclconvert: counter = {'.': 44, ',': 37, 'A': 36, 'a': 33, '$': 2, '^': 1, ']': 1}
a$,$..a,,a.a...aAa,aAA,,aAaAAAA,aaAA,..Aaaaa.a,A,a,a.a,a....A...A,AA..AAAAAA.,,..A...a.,A,.A,,A.a...a...,.aaa.AA,,..,.,,,,,AA,,aA.AA,,.a.,Aaa.a,AA.a,,a^],
bcl2fastq: counter = {'.': 44, ',': 37, 'A': 35, 'a': 34, '$': 2, '^': 2, ']': 2}
a$,$..aA,,a.a...aAa,aA,,aAaAAAAA,aaAA,..Aaaaa.a,A,a,aA.a,a..Aa.....AAA..AaAA.,.,a..A....a.a,A,.A,A...a.,.,.aaa.AA,...,,,,.,AA,aA.AA,,,.a,,Aaa.a,A,A.,,^],^],
Another thing I checked is whether the reads used for the position are the same, and the answer is yes after mapping the read names from one to the other
Actually, I looked at another position and it turned out that the reads from bcl2fastq BAM and those from bclconvert BAM used for this position are not identical. There are 10 instances of this phenotype out of 13146 different positions.
Are you sure this isn't an issue of same input data (bar name), but different alignment output data? Some aligners are not stable and given the exact same input will produce differing output every run.
That's what I first thought too, but I compared every single record in both BAM files and everything matched line by line except for the query names. Plus, samtools stats
and samtools flagstats
produced identical results
However, it does look like sometimes the forward strand of a read pair is used and other times the reverse strand. This is what I really meant when I said in my original post that I notice reads used in one case were not used in the other.
Ahhh this is now ringing a bell. Sure enough https://github.com/samtools/htslib/pull/1273 shows why.
Htslib removes one read when pairs overlap, unless mpileup --ignore-overlaps-removal
is used. Previously it always did this based on quality, but where READ1 is consistently higher quality than READ2 this causes a heavy strand bias. This was particularly bad for amplicon sequencing strategies. Instead we now remove in a semi-random fashion so we do not introduce strand biases. To ensure consistency between repeated runs, the randomness is generated by hashing the read name (a basic naive bit-scrambling).
So I can now see how renaming reads could cause different subsets to be chosen when removing overlapping bases. I initially didn't think of this as it's very rare we hit this code anyway given overlapping read pairs aren't the norm for most sequencing protocols.