samtools icon indicating copy to clipboard operation
samtools copied to clipboard

Should qname affect mpileup reults?

Open smlchen opened this issue 2 months ago • 6 comments

Are you using the latest version of samtools and HTSlib? If not, please specify.

(run samtools --version)

root@595261ff27d6:/opt/notebooks# samtools --version
samtools 1.19.2
Using htslib 1.19.1
Copyright (C) 2024 Genome Research Ltd.

Samtools compilation details:
    Features:       build=configure curses=yes 
    CC:             gcc
    CPPFLAGS:       
    CFLAGS:         -Wall -g -O2
    LDFLAGS:        
    HTSDIR:         htslib-1.19.1
    LIBS:           
    CURSES_LIB:     -lncursesw

HTSlib compilation details:
    Features:       build=configure libcurl=yes S3=yes GCS=yes libdeflate=no lzma=yes bzip2=yes plugins=no htscodecs=1.6.0
    CC:             gcc
    CPPFLAGS:       
    CFLAGS:         -Wall -g -O2 -fvisibility=hidden
    LDFLAGS:        -fvisibility=hidden 

HTSlib URL scheme handlers present:
    built-in:    preload, data, file
    S3 Multipart Upload:         s3w, s3w+https, s3w+http
    Amazon S3:   s3+https, s3+http, s3
    Google Cloud Storage:        gs+http, gs+https, gs
    libcurl:     imaps, pop3, http, smb, gopher, sftp, ftps, imap, smtp, smtps, rtsp, scp, ftp, telnet, rtmp, ldap, https, ldaps, smbs, tftp, pop3s, dict
    crypt4gh-needed:     crypt4gh
    mem:         mem

Please describe your environment.

  • OS (run uname -sr on Linux/Mac OS or wmic os get Caption, Version on Windows)
  • machine architecture (run uname -m on Linux/Mac OS or wmic os get OSArchitecture on Windows)
  • compiler (run gcc --version or clang --version)
root@595261ff27d6:/opt/notebooks# uname -sr
Linux 5.15.0-1060-aws
root@595261ff27d6:/opt/notebooks# uname -m
x86_64
root@595261ff27d6:/opt/notebooks# gcc --version
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Please specify the steps taken to generate the issue, the command you are running and the relevant output.

I have some Illumina NovaSeq 6000 sequencing data, and I demultiplex the data two times, the first time with BCL Convert and the other time with bcl2fastq. For one of the samples, I put the FASTQs from demultiplexing through the same version of bwa and samblaster, resulting in two BAM files for analysis. I compare every single record in the BAM files using samtools view and every record is identical except for read names, which is not surprising. I run the BAM files through both samtools stats and samtools flagstats, and in both cases, the outputs are identical.

However, this is not the case with samtools mpileup. In other words, samtools mpileup produces different results. I zoom in on a position in the genome and look at the reads from each BAM file that were used to produce the outputs with --output-QNAME. With a mapping between the read names from BCL Convert and bcl2fastq, I notice that reads used in one case were not used in the other. It seems to me that read names play a role in what reads are used to produce mpileup results. Is this an expected behavior of samtools? Because I was not expecting the differences in read names to affect how mpileup processes reads.

smlchen avatar May 04 '24 07:05 smlchen

The query names are used for read-pairing, but the names could be anything at all. It's just important that they match. Note SAM doesn't attach any meaning to specific suffixes used by some tooling, so foo/1 and foo/2 are two different names in SAM and won't get paired.

I'm not familier with the differences between bclconvert and bcl2fastq. Do you have examples of the types of query name differences they produce?

jkbonfield avatar May 07 '24 08:05 jkbonfield

Here are first five lines of the mapping between query names from bcl2fastq and those from bclconvert, in the order they show up in each BAM. Note that BAMs are coordinate-sorted.

bcl2fastq,bclconvert
A01360:22:H3KJ7DRXY:2:2211:0:3933957,A01360:22:H3KJ7DRXY:2:2211:667:4807
A01360:22:H3KJ7DRXY:2:2178:0:2525347,A01360:22:H3KJ7DRXY:2:2178:430:2651
A01360:22:H3KJ7DRXY:1:2257:0:1851146,A01360:22:H3KJ7DRXY:1:2257:264:3901
A01360:22:H3KJ7DRXY:1:2257:0:1877914,A01360:22:H3KJ7DRXY:1:2257:269:1319
A01360:22:H3KJ7DRXY:2:2211:0:3933957,A01360:22:H3KJ7DRXY:2:2211:667:4807

Here are some examples of the mpileup output. Both the BAM from bclconvert and the BAM from bcl2fastq have the same number of reads covering this position. However, the read bases are not exactly the same. Neither are the base quality strings. I thought this would be due to the order differences in the query names, but that does not seem to be the case because the number of each symbol is not the same in both cases. Another thing I checked is whether the reads used for the position are the same, and the answer is yes after mapping the read names from one to the other. However, it does look like sometimes the forward strand of a read pair is used and other times the reverse strand. This is what I really meant when I said in my original post that I notice reads used in one case were not used in the other.

Is this expected?

Example 1: Position 909309 of Chromosome 1

bclconvert: counter = {'.': 60, ',': 50, '$': 2} .$.$,,...,........,,,,.,...,.,,,...,.........,....,,,,..,,,,,,.,,,,,.,.,,,...,,.,,,.,,,.,.......,...,,...,,.,.,,

Counter = {'.': 60, ',': 50, '$': 2}

bcl2fastq: counter = {'.': 64, ',': 46, '$': 2} .$.$..,....,.........,,,,..,..,.,.,....,........,...,,,,..,,,,,.,,,.,.,...,,,,.,,,.,,,.,,,.......,...,,...,.,.,,

Example 2: Position 949608 of Chromosome 1

bclconvert: counter = {'.': 44, ',': 37, 'A': 36, 'a': 33, '$': 2, '^': 1, ']': 1} a$,$..a,,a.a...aAa,aAA,,aAaAAAA,aaAA,..Aaaaa.a,A,a,a.a,a....A...A,AA..AAAAAA.,,..A...a.,A,.A,,A.a...a...,.aaa.AA,,..,.,,,,,AA,,aA.AA,,.a.,Aaa.a,AA.a,,a^],

bcl2fastq: counter = {'.': 44, ',': 37, 'A': 35, 'a': 34, '$': 2, '^': 2, ']': 2} a$,$..aA,,a.a...aAa,aA,,aAaAAAAA,aaAA,..Aaaaa.a,A,a,aA.a,a..Aa.....AAA..AaAA.,.,a..A....a.a,A,.A,A...a.,.,.aaa.AA,...,,,,.,AA,aA.AA,,,.a,,Aaa.a,A,A.,,^],^],

smlchen avatar May 09 '24 02:05 smlchen

Another thing I checked is whether the reads used for the position are the same, and the answer is yes after mapping the read names from one to the other

Actually, I looked at another position and it turned out that the reads from bcl2fastq BAM and those from bclconvert BAM used for this position are not identical. There are 10 instances of this phenotype out of 13146 different positions.

smlchen avatar May 09 '24 03:05 smlchen

Are you sure this isn't an issue of same input data (bar name), but different alignment output data? Some aligners are not stable and given the exact same input will produce differing output every run.

jkbonfield avatar May 09 '24 07:05 jkbonfield

That's what I first thought too, but I compared every single record in both BAM files and everything matched line by line except for the query names. Plus, samtools stats and samtools flagstats produced identical results

smlchen avatar May 09 '24 10:05 smlchen

However, it does look like sometimes the forward strand of a read pair is used and other times the reverse strand. This is what I really meant when I said in my original post that I notice reads used in one case were not used in the other.

Ahhh this is now ringing a bell. Sure enough https://github.com/samtools/htslib/pull/1273 shows why.

Htslib removes one read when pairs overlap, unless mpileup --ignore-overlaps-removal is used. Previously it always did this based on quality, but where READ1 is consistently higher quality than READ2 this causes a heavy strand bias. This was particularly bad for amplicon sequencing strategies. Instead we now remove in a semi-random fashion so we do not introduce strand biases. To ensure consistency between repeated runs, the randomness is generated by hashing the read name (a basic naive bit-scrambling).

So I can now see how renaming reads could cause different subsets to be chosen when removing overlapping bases. I initially didn't think of this as it's very rare we hit this code anyway given overlapping read pairs aren't the norm for most sequencing protocols.

jkbonfield avatar May 13 '24 13:05 jkbonfield