PureCN icon indicating copy to clipboard operation
PureCN copied to clipboard

PureCN unintentionally filters out reads with supplementary alignments

Open tinyheero opened this issue 1 year ago • 3 comments

In the .calculateBamCoverageByInterval() function:

.calculateBamCoverageByInterval <- function(intervalGr, ...) {
  param <- ScanBamParam(what = c("pos", "qwidth", "flag"),
              which = intervalGr,
              flag = scanBamFlag(isUnmappedQuery = FALSE,
                           isNotPassingQualityControls = FALSE,
                           isSecondaryAlignment = FALSE,
                           isDuplicate = NA
                   ),
              ...
           )

  xAll <- scanBam(bam.file, index = index.file, param = param)
  xDupFiltered <- .filterDuplicates(xAll)

There is no explicit use of the isSupplementaryAlignment argument in the scanBamFlag() call leading one to believe that reads with supplementary alignments are not filtered out. However the call to the .filterDuplicates() function:

.filterDuplicates <- function(x) {
    lapply(x, function(y) {
        idx <- y$flag < 1024
        lapply(y, function(z) z[idx])
    })
}

Will retain reads that have a SAM flag less than 1024.

However, reads with supplementary alignments have a SAM flag of 2048.

$> samtools flag
About: Convert between textual and numeric flag representation
Usage: samtools flags FLAGS...

Each FLAGS argument is either an INT (in decimal/hexadecimal/octal) representing
a combination of the following numeric flag values, or a comma-separated string
NAME,...,NAME representing a combination of the following flag names:

   0x1     1  PAIRED         paired-end / multiple-segment sequencing technology
   0x2     2  PROPER_PAIR    each segment properly aligned according to aligner
   0x4     4  UNMAP          segment unmapped
   0x8     8  MUNMAP         next segment in the template unmapped
  0x10    16  REVERSE        SEQ is reverse complemented
  0x20    32  MREVERSE       SEQ of next segment in template is rev.complemented
  0x40    64  READ1          the first segment in the template
  0x80   128  READ2          the last segment in the template
 0x100   256  SECONDARY      secondary alignment
 0x200   512  QCFAIL         not passing quality controls or other filters
 0x400  1024  DUP            PCR or optical duplicate
 0x800  2048  SUPPLEMENTARY  supplementary alignment

This means reads with supplementary alignments will always be filtered out regardless of whether they are a PCR duplicate or not when the filterDuplicates function is called.

This doesn't seem intentional?

tinyheero avatar Aug 06 '24 22:08 tinyheero

Thanks so much @tinyheero. This code is indeed ancient and I'm not remembering why I implemented it that way. I will look into the impact. Gut feeling should not matter, right? Or do you have a concern here?

lima1 avatar Aug 07 '24 14:08 lima1

You're right in that the effects are minor as supplementary alignments make a small proportion of the total reads.

The reason why I spotted it was because I noticed several regions with much lower counts that expected and it was simply because all the reads with supplementary alignments were not being counted. Regions with many supplementary alignments tend to be associated with SV events so it makes sense to actually retain them in my opinion.

tinyheero avatar Aug 08 '24 16:08 tinyheero

I'll go through my collections of normal and cancer test samples. If it's generally safe to add them even when normal samples do not, it probably makes indeed sense to retain.

lima1 avatar Aug 08 '24 16:08 lima1