MACS
MACS copied to clipboard
Bug:
Describe the bug Bug in randsample
To Reproduce A00679:511:HWJGJDSXY:4:2104:21847:12399 147 Chr1A 89 29 98M = 113 -74 CCCTAACCTAAACCTACCTAAACCATACATTGTTCCATTGGCCAGAGGCTGTTCACCTTGGAGACCTGATGCGGTTATGAGTACGACCGGGCGTGAAC FFFF::FFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF XA:Z:Chr1A_part1,-183855,6M1D92M,1;Chr1A_part1,-361405,6M1D92M,1;Chr1A_part1,+321535,90M1D8M,1;Chr1A_part1,-227337,6M1D92M,1; MC:Z:25S73M MD:Z:98 PG:Z: NM:i:0 AS:i:98 XS:i:92 A00679:511:HWJGJDSXY:4:2104:21847:12399 99 Chr1A 113 29 25S73M = 89 74 CACGTGAAGGACCAGAGCAACAGCTATACATTGTTCCATTGGCCAGAGGCTGTTCACCTTGGAGACCTGATGCGGTTATGAGTACGACCGGGCGTGAA FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFF MC:Z:98M MD:Z:73 PG:Z: NM:i:0 AS:i:73 XS:i:73
Expected behavior Chr1A 88 186 But randsample give: Chr1A 88 162
Screenshots If applicable, add screenshots to help explain your problem.
System (please complete the following information):
- OS: Ubuntu 22.04.1
- Python version 3.10.6
- Numpy version 1.22.4
- MACS Version macs3 3.0.0b2 @0c161d5
Additional context The read pairs look outward. But a pysam-based script give 'Chr1A 88 186' which is consistent with manual check by IGV. Where the '162' was calculated from? Was it related to the soft clipping? I guess for a fragment, is it proper to just calculated minimum left and maximum right of the paired reads for each fragment?
When MACS reads a paired-end alignment, it will only look at the TLEN
value and smallest POS
value of the pair to decide where the leftmost and the rightmost position of the template are. As in this case, the absolute value of TLEN
is 74, and the leftmost position of the template is 89 (1-based in SAM and we will use 88 in MACS since we use 0-based and right-exclusive coordination). Therefore MACS read it as a template from position 88 to position 162. Yes, of course, it's not right since the first mate is mapped to plus strand from position 112 to 185( I use 0-based coordination as in BED format here), and the second mate is mapped to minus strand from position 88 to 186. And the fact of this odd orientation of read pair confused our algorithm. As you pointed out, this pair looks outward. Maybe the first 25 soft-clipped bases from the first mate (the second row) may be aligned so that the first mate should span from 87 to 185, and the second mate span from 88 to 186 (keep using 0-based coordination) -- look a bit more 'realistic'. Or the last 25bps of the second mate should be clipped away, so the second mate should span from 113 to 186. Since there are so many possibilities when an aligner decides when to do soft-clipping and the TLEN
value, I would recommend you use a customized script to parse the start and end position from the BAM/SAM file into a simpler BED format file and use that as input for MACS -- if this particular case is important for the peak calling. In reality, this should be a rare case I assume.