slamdunk
Extra tags + HISAT3 +
Hola -
So what I want to do is reproduce GRAND-slam's method for calculating NTR without having to use GRAND-slam.
I think the tags generated by NGM would easily give the minimal statistics needed to calculate it directly with the same beta-binomial approach, without the additional faff of remapping, building an aligner, and calculating everything myself (because I am a lazy person):
TC:i:
Number of T>C mismatches in a read (A>G if read is on reverse strand)
RA:Z:
Comma-separated integer array, each position marking a specific conversion type.
MP:Z:
Comma-separated array of mismatch positions, each position 3 colon-separated values in the format of <type>:<read position>:<reference position> where type is the same as in the RA:Z tag.
But NGM is a non-splice-aware, single-end-focused tool, and my interest is in comparing spliced and unspliced reads, so I couldn't really just use those tags from an NGM-aligned BAM.
Back in 2019, in #58, you said "Hisat team should be implementing our strategy" (in a couple months, lol XD).
And in issue #99 you say "HISAT-3" should generate the necessary tags to do downstream analysis.
From eyeballing the HISAT-3N sam files and reading their docs the new tags brought in are
Extra SAM tags generated by HISAT-3N:
Yf:i:<N>: Number of conversions are detected in the read.
YZ:A:<A>: The value + or – indicate the read is mapped to REF-3N (+) or REF-RC-3N (-).
So I can see that Yf:i == TC:i.
But what about RA:Z and MP:Z? Were these NGM tags, or did one of the other dunks produce them later on?
Sorry for the question that could probably be answered by reading the docs more thoroughly.
Cheers and thanks
Hi @aleighbrown - those tags you are talking about are indeed produced by NGM and used by the other dunks later on - specifically MP:Z to read out the mismatch positions to extract the T>C conversions.
I am currently evaluating HISAT-3N only outside of the SLAMdunk framework, so I would first have to look into the BAM tags myself to see whether something similar to MD:Z is provided.
The tags look something like this:
A00420:310:HWVTKDMXX:2:2164:6262:32784 419 chr1 14442 1 56M = 14442 0 TCTGGAAGCCTCTTAAGAACACAGTGGCGCAGGCTGGGTGGAGCCGTCCCCCCATG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFF AS:i:0 NH:i:3 XM:i:0 NM:i:0 MD:Z:56 YS:i:0 YZ:A:+ Yf:i:0 ZS:i:0 XN:i:0 XO:i:0 XG:i:0
And I'm pretty sure you could use the output of hisat-3n-table
ref pos strand convertedBaseQualities convertedBaseCount unconvertedBaseQualities unconvertedBaseCount
chr1 10007 + 0 FFFFF 5
chr1 10013 + 0 FFFFFF 6
chr1 10019 + 0 FFFFFF 6
chr1 10025 + 0 FFFFFF 6
chr1 10031 + 0 FFFFFF 6
chr1 10037 + 0 FFFF:F 6
chr1 10043 + 0 FFFFFF 6
chr1 10049 + 0 FFFFF: 6
chr1 10055 + 0 FFFF,F 6
for the slamdunk logic directly, but since you're only preserving the T>C (or A>G) changes, you're lacking the error measurement, which means calculating it from the MD:Z tag, as you say.
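A minimal sketch of the MD:Z route mentioned above: recovering every mismatch (and so the non-T>C error background) from a read's SEQ and its MD:Z tag. The function name `md_mismatches` is hypothetical, and the sketch assumes an ungapped alignment such as 65M; spliced, clipped, or indel-containing reads (the interesting case for HISAT-3N) would also need the CIGAR string to map read offsets correctly.

```python
import re

def md_mismatches(seq, md):
    """Return (offset, ref_base, read_base) for each mismatch in an MD:Z string.

    Offsets are 1-based from the leftmost aligned base.
    Assumption: the alignment is ungapped (CIGAR is a single M run).
    """
    mismatches = []
    offset = 0  # number of aligned bases consumed so far
    for run, base, deletion in re.findall(r"(\d+)|([A-Za-z])|\^([A-Za-z]+)", md):
        if run:
            offset += int(run)          # a run of matching bases
        elif base:
            offset += 1                 # a single mismatching reference base
            mismatches.append((offset, base.upper(), seq[offset - 1]))
        else:
            # Deletions (^ACG) break the simple offset arithmetic used here.
            raise ValueError("deletion in MD:Z; CIGAR-aware handling needed")
    return mismatches

# The NGM-aligned read quoted later in this thread (65M, MD:Z:19C45).
seq = "CCGAAATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGA"
print(md_mismatches(seq, "19C45"))  # [(20, 'C', 'G')]
```

Each tuple gives the reference base and the read base as stored in the BAM, so the strand flag still has to be consulted before calling something a T>C or A>G conversion.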
Just following up,
I've just about got a script that could add the RA:Z and MP:Z tags to a HISAT-3N-aligned BAM.
I was just perusing the tags on some BAMs I had previously aligned with NGM:
RA:Z:59,0,1,0,0,0,55,1,0,0,2,0,55,0,0,0,2,0,60,0,0,0,0,0,0
MP:Z:2:38:38,16:70:70,10:80:80,7:110:110,10:189:189,16:202:202
So for the MP tag we have 6 mismatches: 2:38:38 16:70:70 10:80:80 7:110:110 10:189:189 16:202:202
The first value is the type, and the second and third values both appear to be the position in the read? Is that correct? Cheers and thanks
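As a quick sanity check on that reading of the tags, the sketch below parses the two tag values quoted above and confirms that the mismatch types listed in MP:Z tally with the off-diagonal counts in RA:Z. The helper names `parse_ra` and `parse_mp` are hypothetical, and the only assumption is that RA:Z is a flat 25-element array indexed by the same type code that MP:Z uses.

```python
from collections import Counter

def parse_ra(value):
    """Split an RA:Z value into its 25 integer counts."""
    counts = [int(x) for x in value.split(",")]
    if len(counts) != 25:
        raise ValueError("expected 25 comma-separated counts")
    return counts

def parse_mp(value):
    """Split an MP:Z value into (type, read_position, reference_position) tuples."""
    return [tuple(int(x) for x in item.split(":")) for item in value.split(",")]

ra = parse_ra("59,0,1,0,0,0,55,1,0,0,2,0,55,0,0,0,2,0,60,0,0,0,0,0,0")
mp = parse_mp("2:38:38,16:70:70,10:80:80,7:110:110,10:189:189,16:202:202")

# Count how often each mismatch type occurs in MP:Z.
mp_types = Counter(code for code, _, _ in mp)

# Off-diagonal RA:Z entries (row != column in the 5x5 layout) are mismatches.
ra_mismatches = {i: n for i, n in enumerate(ra) if n and i // 5 != i % 5}

print(dict(mp_types) == ra_mismatches)
```

For this read the two tags agree: types 2 and 7 occur once each, and types 10 and 16 occur twice each.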
Hi,
A detailed description is in the supplement of the paper.
TC:i:
Number of T>C mismatches in a read (A>G if read is on reverse strand)
RA:Z:
Comma-separated integer array, each position marking a specific conversion type.
                 Read
Reference    A    C    G    T    N
A            0    1    2    3    4
C            5    6    7    8    9
G           10   11   12   13   14
T           15   16   17   18   19
N           20   21   22   23   24
MP:Z:
Comma-separated array of mismatch positions, each position 3 colon-separated
values in the format of <type>:<read position>:<reference position> where type is the
same as in the RA:Z tag
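The type code in the table can be computed rather than looked up. The sketch below assumes rows are the reference base and columns are the read base, which is an interpretation and not something the supplement text quoted here states outright; it is consistent with the NGM read quoted later in this thread (MD:Z:19C45, read base G, MP:Z type 7). The function names are hypothetical.

```python
BASES = "ACGTN"  # row and column order of the 5x5 table

def conversion_type(ref_base, read_base):
    """Encode a (reference, read) base pair as the RA:Z / MP:Z type code.

    Assumption: rows are reference bases and columns are read bases.
    """
    return 5 * BASES.index(ref_base) + BASES.index(read_base)

def decode_type(code):
    """Invert conversion_type: return (ref_base, read_base)."""
    return BASES[code // 5], BASES[code % 5]

print(conversion_type("T", "C"))  # 16
print(conversion_type("A", "G"))  # 2
print(decode_type(7))             # ('C', 'G')
```

Under that assumption, T>C is type 16 and its reverse-strand counterpart A>G is type 2, which are the two codes a TC:i-style count would need.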
Yep! I've read that supplement, but look at the MP tags produced by NGM, e.g.
MP:Z:2:38:38,16:70:70,10:80:80,7:110:110,10:189:189,16:202:202
How does a reference position of 38 make sense? My guess is that it is 38 counted from the leftmost position of the read,
e.g. for this read
A00420:113:HK5VMDRXX:1:2108:16116:23077 83 chr1 10541 0 65M = 10556 -50 CCGAAATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGA FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF RG:Z:TDPKD_1_12h AS:i:625 NM:i:1 NH:i:0 XI:f:0.9846 X0:i:0 XE:i:14 XR:i:65 MD:Z:19C45 TC:i:0 RA:Z:11,0,0,0,0,0,20,1,0,0,0,0,21,0,0,0,0,0,12,0,0,0,0,0,0 MP:Z:7:20:20
would the 20 in the reference position refer to chr1:10541 + 20?
Yes, sorry, those should indeed be relative positions.
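For turning such a relative position into a genomic coordinate, a small sketch follows. It assumes the MP:Z offsets are 1-based, which the example read suggests (MD:Z:19C45 puts 19 matches before the mismatch that MP:Z reports at 20), and that the alignment is ungapped; under those assumptions the coordinate is POS + offset - 1, not POS + offset. The function name is hypothetical.

```python
def mp_genomic_position(pos, ref_offset):
    """Convert an MP:Z reference offset to a 1-based genomic coordinate.

    pos        : 1-based leftmost mapping position (SAM POS field)
    ref_offset : relative reference position from MP:Z, assumed 1-based
    Assumption: ungapped alignment, so offsets map linearly onto the reference.
    """
    return pos + ref_offset - 1

# Example read from this thread: POS 10541, MP:Z:7:20:20
print(mp_genomic_position(10541, 20))  # 10560
```

If the offsets turned out to be 0-based, the `- 1` would simply be dropped; checking a handful of reads against their MD:Z tags would settle it.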