
Extra tags + HISAT3 +

Open aleighbrown opened this issue 4 years ago • 6 comments

Hola -

So what I want to do is reproduce GRAND-slam's method for calculating NTR without having to use GRAND-slam.

I think the tags generated by NGM would easily give the minimal statistics needed to calculate it directly with the same beta-binomial approach, without the additional faff of remapping, building an aligner, and calculating everything myself, because I am a lazy person:

TC:i:
Number of T>C mismatches in a read (A>G if read is on reverse strand)
RA:Z:
Comma-separated integer array, each position marking a specific conversion type.
MP:Z:
Comma-separated array of mismatch positions, each position 3 colon-separated values in the format of <type>:<read position>:<reference position> where type is the same as in the RA:Z tag.

But NGM is a non-splice-aware, single-end-focused tool, and my interest is in comparing spliced and unspliced reads....

so I couldn't really just use those tags from an NGM aligned bam.

So back in 2019, in #58, you said "Hisat team should be implementing our strategy" (in a couple months, lol XD)

And in issue #99 you say "HISAT-3" should generate the necessary tags to do downstream analysis

From eyeballing the HISAT-3N SAM files and reading their docs, the new tags brought in are:

Extra SAM tags generated by HISAT-3N:

    Yf:i:<N>: Number of conversions are detected in the read.

    YZ:A:<A>: The value + or – indicate the read is mapped to REF-3N (+) or REF-RC-3N (-).

So I can see that Yf:i: == TC:i:

But what about RA:Z and MP:Z? Were these tags NGM tags, or did one of the other dunks produce them later on?

Sorry for the question that could probably be answered by reading the docs more thoroughly

Cheers and thanks

aleighbrown avatar Oct 22 '21 08:10 aleighbrown

Hi @aleighbrown - those tags you are talking about are indeed produced by NGM and used by the other dunks later on - specifically MP:Z to read out the mismatch positions to extract the T>C conversions.

I am currently evaluating HISAT-3N only outside of the SLAMdunk framework, so I would first have to look into the BAM tags myself to see whether something similar to MD:Z is provided.

t-neumann avatar Oct 22 '21 11:10 t-neumann

The tags look something like this:

A00420:310:HWVTKDMXX:2:2164:6262:32784	419	chr1	14442	1	56M	=	14442	0	TCTGGAAGCCTCTTAAGAACACAGTGGCGCAGGCTGGGTGGAGCCGTCCCCCCATG	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFF	AS:i:0	NH:i:3	XM:i:0	NM:i:0	MD:Z:56	YS:i:0	YZ:A:+	Yf:i:0	ZS:i:0	XN:i:0	XO:i:0	XG:i:0
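For anyone wanting to pull Yf and YZ out of a record like this without extra dependencies, here is a minimal Python sketch. It only uses the optional fields of the record above (columns 12 onward); the helper name parse_tags is made up for illustration and is not part of HISAT-3N or SLAMdunk.

```python
# Optional fields copied from the HISAT-3N record above (SAM columns 12 onward).
optional_fields = [
    "AS:i:0", "NH:i:3", "XM:i:0", "NM:i:0", "MD:Z:56", "YS:i:0",
    "YZ:A:+", "Yf:i:0", "ZS:i:0", "XN:i:0", "XO:i:0", "XG:i:0",
]


def parse_tags(fields):
    """Turn SAM TAG:TYPE:VALUE strings into a dict (hypothetical helper).

    Integer ('i') and float ('f') values are cast; everything else stays a string.
    """
    tags = {}
    for field in fields:
        tag, typ, value = field.split(":", 2)
        if typ == "i":
            value = int(value)
        elif typ == "f":
            value = float(value)
        tags[tag] = value
    return tags


tags = parse_tags(optional_fields)
# Yf is the conversion count, YZ the 3N reference strand the read mapped to.
print(tags["Yf"], tags["YZ"], tags["MD"])
```

On a real file you would split each SAM line on tabs and pass everything after the eleventh column to the helper.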

And I'm preeetty sure you could use the output of hisat-3n-table

ref	pos	strand	convertedBaseQualities	convertedBaseCount	unconvertedBaseQualities	unconvertedBaseCount
chr1	10007	+		0	FFFFF	5
chr1	10013	+		0	FFFFFF	6
chr1	10019	+		0	FFFFFF	6
chr1	10025	+		0	FFFFFF	6
chr1	10031	+		0	FFFFFF	6
chr1	10037	+		0	FFFF:F	6
chr1	10043	+		0	FFFFFF	6
chr1	10049	+		0	FFFFF:	6
chr1	10055	+		0	FFFF,F	6

for the slamdunk logic directly, but since you're only preserving the T>C (or A>G) changes, you're lacking the error measurement... which means calculating it from the MD:Z tag, as you say

aleighbrown avatar Oct 22 '21 13:10 aleighbrown

Just following up,

I've just about got a script that can add the RA:Z and MP:Z tags to a HISAT-3N-aligned BAM.

Just perusing the tags on some BAMs I had previously aligned with NGM:

RA:Z:59,0,1,0,0,0,55,1,0,0,2,0,55,0,0,0,2,0,60,0,0,0,0,0,0	
MP:Z:2:38:38,16:70:70,10:80:80,7:110:110,10:189:189,16:202:202

So for the MP tag we have 6 mismatches: 2:38:38 16:70:70 10:80:80 7:110:110 10:189:189 16:202:202

The first value is the type, and the second and third values both appear to be the position in the read. Is that correct? Cheers and thanx
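One quick way to sanity-check that reading is to tally the types in the MP:Z tag and compare them with the counts at the same indices of RA:Z. A minimal Python sketch using only the two tag values quoted above (the helper name parse_mp is made up for illustration, and treating RA:Z as a 0-indexed array of per-type counts is an assumption):

```python
from collections import Counter

# Tag values copied from the NGM-aligned read quoted above.
ra = "59,0,1,0,0,0,55,1,0,0,2,0,55,0,0,0,2,0,60,0,0,0,0,0,0"
mp = "2:38:38,16:70:70,10:80:80,7:110:110,10:189:189,16:202:202"


def parse_mp(value):
    """Split an MP:Z value into (type, read_position, reference_position) tuples."""
    return [tuple(int(x) for x in entry.split(":")) for entry in value.split(",")]


mismatches = parse_mp(mp)
ra_counts = [int(x) for x in ra.split(",")]

# How often each type occurs in MP:Z.
mp_type_counts = Counter(conv_type for conv_type, _, _ in mismatches)

for conv_type, n in sorted(mp_type_counts.items()):
    # Assumption: RA:Z is indexed by the same type number, starting at 0.
    print(conv_type, n, ra_counts[conv_type])
```

For this read the counts agree for every type present in MP:Z (types 2, 7, 10 and 16), and the read and reference positions are identical in every entry, which is what you would expect for a read without indels or clipping (the CIGAR of this read is not shown, so that part is a guess).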

aleighbrown avatar Oct 29 '21 15:10 aleighbrown

Hi,

A detailed description is in the supplement of the paper.

TC:i:
Number of T>C mismatches in a read (A>G if read is on reverse strand)
RA:Z:
Comma-separated integer array, each position marking a specific conversion type.
Rows are the reference base, columns are the read base:
Reference \ Read    A    C    G    T    N
A                   0    1    2    3    4
C                   5    6    7    8    9
G                  10   11   12   13   14
T                  15   16   17   18   19
N                  20   21   22   23   24
MP:Z:
Comma-separated array of mismatch positions, each position 3 colon-separated
values in the format of <type>:<read position>:<reference position> where type is the
same as in the RA:Z tag
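
The type numbers in that matrix follow a simple pattern: five times the row index plus the column index, over the base order A, C, G, T, N. A minimal Python sketch, assuming rows are the reference base and columns the read base (this agrees with the read further down this thread carrying MD:Z:19C45 and MP:Z:7:20:20); the function names are made up for illustration:

```python
BASES = "ACGTN"  # row/column order of the conversion matrix


def decode_type(conv_type):
    """Map a conversion type (0-24) to (reference_base, read_base).

    Assumption: rows of the matrix are the reference base, columns the read base.
    """
    ref_index, read_index = divmod(conv_type, len(BASES))
    return BASES[ref_index], BASES[read_index]


def encode_type(ref_base, read_base):
    """Inverse of decode_type: (reference_base, read_base) to a type number."""
    return BASES.index(ref_base) * len(BASES) + BASES.index(read_base)


print(decode_type(16))        # ('T', 'C'), i.e. a T>C conversion
print(decode_type(2))         # ('A', 'G'), the reverse-strand equivalent
print(encode_type("C", "G"))  # 7
```

Types 0, 6, 12, 18 and 24 sit on the diagonal, so under this reading they are matches rather than mismatches.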

t-neumann avatar Nov 03 '21 12:11 t-neumann

Yep! I've read that supplement, but if you see the MP tags produced by NGM e.g.

MP:Z:2:38:38,16:70:70,10:80:80,7:110:110,10:189:189,16:202:202

How does a reference position of 38 make sense? My guess is that it means 38 bases from the leftmost position of the read

e.g. for this read

A00420:113:HK5VMDRXX:1:2108:16116:23077	83	chr1	10541	0	65M	=	10556	-50	CCGAAATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGA	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	RG:Z:TDPKD_1_12h	AS:i:625	NM:i:1	NH:i:0	XI:f:0.9846	X0:i:0	XE:i:14	XR:i:65	MD:Z:19C45	TC:i:0	RA:Z:11,0,0,0,0,0,20,1,0,0,0,0,21,0,0,0,0,0,12,0,0,0,0,0,0	MP:Z:7:20:20

the 20 in the reference position would refer to chr1:10541 + 20?
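
A rough way to test that guess is to rebuild the MP entry from SEQ and MD:Z for this read and see whether it matches. The Python sketch below is only an illustration, not SLAMdunk's or NGM's code: mp_from_md is a made-up helper, it only handles alignments without indels or clipping (this read is 65M), and it assumes the relative positions are 1-based, which is what MD:Z:19C45 suggests (19 matches, then the mismatch at base 20).

```python
import re

BASES = "ACGTN"  # base order of the conversion-type matrix


def mp_from_md(seq, md):
    """Rebuild (type, read_pos, ref_pos) entries from SEQ and MD:Z.

    Hypothetical helper: assumes a pure-match CIGAR (no indels, no clipping),
    so read and reference offsets are identical. Positions are 1-based.
    """
    entries = []
    pos = 0  # read bases consumed so far
    for run, ref_base in re.findall(r"(\d+)|([ACGTN])", md):
        if run:
            pos += int(run)  # stretch of matching bases
        else:
            pos += 1  # 1-based position of the mismatching base
            read_base = seq[pos - 1]
            conv_type = BASES.index(ref_base) * 5 + BASES.index(read_base)
            entries.append((conv_type, pos, pos))
    return entries


seq = "CCGAAATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGA"
start = 10541  # POS column of the read above

entries = mp_from_md(seq, "19C45")
mp_tag = ",".join(":".join(str(x) for x in entry) for entry in entries)
print(mp_tag)  # 7:20:20

# Assumption: ref_pos is 1-based relative to POS, hence the -1.
conv_type, read_pos, ref_pos = entries[0]
genomic = start + ref_pos - 1
print(genomic)  # 10560
```

The rebuilt entry matches the MP:Z tag on the read, which fits the relative-position reading; whether the genomic coordinate is POS + 20 or POS + 20 - 1 depends on the offset convention, and the -1 here is an assumption.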

aleighbrown avatar Nov 03 '21 13:11 aleighbrown

Yes, sorry, those should indeed be relative positions

t-neumann avatar Nov 03 '21 13:11 t-neumann