minimap2
minimap2 copied to clipboard
Question about matching minimizers in all-vs-all overlaps of ONT RNA-seq reads
Hi Heng,
I use the -x ava-ont
option in minimap2 to get all-vs-all overlaps of ONT RNA-seq reads.
I also output the matching minimizers’ positions in each read for a matched read-pair.
In the output file, I found there are large gaps between groups of matching minimizers in each read. For example, in the following matched read-pair (RNA-seq reads), there are 3 large gaps (~300, 168, and 389 bp) and 3 medium/small gaps (~65, 48, and 21 bp).
e86d4fab-e29d-4c34-b741-c0cb61f58051 1270 195 1232 + 30bde1d1-372a-47ae-9901-16319f81d340 1081 21 1048 136 1037 0 cq:B:I,209,210,218,220,224,524,525,530,532,535,703,705,706,707,708,1097,1162,1210,1231 cr:B:I,35,36,44,46,50,348,349,354,356,359,526,528,529,530,531,920,982,1027,1047 tp:A:S cm:i:19 s1:i:136 dv:f:0.1556 rl:i:889
where cq
and cr
tags give the matching minimizers’ ending positions on the query and target reads, respectively. Since I use -k15 -w5
, I consider the minimizers’ distance > 10 as gaps.
Originally, I thought the large gaps may be caused by alternative splicing, while small gaps may be due to sequencing errors. However, when I use -x ava-ont
to get all-vs-all overlaps of ONT DNA reads, I also found there are large gaps between groups of matching minimizers in each read. For example, in the following matched read-pair (DNA reads), there are many large gaps --- some gaps are as large as ~1328 and 1875 bp.
1b113957-72f0-4a56-86bd-f5f6b28be178 7489 954 6651 + 19a4fae9-983b-4ccb-a13b-cc52c7ec10b6 11312 701 6451 365 5765 0 cq:B:I,968,970,1034,2362,2732,2895,3147,3151,3154,3274,5125,5126,5127,5128,5254,5666,5673,5783,5788,5790,5823,5960,5964,6019,6020,6022,6024,6025,6028,6033,6302,6306,6435,6440,6442,6445,6500,6523,6528,6547,6549,6650 cr:B:I,715,717,785,2110,2482,2654,2910,2914,2917,3043,4918,4919,4920,4921,5050,5455,5462,5575,5580,5582,5614,5750,5754,5811,5812,5814,5816,5817,5820,5825,6099,6103,6233,6238,6240,6243,6300,6326,6331,6350,6352,6450 tp:A:S cm:i:42 s1:i:352 dv:f:0.2039 rl:i:4566
So I have the following questions and would appreciate your advice: (1) What do these large gaps between minimizers mean? Does a large gap mean there are no matching minimizers between two reads in this particular region? Or are these gaps related to the algorithm itself?
(2) I spot checked 20 matched read-pairs of RNA-seq reads. I found that in all these 20 read-pairs, both reads of a read-pair have the same structure of gaps in matching minimizers. For example, in the above first example read-pair, both reads have the structure of “5-minimizers + gap + 5-minimizers + gap + 5-minimizers + gap + 1-minimizer + gap + 1-minimizer + gap + 1-minimizer + gap + 1-minimizer”. I found that the corresponding gap sizes are also very close in both reads of a read-pair; the largest gap-size difference between the two reads is 24 bp in these 20 read-pairs. Do the two reads of a matched read-pair always have the same structure of gaps in matching minimizers?
(3) Since RNA-seq reads may come from different isoforms, I was expecting, in some matched read-pairs, the two reads would have different structures of gaps in matching minimizers if one read skips an exon while the other read does not (or if one read retains an intron while the other read does not). For example, if read-1 has region-1, region-2, and region-3 sequentially, while read-2 has only region-1 and region-3 due to alternative splicing, I would expect there is a large gap in matching minimizers between region-1 and region-3 in read-1, while there is no such a gap between region-1 and region-3 in read-2. However, so far I have not found a matched read-pair in which the two reads have different structures of gaps in matching minimizers. So I was wondering if this is the way the algorithm was designed or it is just that I have not encountered such a matched read-pair from different isoforms?
(4) If a large gap means there are no matching minimizers between two reads in this particular region, what would be the cause of these gaps? If a gap may be caused by either sequencing errors or alternative splicing, it would be difficult to determine whether the two reads come from different isoforms. In this case, can we at least say the two reads come from different isoforms if they have different structures of gaps in matching minimizers or if they have a significant gap-size difference (e.g. > 50) in any of their corresponding gaps?
Thank you very much!
I would greatly appreciate your advice. I look forward to hearing from you.
Thank you very much!
@lauraht Hi, I have similar issues, do you have the answers? And may I ask what parameters you used to match reads in cDNA RNA-seq ONT data? Is the default ava-ont
?