minimap2 Question about matching minimizers in all-vs-all overlaps of ONT RNA-seq reads

Hi Heng,

I use the -x ava-ont option in minimap2 to get all-vs-all overlaps of ONT RNA-seq reads.

I also output the matching minimizers’ positions in each read for a matched read-pair.

In the output file, I found there are large gaps between groups of matching minimizers in each read. For example, in the following matched read-pair (RNA-seq reads), there are 3 large gaps (~300, 168, and 389 bp) and 3 medium/small gaps (~65, 48, and 21 bp).

e86d4fab-e29d-4c34-b741-c0cb61f58051    1270    195     1232    +       30bde1d1-372a-47ae-9901-16319f81d340    1081    21      1048    136     1037    0       cq:B:I,209,210,218,220,224,524,525,530,532,535,703,705,706,707,708,1097,1162,1210,1231        cr:B:I,35,36,44,46,50,348,349,354,356,359,526,528,529,530,531,920,982,1027,1047 tp:A:S  cm:i:19 s1:i:136     dv:f:0.1556      rl:i:889

where cq and cr tags give the matching minimizers’ ending positions on the query and target reads, respectively. Since I use -k15 -w5, I consider the minimizers’ distance > 10 as gaps.

Originally, I thought the large gaps may be caused by alternative splicing, while small gaps may be due to sequencing errors. However, when I use -x ava-ont to get all-vs-all overlaps of ONT DNA reads, I also found there are large gaps between groups of matching minimizers in each read. For example, in the following matched read-pair (DNA reads), there are many large gaps --- some gaps are as large as ~1328 and 1875 bp.

1b113957-72f0-4a56-86bd-f5f6b28be178    7489    954     6651    +       19a4fae9-983b-4ccb-a13b-cc52c7ec10b6    11312   701     6451  365      5765    0       cq:B:I,968,970,1034,2362,2732,2895,3147,3151,3154,3274,5125,5126,5127,5128,5254,5666,5673,5783,5788,5790,5823,5960,5964,6019,6020,6022,6024,6025,6028,6033,6302,6306,6435,6440,6442,6445,6500,6523,6528,6547,6549,6650 cr:B:I,715,717,785,2110,2482,2654,2910,2914,2917,3043,4918,4919,4920,4921,5050,5455,5462,5575,5580,5582,5614,5750,5754,5811,5812,5814,5816,5817,5820,5825,6099,6103,6233,6238,6240,6243,6300,6326,6331,6350,6352,6450 tp:A:S  cm:i:42 s1:i:352        dv:f:0.2039     rl:i:4566

So I have the following questions and would appreciate your advice: (1) What do these large gaps between minimizers mean? Does a large gap mean there are no matching minimizers between two reads in this particular region? Or are these gaps related to the algorithm itself?

(2) I spot checked 20 matched read-pairs of RNA-seq reads. I found that in all these 20 read-pairs, both reads of a read-pair have the same structure of gaps in matching minimizers. For example, in the above first example read-pair, both reads have the structure of “5-minimizers + gap + 5-minimizers + gap + 5-minimizers + gap + 1-minimizer + gap + 1-minimizer + gap + 1-minimizer + gap + 1-minimizer”. I found that the corresponding gap sizes are also very close in both reads of a read-pair; the largest gap-size difference between the two reads is 24 bp in these 20 read-pairs. Do the two reads of a matched read-pair always have the same structure of gaps in matching minimizers?

(3) Since RNA-seq reads may come from different isoforms, I was expecting, in some matched read-pairs, the two reads would have different structures of gaps in matching minimizers if one read skips an exon while the other read does not (or if one read retains an intron while the other read does not). For example, if read-1 has region-1, region-2, and region-3 sequentially, while read-2 has only region-1 and region-3 due to alternative splicing, I would expect there is a large gap in matching minimizers between region-1 and region-3 in read-1, while there is no such a gap between region-1 and region-3 in read-2. However, so far I have not found a matched read-pair in which the two reads have different structures of gaps in matching minimizers. So I was wondering if this is the way the algorithm was designed or it is just that I have not encountered such a matched read-pair from different isoforms?

(4) If a large gap means there are no matching minimizers between two reads in this particular region, what would be the cause of these gaps? If a gap may be caused by either sequencing errors or alternative splicing, it would be difficult to determine whether the two reads come from different isoforms. In this case, can we at least say the two reads come from different isoforms if they have different structures of gaps in matching minimizers or if they have a significant gap-size difference (e.g. > 50) in any of their corresponding gaps?

Thank you very much!

Oct 09 '21 20:10 lauraht

I would greatly appreciate your advice. I look forward to hearing from you.

Thank you very much!

Oct 21 '21 17:10 lauraht

@lauraht Hi, I have similar issues, do you have the answers? And may I ask what parameters you used to match reads in cDNA RNA-seq ONT data? Is the default ava-ont?

May 17 '23 06:05 alexyfyf

minimap2 minimap2 copied to clipboard

Question about matching minimizers in all-vs-all overlaps of ONT RNA-seq reads

minimap2
minimap2 copied to clipboard