minimap2 icon indicating copy to clipboard operation
minimap2 copied to clipboard

Forcing continuous alignments

Open rlorigro opened this issue 3 years ago • 12 comments

Hi,

I am using minimap2 to map long ONT reads (up to 400Kbp), and I am finding that the supplementary chaining is producing large rearrangements in the read sequence, flipping strands, and jumping across regions. I have tried using the -m parameter to limit chaining, and several other parameters, but none of them seem to force the primary alignment to be linear and unidirectional.

I should emphasize that the reference I am aligning to is the same individual that the reads are derived from, so I expect there to be no SVs. Here is one example:

Read       Length Start  Stop      Region RefLength  Start      Stop
2626312    448022 112    922    +  chr2   242696747  130488450  130489279
2626312    448022 7047   7861   +  chr2   242696747  130489870  130490701
2626312    448022 41024  447988 +  chr2   242696747  130533922  130950797
2626312    448022 3105   3985   +  chr2   242696747  130885322  130886221
2626312    448022 12041  40959  -  chr2   242696747  131672947  131702229
2626312    448022 5227   5854   -  chr2   242696747  131710435  131711099
2626312    448022 2039   2944   -  chr2   242696747  131707727  131708642

Is there a parameter (or combination of parameters) that would help with this? I think what is needed is to allow the primary alignment to extend further.

Thanks

rlorigro avatar Jan 19 '21 17:01 rlorigro

I should also add that this is a repetitive region, and that I have been using winnowmap to accommodate for that.

rlorigro avatar Jan 19 '21 18:01 rlorigro

What is the winnowmap2 output?

lh3 avatar Jan 19 '21 19:01 lh3

The example I posted was using winnowmap already. In that case, I aligned HG002 reads to CHM13 v1.0, but in future runs I aligned HG002 reads to the HiFiasm HG002 assembly (~190x CCS coverage from here) and observed a similar problem, in other reads. More detailed information:

Command:

meryl count k=15 output merylDB /home/ryan/data/hifi/human/hg002/HG002.mat.fa
meryl print greater-than distinct=0.9998 merylDB > repetitive_k15.txt

winnowmap \
-W /home/ryan/data/hifi/human/hg002/repetitive_k15.txt \
-x map-ont \
-t 7 \
--secondary=no \
/home/ryan/data/hifi/human/hg002/HG002.mat.fa \
/home/ryan/data/nanopore/human/test/overlap_guppy_360_tangle_subset_1/HG002-Guppy-3.6.0-UL-run3-subset1.fasta \
> HG002-Guppy-3.6.0-UL-run3-subset1_VS_HG002_mat_primary.paf

stderr:

[M::mm_idx_gen::0.035*0.04] reading downweighted kmers
[M::mm_idx_gen::0.065*0.24] collected downweighted kmers, no. of kmers read=66775
[M::mm_idx_gen::0.065*0.24] saved the kmers in a bloom filter: hash functions=2 and size=960072 
[M::mm_idx_gen::98.885*1.08] collected minimizers
[M::mm_idx_gen::100.291*1.16] sorted minimizers
[M::main::100.291*1.16] loaded/built the index for 546 target sequence(s)
[M::main::100.291*1.16] running winnowmap in SV-aware mode
[M::main::100.291*1.16] stage1-specific parameters minP:1000, incP:4.00, maxP:16000, sample:1000, min-qlen:10000, min-qcov:0.5, min-mapq:5, mid-occ:5000
[M::main::100.291*1.16] stage2-specific parameters s2_bw:2000, s2_zdropinv:25
[M::mm_idx_stat] kmer size: 15; skip: 50; is_hpc: 0; #seq: 546
[M::mm_idx_stat::100.479*1.16] distinct minimizers: 23589668 (41.44% are singletons); average occurrences: 5.194; average spacing: 25.013
[M::worker_pipeline::111.965*1.78] mapped 2273 sequences
[M::main] Version: 2.0, pthreads=7, omp_threads=3
[M::main] CMD: winnowmap -W /home/ryan/data/hifi/human/hg002/repetitive_k15.txt -x map-ont -t 7 --secondary=no /home/ryan/data/hifi/human/hg002/HG002.mat.fa /home/ryan/data/nanopore/human/test/overlap_guppy_360_tangle_subset_1/HG002-Guppy-3.6.0-UL-run3-subset1.fasta
[M::main] Real time: 112.009 sec; CPU: 199.570 sec; Peak RSS: 3.884 GB

And some weird alignments I observed:

Jumping between contigs (not bridging contig termini):

736840	41280	2057	39942	+	h2tg000102l	36105583	464829	502652	7336	38234	60	tp:A:P	cm:i:541	s1:i:7250	s2:i:802	rl:i:0
736840	41280	75	968	+	h2tg000005l	111658231	629986	630886	327	910	60	tp:A:P	cm:i:27	s1:i:324	s2:i:0	rl:i:0

Reversing strand on the same contig:

207028	74088	33102	73900	+	h2tg000102l	36105583	3378	44916	7102	41787	60	tp:A:P	cm:i:510	s1:i:6921	s2:i:0	rl:i:0
207028	74088	70	36902	-	h2tg000102l	36105583	43512	80920	5749	37648	60	tp:A:P	cm:i:415	s1:i:5587	s2:i:194	rl:i:0

Jumping intra + inter contig

2027405	47862	34	37932	-	h2tg000005l	111658231	106444	145003	6510	38787	60	tp:A:P	cm:i:481	s1:i:6327	s2:i:968	rl:i:0
2027405	47862	39262	39765	-	h2tg000005l	111658231	712065	712577	90	519	56	tp:A:P	cm:i:6	s1:i:85	s2:i:0	rl:i:0
2027405	47862	41040	47808	-	h2tg000102l	36105583	146693	153675	614	7023	60	tp:A:P	cm:i:43	s1:i:561	s2:i:0	rl:i:0

rlorigro avatar Jan 19 '21 20:01 rlorigro

There could still be SVs because the reads are not binned, and this is the maternal haplotype from HiFiasm, but in general I would just like to forbid chaining with large gaps and favor extending the primary alignment.

rlorigro avatar Jan 19 '21 20:01 rlorigro

If you can give me just these reads, I can have a look. Nonetheless, I suspect those are the best you can get.

I should emphasize that the reference I am aligning to is the same individual that the reads are derived from, so I expect there to be no SVs.

Assembly can be wrong. Reads can be chimeric both locally and globally.

lh3 avatar Jan 19 '21 23:01 lh3

OK, here are the 3 reads from above: https://rlorigro-public-files.s3-us-west-1.amazonaws.com/reads/guppy_360/misaligned_reads_guppy_3.6.0_HG002.fasta

There are many more examples like these so I would be surprised if they are all chimeras. These reads are also dumped from Shasta, which is usually done after some filtering for chimerism. I think the issue is that there are repeats in this region, and the primary/supplementary segments are being split amongst the repeats.

Also I noticed that winnowmap is running with "sv-aware mode", but I couldn't find more info about this. Does this affect chaining and supplementaries? Maybe a question for @cjain7

rlorigro avatar Jan 20 '21 00:01 rlorigro

I guess --sv-aware implements the algorithm in the winnowmap2 preprint.

lh3 avatar Jan 20 '21 00:01 lh3

Ok i will take a closer look at the paper. I just tried rerunning with --sv-off and it appears to have reduced these rearrangements.

rlorigro avatar Jan 20 '21 00:01 rlorigro

With sv-off I do still get some strange ones though, unfortunately:

2497450	60236	146	29307	-	h2tg000005l	111658231	848222	878809	3055	30728	46	tp:A:P	cm:i:219	s1:i:2764	s2:i:2331	rl:i:636
2497450	60236	12923	60193	-	h2tg000102l	36105583	312140	361719	8282	49837	60	tp:A:P	cm:i:616	s1:i:7919	s2:i:543	rl:i:636

After looking some more at the paper, I am wondering if disabling sv-aware mode will defeat the purpose of using winnowmap over minimap.

rlorigro avatar Jan 20 '21 02:01 rlorigro

@rlorigro wfmash -m -N -p 90 will generate contiguous approximate mappings of nanopore reads. You can remove -m to get a base-level alignment obtained by global pairwise alignment of each mapping result. wfmash lacks mapping quality, but it could provide you a baseline here. We're just beginning tests on non-assembly inputs, so please let me know if you try it and it works for you.

ekg avatar Mar 09 '21 08:03 ekg

Thanks @ekg, this is interesting. As it turns out, I've been digging into these issues more and have found that @lh3 was right about many of them being chimeric. A majority of the chimeras were palindromes, where they reverse direction about halfway through the sequence, and a smaller portion of them were conventional chimeras, where a chunk of the read appears to have been a concatenation of totally unrelated sequence. For the palindromes, it seems that nanopore has issues with pulling in the complementary strand directly after the first strand finishes translocating, and the signal segmenter doesn't catch the transition, so both strands of the signal are sent to the basecaller as a single read.

There are occasionally still some unexplained artifacts, where supplementaries are almost entirely contained inside other alignments, but fortunately the mapQ score for these is very low or 0 in the cases I have observed so far.

So overall I would say the mappings were not issues with winnowmap (if SV mode is turned off), though I think it should still be possible to force more strict chaining.

rlorigro avatar Mar 09 '21 19:03 rlorigro

To be more specific about the remaining non-chimeric artifacts I mentioned, here are 2 examples:

14617	59497	29	59436	+	h2tg000052l	87420292	61492027	61545226	18078	59881	60	tp:A:P	cm:i:1359	s1:i:17961	s2:i:255	rl:i:1688
14617	59497	38972	44839	+	h2tg000043l	49469036	33482998	33488884	1103	5906	2	tp:A:P	cm:i:83	s1:i:1097	s2:i:1085	rl:i:1688
14617	59497	36265	41733	-	h2tg000036l	139977256	27376259	27381830	75	5589	14	tp:A:P	cm:i:5	s1:i:50	s2:i:0	rl:i:1688
1181401	72122	105	72059	-	h2tg000052l	87420292	61507216	61573955	10694	73316	60	tp:A:P	cm:i:775	s1:i:10400	s2:i:40	rl:i:1116
1181401	72122	43191	48674	-	h2tg000052l	87420292	43644377	43649923	488	5563	0	tp:A:P	cm:i:33	s1:i:472	s2:i:472	rl:i:1116
1181401	72122	40607	46177	-	h2tg000077l	81620956	67987710	67993283	60	5590	13	tp:A:P	cm:i:4	s1:i:51	s2:i:0	rl:i:1116

They aren't overlapping by more than 50% in the reference (as specified by the -M parameter), but in the query coordinate space, some of these mappings are contained entirely inside other mappings. The final supplementary alignment of read 1181401 has a surprisingly high mapQ of 13 despite this.

rlorigro avatar Mar 09 '21 20:03 rlorigro