minimap2 icon indicating copy to clipboard operation
minimap2 copied to clipboard

Order of operations in CIGAR

Open baraaorabi opened this issue 11 months ago • 4 comments

I found that Minimap2 sometimes generates an unusual order of CIGAR opertations; with some Ns, followed by some Is then followed by more Ns instead of the more expected order of having all the Ns combined together.

Here is an example that produces this result:

$ minimap2 -c -x splice -t 32 data/refs/homo_sapiens.dna.fa <(echo -e ">1\nCATGGACTCATCCAGGATACATCAGGAAC
TCAATGGCAAAAAAATCCAATTTAAAAATGAACAAAATACCTAAATAGACACCTCTCAAAGAAGAAACACAAGGCCAGGCGCAGTGGCTCATGCCTATAATCCCAGCACTTTGGGAGGCTGAGGCAGGAGGATTGCCTGAGCTCGGGAGTTCGAGACC
AGCCTGGGCAACACGGTGAAACCCCGTCTCTACTAAAAATACAAAAATTAGCTGGGCATGGTGGCGCACGCGTGTACGCCAGCTACTGGAAGGCTGAGACAGGAGAATCACTGGAACCTGGGAGGCGGAGGTTTCAGTAAGCCAAGATTGCACCACTG
CACTCCAGCCTGGGTGACAAAGCAAGACTCCATCTCAAAAAAAAAAAAAAAAAAAAAAAAAGCTAGATTCACAACATCGATGTTCCGGAGATCGGAAGAGCGTCGTGTAGAGGTTAAACACCCAAGCAGACGCCGCAATATCAGCACCAACAGAAAAC
AGACGACTACAAACGGAATCGAGTCACGGTAGGCGATAATCATAAACACCAGCTCCTAGCAACTGAACGAAGCACACTGATGACAAGAAAGTTGTCGGTGTCTTTGTGACTTGCCTGTCGCTCTATCTTCGGCGTCTGCTTGGGTGTTTAACCTCTGC
CACGACGCTCTTCCGATCTCAGCAGCAGATGCTTCTATTCTCTCGGTTTTTTTTTTTTTTTTTTTTTTTTTTACATTTCAAAATATTTAACAAAGTCAAACTTTCTCACCATGGTTTCAGTTTAGTGGAAGCATTTACTAAAGTACAAAAAGCCTCAG
AAAACGTGATGGGCAATATCTGGGCCCCAAGTTACCAGAAAGGGCACCAGCCAATATAGCACTGGCAGAGGTTTTCATGGGATGTCGCTTGTTTGATGAGCAGCTCAACTTGCGTTGGAACATTCAAAGTGTCATCATGAGAGAAGTCCCGACCAGTG
AGCTTATCTCTGAACCCTGTTAATAATCTGATAGCTTTTCTTCCTGGCGTGTACTCTGCGTTGATACCACCAAGCTAGGTTAAACACCCAAGCAGACGCC")
[M::mm_idx_gen::89.011*1.72] collected minimizers
[M::mm_idx_gen::99.459*3.62] sorted minimizers
[M::main::99.459*3.62] loaded/built the index for 194 target sequence(s)
[M::mm_mapopt_update::106.313*3.45] mid_occ = 765
[M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 194
[M::mm_idx_stat::109.605*3.38] distinct minimizers: 167225302 (35.46% are singletons); average occurrences: 6.030; average spacing: 3.074; total length: 3099750718
1       1077    101     1015    +       1       248956422       10975125        11109332        438     920     39      NM:i:482        ms:i:332        AS:i:182      nn:i:0  ts:A:-  tp:A:P  cm:i:50 s1:i:219        s2:i:101        de:f:0.0894     rl:i:320        cg:Z:61M2I99M2D1M1I9M1D24M183N437I131864N72M1I11M2D14M1I32M680N106M1003N15M1I18M1D9M\

You can see that after the first exon, 101S61M2I99M2D1M1I9M1D24M, there are 183N skip, followed by 437I insertion, followed by another skip of 131864N.

baraaorabi avatar Sep 27 '23 20:09 baraaorabi

P.S.: This might be related to #502?

baraaorabi avatar Sep 27 '23 20:09 baraaorabi

Yes, related but not the same. We may move 437I around in this case, though there will still be a CIGAR like xxxIyyyN, which is hard to resolve.

lh3 avatar Nov 02 '23 14:11 lh3

Can it not be resolved by linear time postprocessing? Something like anochoring the (mis)matches in the CIGAR string, and then sorting-and-merging any other CIGAR operations between (mis)match anchors?

baraaorabi avatar Nov 03 '23 18:11 baraaorabi

No, can't.

lh3 avatar Nov 03 '23 18:11 lh3