splice2bed sometimes creates BED files with a blockSize of 0

Open SichongP opened this issue 3 years ago • 3 comments

Minimap2 can sometimes output alignment with a CIGAR string like this:

...320400N301I213807N...

When splice2bed converts this alignment to a BED file, it creates a block of size 0. This happens because insertions are (rightfully) skipped when the CIGAR string is processed, since they consume no reference bases. But when an I segment sits between two N segments, the block between the two introns is left empty. This can break many programs that work with BED12 files, as 0-length blocks don't seem to be part of the standard.
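To make the mechanism concrete, here is a minimal Python sketch of a CIGAR walk that closes a block at every N. It is not the paftools.js code, only an illustration of the behaviour described above; the function name `cigar_to_blocks` and the toy CIGAR are made up for this example.

```python
import re

# CIGAR operations that advance along the reference sequence.
REF_CONSUMING = set("MD=X")


def cigar_to_blocks(cigar):
    """Return (block_sizes, block_starts), relative to the alignment start."""
    sizes, starts = [], []
    ref_pos = 0      # current offset on the reference
    block_start = 0  # offset at which the current block began
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        length = int(length)
        if op == "N":
            # An intron closes the current block, even when it is empty.
            sizes.append(ref_pos - block_start)
            starts.append(block_start)
            ref_pos += length
            block_start = ref_pos
        elif op in REF_CONSUMING:
            ref_pos += length
        # I, S, H and P consume no reference bases, so they are skipped.
    # Close the final block.
    sizes.append(ref_pos - block_start)
    starts.append(block_start)
    return sizes, starts


# Toy CIGAR (hypothetical): an insertion flanked by two introns.
sizes, starts = cigar_to_blocks("10M100N5I200N20M")
print(sizes)   # [10, 0, 20]
print(starts)  # [0, 110, 310]
```

The middle block has size 0 because nothing between the two N operations advances the reference position.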

I'm not sure if skipping the 301I block is a good idea since that obviously drops some useful information. Have you considered this situation? What are your thoughts?

SichongP avatar May 06 '21 02:05 SichongP

Do you have the sequences that trigger the issue? Thanks.

lh3 avatar May 10 '21 16:05 lh3

Hi! I have the same issue on multiple sequences. I can't use bedToBigBed to include my data in a UCSC track hub. Each time, it's the same N->I->N motif in the CIGAR. Here is a short one:

000113F 421344 470823 0252ea69-0582-4d1a-a6b7-3050906cbda6 20 + 421344 470823 255,0,0 3 282,0,15 0,385,49464

With the CIGAR (the N, I, N run is set off by spaces): 145S22M1I65M19D84M1I33M1I25M1D29M4D 103N 343I 49079N 15M204S

Here is the sequence : 0252ea69-0582-4d1a-a6b7-3050906cbda6 0 000113F 421345 20 145S22M1I65M19D84M1I33M1I25M1D29M4D103N343I49079N15M204S * 0 0 TTGCGCGTTCGGCCCCAAGTTTGGGTGTTTATGGATAATATTCTGTTGACCAGGTAGAAAGAAGCGAAGAATCGGAACTTGCCCTGTCGCTCTATCTTCGGCGTCTGCTTGGGTGTTGCCCTAGTAGTGGTATCTTCCTGACGGGGTTTTCATCTGACCTGACCTGCTTCTACAATACAATTCAAGACAAGTCACCGTATTTCAATTTTCCAGAAGTCAAAATGGATGAATCAAGAAGCGAGGTAAGTAATAATCAAAATATTAAATGAAAACGTTTTTGACTTTTGATCGATTGTAGTCATTGGTCTTAACTATTCAAAAAGATTTTCAATTGCTGGCCAATTGACTAAGTTTATCAGTCAAAATTTGCTCAGTGGGCCCCTCCCCCCCGCACTAATGACTCCAACAAAGCTGTACGACGTGTGTGAGCCTTTCGGGTCCGCTGCTCCACTCATTGGGCAATTTAACATTTAAAACACCATTTATGGACGGGACGAACGCGGCCGCGACGAAGCAGGTATACAGTCCTTATAGCCTATCAGGCGAGTGGGTGTCCAGATGGTTAATAAAAAAAGTGATTCAGACATCCCATAATATGATCCTGGCACTTGCTTCTTCACGCGTGACTACGACGTATTCCTTCTTTCTTCTTCTTCTTCTGTTCTGATATTCATAAAAAAAAAAAATGTTCCCAAAAGAAAATTTCCGAAAAAAAAAAAATTTTCCGAAAATTTTTTTCGAAAAAAAAGAAAAAATTTGAAAAAAAACAAATTCCGAAAAAAAAATTTGAAAAAAATCCGAAAAAAAATTCCAAAACAAAAAACATTTTCAAAAAATTCACAAAAAAAAAAAAAAAAAAAAAAAAAGTACTCTGCAGTTGATACACTGCTTAGGTTAAACACCCAAAGCGAACACCGCAATATATCAGCACCAACAGAAATCGATTCTGCTTCTTTCTACCTGGTCAC 
&&$"$&')$%$%%(#$%$&-235=:=@9442/''&&&')(&%$&77>:7898::=:8650)1$$$)/))7//0(21:;>>A88;;2AB;=@E/>=:<;67==>;9995631./.$,%%%%-%%&%$%&$#$(&$))/>CEB87::84224928:9:=<7<.:;;646A=@@?>A:@D::=?>AD>(0()8-;:24==?@@DFEBAA4-.,$,-(&%)'')-59:<A@>51/413::;@A@AAGAGA<=<ABB?>A@BB>8346/0/2447444100256<7CA@@=2-0++5770+(:LKJDD@BCG@??EB??@BAA>@B<AA=BD?A>433863/0.0:@:>;-1((+3374+E@J;785%%.%,625925?:=46D@A?74;934;9B@8?@9884-+()+/(%)6985,-1/01301289<>@29327A>E=;1><=@;<=<,-3<>A@A>59(.-$(1.1$%%'&$$$$%%%($)&%%%&/(&+(,)#))1356(&&)1552-1+(&'-'&&(%$&$'''(,-3234/=>D@;7,-?>@@@/++0-(-54-+089>3;::@:9:808<5:<;<>;70$,/64;9?<?@4,1510'''+-.&$)&%%//4967;9:?<@<7B@>B8>>>>>=:7)-=?CD>???85,%+>>B9:AAC;'(56857::875;;@=>=<441.)%4<>26//074+05432.28;;BA<5(%/>?FG;&<97=BEFED<9(::87;:99D@@CD71,&)9::@@JJI<%-<4=AABA9/%$.210.06..+49;4-+65:A09>@EE0+/?<%%@@EBAA@;=>A??8831-+)(&%%$)/..../,.$$-%$$(#$%%&%&&''/6854502023*$%%'&%$(%))'(&$%%3335<<A>?AHG0,-,.>?:96653-2344:=34=?89B>:<.0% s1:i:179 s2:i:90 NM:i:382 AS:i:93 de:f:0.0679 rl:i:186 cm:i:29 nn:i:0 tp:A:P ms:i:189 ts:A:+

sophielemoine avatar May 31 '21 14:05 sophielemoine

Here is one of the sequences: transcript-1111.txt

The corresponding bed record is:

chr7	21346632	22670534	transcript/1111	1000	-	21346632	22670534	0,128,255	10	1,6,2,0,9,10,2,3,4171,513,	0,231414,297920,618322,832129,973035,1096764,1198540,1317866,1323389,
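As a stopgap on the BED side, a record like this could be post-processed to drop the empty block before handing it to downstream tools. The Python sketch below shows one way to do that. The helper `drop_empty_blocks` is hypothetical and not part of minimap2 or paftools.js. It assumes tab-separated BED12 input and that an empty block is never the first or last block, so chromStart and chromEnd stay valid. It also discards the insertion entirely, so it does not address the loss of information raised above.

```python
def drop_empty_blocks(bed12_line):
    """Remove zero-length blocks from one tab-separated BED12 line."""
    fields = bed12_line.rstrip("\n").split("\t")
    # Columns 11 and 12 (0-based 10 and 11) may end with a trailing comma.
    sizes = [int(x) for x in fields[10].rstrip(",").split(",")]
    starts = [int(x) for x in fields[11].rstrip(",").split(",")]
    # Keep only (size, start) pairs whose size is positive.
    kept = [(size, start) for size, start in zip(sizes, starts) if size > 0]
    fields[9] = str(len(kept))  # blockCount
    fields[10] = ",".join(str(size) for size, _ in kept) + ","
    fields[11] = ",".join(str(start) for _, start in kept) + ","
    return "\t".join(fields)


# The record reported above, rebuilt field by field.
record = "\t".join([
    "chr7", "21346632", "22670534", "transcript/1111", "1000", "-",
    "21346632", "22670534", "0,128,255", "10",
    "1,6,2,0,9,10,2,3,4171,513,",
    "0,231414,297920,618322,832129,973035,1096764,1198540,1317866,1323389,",
])
print(drop_empty_blocks(record))
```

For this record the block count goes from 10 to 9, and the block starting at offset 618322 is removed.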

SichongP avatar Nov 10 '21 20:11 SichongP