
`mygfa` doesn't parse optional CIGAR strings

Open susan-garry opened this issue 2 years ago • 3 comments

There are a number of GFA features that mygfa doesn't account for yet, so I'm not sure how high a priority this fix should be, but this issue is preventing mygfa from parsing the GFA files that odgi generates.

Note that this is the gfa specification that I'm using as a reference: http://gfa-spec.github.io/GFA-spec/GFA1.html

Essentially, mygfa can't parse certain CIGAR strings, which (as far as I can tell) specify the alignment of two segments. The issue shows up wherever an "alignment" string appears, for example in links:

L 1 + 2 + 0M

and paths:

P path1 1+,2+,2+ 0M,0M

The last column of these lines represents a CIGAR string (or list of CIGAR strings). My understanding is that in either case, this string can be replaced with *:

L       1       +       2       +       *
P       path1   1+,2+,2+         *

Which indicates that the overlap is unspecified. According to the docs, if unspecified, "the CIGAR strings are determined by fetching the CIGAR string from the corresponding link records, or by performing a pairwise overlap alignment of the two sequences." I'm not yet sure what the latter is or how difficult it would be to accomplish, but this suggests that in order to support this, we may want to pre-process gfa files and sort the lines by type so that we parse Path lines after Link lines.
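To make the proposal concrete, here is a minimal Python sketch of both pieces: treating the last column as either `*` or a CIGAR string (or comma-separated list of them), and reordering lines so that Links come before Paths. Every name here (`parse_cigar`, `parse_overlaps`, `sort_by_record_type`) is hypothetical and not part of mygfa, and the set of operation letters is my reading of the GFA1 spec rather than anything mygfa currently enforces.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical helpers for illustration only; not mygfa's actual API.
# Assumption: a CIGAR string is one or more <length><op> pairs,
# with op drawn from MIDNSHPX=.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHPX=])")

Cigar = List[Tuple[int, str]]


def parse_cigar(field: str) -> Optional[Cigar]:
    """Return None for "*" (unspecified), else a list of (length, op) pairs."""
    field = field.strip()
    if field == "*":
        return None
    ops: Cigar = []
    pos = 0
    for match in CIGAR_OP.finditer(field):
        if match.start() != pos:
            break  # junk between operations; rejected below
        ops.append((int(match.group(1)), match.group(2)))
        pos = match.end()
    if not ops or pos != len(field):
        raise ValueError(f"malformed CIGAR string: {field!r}")
    return ops


def parse_overlaps(field: str) -> Optional[List[Optional[Cigar]]]:
    """Parse the last column of a P line: "*" or a comma-separated CIGAR list.

    Stray whitespace around the commas is tolerated.
    """
    if field.strip() == "*":
        return None
    return [parse_cigar(part) for part in field.split(",")]


def sort_by_record_type(lines: List[str]) -> List[str]:
    """Order lines as H, S, L, P so Links are seen before Paths.

    sorted() is stable, so lines of the same type keep their relative order;
    unknown record types go last.
    """
    order = {"H": 0, "S": 1, "L": 2, "P": 3}
    return sorted(lines, key=lambda line: order.get(line[:1], len(order)))
```

With something like this, a Path whose overlaps come back as `None` could look up the CIGAR strings from the already-parsed Links; the pairwise overlap alignment fallback is not covered here.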

@anshumanmohan , based on your knowledge of overlap, does this sound doable? How much of a priority should this be?

susan-garry, Jul 17 '23 23:07

We can parse these; we just don't do a particularly careful job because odgi doesn't seem to either. See https://github.com/cucapra/pollen/pull/80 for more.

> but this issue is preventing mygfa from parsing odgi's generated gfa files.

Is there a specific point where this seems to be breaking? Could you please say more, or maybe push a minimal breaking example in a branch?

anshumanmohan, Jul 17 '23 23:07

Happy to be outvoted on this, but I don't think that going over the paths and actually computing overlaps is of interest to us. If that's a feature you're proposing, I'd put that at rather low priority. If something is breaking because of the current treatment, that's high priority for sure!

anshumanmohan, Jul 17 '23 23:07

I think this is another "YAGNI" situation: let's parse these if (and only if) we know of a specific odgi command that needs to know about them. @susan-garry, did you have a specific command that needs to process overlaps?

sampsyo, Jul 18 '23 17:07