minigraph requires GFA 1.0 overlaps, which are optional in the spec

Open edawson opened this issue 4 years ago • 4 comments

The GFA specs state that overlaps for L lines are optional, but minigraph seems to require these. I think this happens because there is no way to avoid parsing a CIGAR string in this code block.

For GFA 2.0, the "*" placeholder is used to denote a lack of an overlap CIGAR. GFA 1.0 doesn't specify a placeholder, just that the field is optional. I have always assumed that a GFA with no CIGAR in the L/E lines implies a non-overlap (i.e., a CIGAR of 0M).
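A minimal sketch of that assumption, in Python rather than minigraph's C: `parse_link` is a hypothetical helper, not part of minigraph or any GFA library. It treats an absent, empty, or "*" overlap field as 0M and ignores tags.

```python
def parse_link(line):
    """Parse a GFA1 L-line, defaulting a missing overlap to "0M"."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5 or fields[0] != "L":
        raise ValueError(f"not a valid L-line: {line!r}")
    # Assumption: an absent, empty, or "*" overlap field means "no overlap".
    overlap = fields[5] if len(fields) > 5 else ""
    if overlap in ("", "*"):
        overlap = "0M"
    return {
        "from": fields[1],
        "from_orient": fields[2],
        "to": fields[3],
        "to_orient": fields[4],
        "overlap": overlap,
    }
```

With this default, a five-field link line and the same line ending in an explicit 0M parse to the same record.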

Would it be possible to adjust the default condition so that the parser can handle all valid GFA 1 files?

edawson avatar Oct 13 '19 04:10 edawson

Duplicate of #1.

Minigraph is more for mapping to a reference graph. I don't think a reference graph should allow overlaps. Also, it is tricky to work with overlaps. It will take time to implement the feature. Minigraph may support overlaps in future, but that won't happen soon unfortunately.

lh3 avatar Oct 13 '19 15:10 lh3

I think these two issues are related but maybe not duplicates. The issue isn't the graph structure but its format in this case.

I have a graph with no overlaps (CIGAR 0M), but the CIGAR is not included in the Link lines:

L       2112999 +       2113002 +     

This is valid GFA1 and my graph is a reference graph (i.e., constructed from a reference genome backbone with reference-relative variation added). minigraph just refused to parse it because it expects the sixth field (CIGAR) to be present.

Edit: I agree though that reference graphs should not have overlaps. That would make everything a lot harder!

edawson avatar Oct 13 '19 18:10 edawson

Sorry for misreading your question. My initial intention was to require CIGAR because L-lines may have tags. Making CIGAR optional will complicate tags. In addition, 0M doesn't waste much space anyway. I didn't realize that GFA1 makes this field optional, which IMO is not ideal.

lh3 avatar Oct 22 '19 14:10 lh3

> Making CIGAR optional will complicate tags

Yes - I can file an issue on the spec to clarify whether "CIGAR optional" means no field present or CIGAR field == empty string. I have only ever had GFA files where, if tags are present, there is also a CIGAR. There's some room for clarification on that.
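To make the two readings concrete, here is a rough Python sketch of how a parser could tell them apart. The function name and both regular expressions are my own assumptions (the tag pattern follows the SAM-style TAG:TYPE:VALUE shape), not code taken from minigraph or the spec.

```python
import re

# Assumed patterns: a CIGAR (or "*") versus a SAM-style optional tag.
CIGAR_RE = re.compile(r"\*|([0-9]+[MIDNSHPX=])+")
TAG_RE = re.compile(r"[A-Za-z][A-Za-z0-9]:[AifZJHB]:.*")

def split_overlap_and_tags(extra):
    """Split the fields after the fifth L-line column into (overlap, tags).

    Returns overlap=None when the CIGAR field is absent altogether and
    overlap="" when the field is present but empty.
    """
    if not extra:
        return None, []                 # reading 1: no sixth field at all
    first, rest = extra[0], extra[1:]
    if first == "" or CIGAR_RE.fullmatch(first):
        return first, rest              # reading 2: field present, maybe empty
    if TAG_RE.fullmatch(first):
        return None, list(extra)        # tags start right after the fifth field
    raise ValueError(f"field is neither a CIGAR nor a tag: {first!r}")
```

A tag always contains a colon and a CIGAR never does, so the two cannot be confused. The cost is that the parser has to inspect the sixth field before it knows which column it is reading.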

I think a good fix would be to note in the README/docs:

  1. That input is GFA 1.0 (or rGFA),
  2. That CIGAR strings are required and overlaps (i.e., CIGAR != "0M") are discouraged, and
  3. That tags are permitted IFF a CIGAR is present.
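For anyone who wants to check a file against points 2 and 3 before running minigraph, something like the following Python sketch would do. `check_link` is a hypothetical helper and the messages are made up; it does not reproduce minigraph's own parsing or error output, and it does not attempt to verify point 1.

```python
import re

# Assumed CIGAR pattern; "*" is deliberately not accepted here.
CIGAR_RE = re.compile(r"([0-9]+[MIDNSHPX=])+")

def check_link(line):
    """Return a list of problems for one line under the proposed rules."""
    fields = line.rstrip("\n").split("\t")
    problems = []
    if fields[0] != "L":
        return problems  # only L-lines are checked
    sixth = fields[5] if len(fields) > 5 else None
    if sixth is None or not CIGAR_RE.fullmatch(sixth):
        problems.append("error: a CIGAR is required in the sixth field")
        # A colon in the sixth field suggests a tag was written without a CIGAR.
        if sixth is not None and ":" in sixth:
            problems.append("error: tags are only permitted after a CIGAR")
    elif sixth != "0M":
        problems.append("warning: overlaps (CIGAR != 0M) are discouraged")
    return problems
```

An empty list means the line satisfies both rules; any other line type passes through unchecked.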

I think that would be sufficient to keep users from falling into traps, without having to change the source code.

edawson avatar Oct 24 '19 15:10 edawson