gfatools
Defining Walk/W-lines in GFA1
Walk/W-lines are used to keep sequence walks through the graph. There are a few options:
1. Don't add W-lines as we already have Path/P-lines in GFA1. However, there are multiple problems with P-lines:
   a) The most important one is that it doesn't allow tags as the last field is optional. We need to attach information (see below) in addition to path names.
   b) Naming is not appropriate. Strictly speaking, a path disallows cycles.
   c) P-lines use comma as a field separator. Comma is a common symbol in reference names.
2. Define W-lines as:
   <W-line> <- 'W' <ctgId> <oriSeg>+ <tag>*
   <oriSeg> <- ( '>' | '<' ) <segId>
   We put all the other information in tags. Note that without a list of CIGARs, W-lines are unable to distinguish multiple edges between the same pair of segments. We may define an optional tag to keep the list of CIGARs for general cases.
3. Define W-lines as:
   <W-line> <- 'W' <sampleId> <ctgId> <oriSeg>+ <tag>*
4. Define W-lines as:
   <W-line> <- 'W' <sampleId> <ctgId> <ctgStart> <ctgEnd> <oriSeg>+ <tag>*
   With <ctgStart> and <ctgEnd>, a walk doesn't have to encode an entire contig.
Options 2, 3 and 4 differ only in which fields are mandatory. I have a mild preference for 4 (though I am not even sure); 2 and 3 are also fine with me. I strongly think P-lines are inadequate for a reference model. With an optional tag to keep a list of CIGARs, W-lines are more general. Internally, a GFA parser can parse both W-lines and P-lines into the same data structure.
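To make the option-4 grammar concrete, here is a minimal parsing sketch. It assumes fields are tab-separated and that the whole `<oriSeg>+` list sits in one column, as in the examples later in this thread; the names `Walk` and `parse_w_line` are hypothetical and not part of gfatools.

```python
import re
from typing import NamedTuple

# One oriented segment: '>' (forward) or '<' (reverse) followed by a segment ID.
# Assumption: segment IDs contain neither '>' nor '<'.
ORI_SEG = re.compile(r"([<>])([^<>]+)")

class Walk(NamedTuple):
    sample: str
    contig: str
    start: int
    end: int
    steps: list  # list of (orientation, segId) tuples
    tags: list   # raw tag strings, e.g. "SM:Z:foo"

def parse_w_line(line: str) -> Walk:
    """Parse one tab-separated option-4 W-line (a sketch, not a validating parser)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 6 or fields[0] != "W":
        raise ValueError("not an option-4 W-line")
    _, sample, contig, start, end, walk = fields[:6]
    steps = ORI_SEG.findall(walk)
    # The matches must cover the whole column; otherwise the walk is malformed.
    if not steps or "".join(o + s for o, s in steps) != walk:
        raise ValueError("malformed walk: " + walk)
    return Walk(sample, contig, int(start), int(end), steps, fields[6:])
```

Options 2 and 3 would drop the leading columns accordingly; the walk column and trailing tags are parsed the same way.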
CC @ekg and @benedictpaten.
First, notes on your objections to P lines. I don't disagree, but I think some of the things you're objecting to might not be issues.
Don't add W-lines as we already have Path/P-lines in GFA1. However, there are multiple problems with P-lines.
It won't be a huge burden to convert between P lines and W lines, but it is good to consider not adding a new object type here. The smaller the delta, the easier it is for VG model systems to adjust over.
a) the most important one is that it doesn't allow tags as the last field is optional. We need to attach information (see below) in addition to path names.
I'm not sure that this is an issue. The current standard for communicating if a path is circular or not is given by a tag: https://github.com/GFA-spec/GFA-spec/issues/23#issuecomment-495298035
I have not been very clear on the GFA spec though, so perhaps there is something I'm missing.
b) naming is not appropriate.
I typically prefix the path name with a sample name. Queries on path names can be prefix queries. For pure hierarchical collections, I don't think this is a problem, but perhaps we could use a specific field separator to indicate this first name division. The alternative is the more complex data model you are proposing for W lines, but it might be better to be specific.
In most cases these paths are directly corresponding to actual sequences in a FASTA file. Thus, we need to be able to define the same entities in the FASTA file (such as multi-part names) or this correspondence will often introduce pain for users. For instance, not every FASTA entry may have a reasonable sample name, and we'll need to add them based on some rule when they aren't available.
I would support a convention for naming and tagging W lines that can be shared in FASTA headers. This is the simplest and most stable option, given that FASTA files will typically be the source for the graphs, or be a linear projection that we make from the W lines.
Strictly speaking, a path disallows cycles.
I don't think this is an issue: https://github.com/GFA-spec/GFA-spec/issues/23#issuecomment-495298035
I admit that no one has implemented this. There is an equivalent that is implemented in the .vg/JSON format.
c) P-lines use comma as a field separator. Comma is a common symbol in reference names.
That's not a problem I'd considered, but is understandable.
I prefer 2:
Define W-lines as:
<W-line> <- 'W' <ctgId> <oriSeg>+ <tag>*
<oriSeg> <- ( '>' | '<' ) <segId>
As a convention (and possibly a standard), I think we should add the same tag patterns to the FASTA file, or extract them from the FASTA headers.
I would suggest allowing a tag that is the extended CIGAR (or some other representation of the alignment of the sequence to the graph). If this is an embedded path in the graph, or a pure walk, then this can be omitted. If the walk is not pure, then it should have a cigar to define the transformation. This should be as equivalent as possible with GAF format. We might want a standard way of converting the fields in GAF into W line tags.
I'm not a big fan of this, but it's no different than a rGFA + GAF collection. It allows us to keep the full alignments in the same file.
We have not had much use for the CIGARs in P lines in GFA, but we haven't ever been adding any paths to graphs that aren't purely embedded.
I'm not sure that this is an issue. The current standard for communicating if a path is circular or not is given by a tag: GFA-spec/GFA-spec#23 (comment)
It is an issue. If you want to add a tag, you have to put a long list of 0M,0M,...,0M on the line, or make the parser smart enough to distinguish 0M,0M,...,0M from a tag such as XY:Z:ABC. Neither solution is ideal. P-lines haven't been widely used. Now is a good time to fix this design flaw.
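The ambiguity can be illustrated with a small sketch of what a "smart" P-line parser would have to do with a field that could be either the optional overlaps column or the first tag. The two regular expressions are approximations of the GFA1 tag and overlap syntax, and `classify_p_field` is a hypothetical helper, not an existing API.

```python
import re

# Approximate shape of a GFA1 optional tag: two-character name, type letter, value.
TAG = re.compile(r"^[A-Za-z][A-Za-z0-9]:[AifZJHB]:")
# Approximate shape of the P-line overlaps column: '*' or comma-separated CIGARs.
CIGAR_LIST = re.compile(r"^(\*|(\d+[MIDNSHPX=])+(,(\d+[MIDNSHPX=])+)*)$")

def classify_p_field(field: str) -> str:
    """Guess whether a trailing P-line field is a tag or the overlaps column."""
    if TAG.match(field):
        return "tag"
    if CIGAR_LIST.match(field):
        return "overlaps"
    return "unknown"

print(classify_p_field("0M,0M,0M"))  # overlaps
print(classify_p_field("XY:Z:ABC"))  # tag
```

With W-lines the overlaps column does not exist, so every field after the walk is a tag and no such guessing is needed.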
I'm not a big fan of this, but it's no different than a rGFA + GAF collection. It allows us to keep the full alignments in the same file.
As I said in another thread, I think we should use two different formats for the graph and for the alignments. The graph is static after each release. Alignments are always dynamic. They do share certain properties. That is another reason why W-lines are preferred: the path encoding is the same as in GAF.
you have to put a long list of 0M,0M,...,0M on the line
Actually this is not correct. We can put a single * there, but W-lines are still preferred because, for a reference graph, we never need the extra field. The path encoding of W-lines is better in my view.
PS: also, * means "not available", but in this P-line case, we use * for "it is tedious to write that field out".
As I said in another thread, I think we should use two different formats for the graph and for the alignments. The graph is static after each release. Alignments are always dynamic.
Do we continue to use VG-GFA (specifically, S, L, P with no overlaps on Ls) for other uses where embedded paths are dynamic? My preference would be to rally around rGFA if it can provide the same functionality.
Presently, we can only represent perfectly embedded paths in VG-GFA due to the implementations not handling CIGARs. This hasn't been a problem for any use, because we can directly embed the entire sequence in the graph.
However, for the uses you are proposing for rGFA (progressive minimal construction, obtaining a pangenome spanning sequence set) it might be important to allow differences between the W lines and the graph. Otherwise, it won't be possible for the W lines to reconstruct actual sequences or provide correct coordinate spaces when considered only within the rGFA context. In light of this limitation, it would be good, I think, to allow for conversion between GAF and W lines. I can't see how it would hurt the system.
The idea behind this is that a Path/Walk is the same as an alignment. This is just a mathematical fact, and it only makes it easier to think about these things.
I hope I'm being clear. At this stage I'm just trying to bring forward these issues. I don't claim to know what's best going forward, but I do know what's worked and what's practical in terms of these models.
I am ok to add something like the cs tag in minimap2 (this tag is CIGAR+edits) to W-lines if that helps to encode graphs. However, I don't think we should push GFA to be an alignment format.
Ok, the consensus so far. The W-lines are defined by:
<W-line> <- 'W' <ctgId> <oriSeg>+ <tag>*
<oriSeg> <- ( '>' | '<' ) <segId>
New tags:
Tag | Type | Description |
---|---|---|
SM | Z | Sample |
SO | i | Offset on the contig |
LN | i | Walk length |
CS | Z | Edits to the walk (see the minimap2 manpage) |
OG | Z | Comma delimited list of overlap cigars |
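As a rough illustration of how these tags might be consumed, the sketch below converts a list of raw tag strings into a typed dictionary. It handles only the `Z` and `i` types used in the table; `parse_tags` and `split_overlaps` are hypothetical names.

```python
def parse_tags(fields):
    """Turn ['SM:Z:CHM1', 'SO:i:100', ...] into {'SM': 'CHM1', 'SO': 100, ...}."""
    tags = {}
    for field in fields:
        # Split on the first two colons only: a Z value may itself contain ':'.
        name, typ, value = field.split(":", 2)
        tags[name] = int(value) if typ == "i" else value
    return tags

def split_overlaps(og_value):
    """Split the comma-delimited OG value into individual overlap CIGARs."""
    return og_value.split(",")

tags = parse_tags(["SM:Z:CHM1", "SO:i:100", "LN:i:11", "CS:Z::5*AG:5"])
print(tags["SO"] + tags["LN"])  # 111, the end offset of the walk on the contig
```

Limiting the split to two colons matters for `CS`, whose value starts with a colon.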
This looks good to me. There is one more edge case (sorry I keep coming with these) and I want to verify that it will be supported: Can segId include an offset?
Can segId include an offset?
You can do that anyway, but I think it is better to put the offset in a separate tag (currently SO in rGFA). My experience is that encoding extra information in plain strings can be confusing and complicated.
Clarify on my previous comment:
You can do that anyway
You can do that anyway, but GFA parsers are likely to ignore the offset information in segId.
You can do that anyway, but GFA parsers are likely to ignore the offset information in segId.
This can be important for encoding split alignments. I'm just continuing to the end of the W line GFA line equivalence.
There are other ways to do it, but this one is pretty nice because of how explicit it is.
Another note. It'd be good if walk lines can be subsets of other walk lines. Defining this relationship in a standard way would allow producing subsets of graphs that remain valid.
On Thu, Jul 18, 2019, 17:31 Heng Li wrote:
Clarify on my previous comment:
You can do that anyway
You can do that anyway, but GFA parsers are likely to ignore the offset information in segId.
It'd be good if walk lines can be subsets of other walk lines. Defining this relationship in a standard way would allow producing subsets of graphs that remain valid.
The better solution is to put the contig offset in the SO tag on W-lines. Let the intervals on the stable sequences define subsetting/nesting. This is also why I prefer option 4 over 2.
After a second thought, I prefer option 4. With option 4, we can write:
W chr1 0 100 >s1>s2
W chr1 100 200 >s3>s4
It is obvious that a walk describes a subsequence on chr1 in this case. With option 2, we have to write:
W chr1 >s1>s2 SO:i:0
W chr1 >s3>s4 SO:i:100
In this second format, "chr1" appears to be the identifier of both W-lines, but it is really not a walk identifier. I imagine subgraphs will be used a lot. Option 4 (with or without <sampleID>) will be preferred in future.
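With explicit intervals, subsetting and nesting reduce to interval arithmetic on the stable sequence. The sketch below assumes 0-based, half-open coordinates (consistent with `0 100` followed by `100 200`) and the variant without `<sampleId>`; the function names are hypothetical.

```python
def parse_option4(line):
    """Split a whitespace-separated 'W <ctgId> <ctgStart> <ctgEnd> <walk>' line."""
    rec, ctg, start, end, walk = line.split()[:5]
    if rec != "W":
        raise ValueError("not a W-line")
    return ctg, int(start), int(end), walk

def contains(outer, inner):
    """True if inner's interval lies within outer's interval on the same contig."""
    return outer[0] == inner[0] and outer[1] <= inner[1] and inner[2] <= outer[2]

def adjacent(a, b):
    """True if b starts exactly where a ends on the same contig."""
    return a[0] == b[0] and a[2] == b[1]

a = parse_option4("W chr1 0 100 >s1>s2")
b = parse_option4("W chr1 100 200 >s3>s4")
print(adjacent(a, b))  # True: the two walks tile chr1:0-200
print(contains(a, b))  # False: b is not nested inside a
```

The same checks under option 2 would first need each walk's length (for example from an `LN` tag) to recover the end coordinate.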
A question for you: how will you store a complete graph in GFA (not necessarily rGFA)? The most straightforward way is to split segments into tiny pieces and stick them together with W-lines. For example (I will use option 4 for demo):
S s1 CAGTA
S s2 A
S s3 G
S s4 TTGAC
W GRCh38 chr1 0 11 >s1>s2>s4
W GRCh37 chr1#37 0 11 >s1>s3>s4
This works for arbitrary GFA. If the graph can be encoded with rGFA:
S s1 CAGTA SN:Z:chr1 SO:i:0
S s2 A SN:Z:chr1 SO:i:5
S s3 G SN:Z:chr1#37 SO:i:5
S s4 TTGAC SN:Z:chr1 SO:i:6
W GRCh37 chr1#37 0 11 >chr1:0-5>chr1#37:5-6>chr1:6-11
The rGFA path encoding is longer in this example, but is likely to be shorter when you keep many samples. In that case, we can use >chr1:1000-2000 to represent a list of segments like >s100>s101>s102>s103.
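The collapse from a segment list to a stable-coordinate interval can be sketched as follows. The code assumes each segment's `SN`, `SO` and length are known, and it only merges consecutive forward-strand steps that are contiguous on the same stable sequence; reverse-strand merging is left out. `to_stable_walk` is a hypothetical helper.

```python
def to_stable_walk(steps, segs):
    """steps: list of (orientation, segId); segs: segId -> (SN, SO, length)."""
    merged = []  # each entry: [orientation, SN, start, end]
    for ori, seg_id in steps:
        sn, so, length = segs[seg_id]
        last = merged[-1] if merged else None
        # Extend the previous interval when this forward step continues it.
        if last and ori == ">" and last[0] == ">" and last[1] == sn and last[3] == so:
            last[3] = so + length
        else:
            merged.append([ori, sn, so, so + length])
    return "".join(f"{o}{sn}:{s}-{e}" for o, sn, s, e in merged)

# Segments from the rGFA example above: segId -> (SN, SO, sequence length).
segs = {"s1": ("chr1", 0, 5), "s2": ("chr1", 5, 1),
        "s3": ("chr1#37", 5, 1), "s4": ("chr1", 6, 5)}
print(to_stable_walk([(">", "s1"), (">", "s3"), (">", "s4")], segs))
# >chr1:0-5>chr1#37:5-6>chr1:6-11
print(to_stable_walk([(">", "s1"), (">", "s2"), (">", "s4")], segs))
# >chr1:0-11
```

The second call shows where the saving comes from: a walk that follows the reference collapses to a single interval regardless of how finely the reference is segmented.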
In both formats, there will be tens of millions of segments and links in the graph (likely hundreds of millions in future), and W-lines will take a lot more space than the bare graph topology. Another possibility is to use edits:
S s1 CAGTAATTGAC
W GRCh38 chr1 0 11 >s1
W GRCh37 chr1#37 0 11 >s1 CS:Z::5*AG:5
W CHM1 foo 0 11 >s1 CS:Z::5*AG:5
This format will be shorter, but it gives us two sets of syntax to represent the same thing. Another issue is that the variant "G" allele is not named.
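To see that the edit-based encoding and the segment-based encoding describe the same sequence, the sketch below applies a `CS` string to the segment sequence and compares the result with the concatenation of `s1`, `s3` and `s4` from the earlier example. It supports only the short-form operations `:N`, `*xy`, `+seq` and `-seq` from the minimap2 cs tag, and `apply_cs` is a hypothetical helper.

```python
import re

# Short-form cs operations: match run, substitution, insertion, deletion.
CS_OP = re.compile(r":(\d+)|\*([A-Za-z])([A-Za-z])|\+([A-Za-z]+)|-([A-Za-z]+)")

def apply_cs(ref, cs):
    """Apply cs edits to ref and return the edited sequence."""
    out, pos, idx = [], 0, 0
    for m in CS_OP.finditer(cs):
        if m.start() != idx:
            raise ValueError("unparsed cs text at offset %d" % idx)
        idx = m.end()
        if m.group(1) is not None:    # :N, copy N matching bases
            n = int(m.group(1))
            out.append(ref[pos:pos + n])
            pos += n
        elif m.group(2) is not None:  # *xy, replace ref base x with y
            if ref[pos].upper() != m.group(2).upper():
                raise ValueError("cs does not match the reference")
            out.append(m.group(3))
            pos += 1
        elif m.group(4) is not None:  # +seq, insertion relative to ref
            out.append(m.group(4))
        else:                         # -seq, deletion from ref
            pos += len(m.group(5))
    if idx != len(cs):
        raise ValueError("trailing unparsed cs text")
    return "".join(out)

edited = apply_cs("CAGTAATTGAC", ":5*AG:5")
print(edited)                             # CAGTAGTTGAC
print(edited == "CAGTA" + "G" + "TTGAC")  # True, same as the walk >s1>s3>s4
```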
Once you've settled on a syntax for the walk W records, could you please open a PR to add it to both GFA1 and GFA2 in https://github.com/GFA-spec/GFA-spec?
@sjackman For now, I am not sure about the best way to encode a dense graph with W-lines. We should not add half-baked features into the official spec. We will issue a PR when W-lines are finalized.
@lh3 sorry, I missed your last prompt about W lines. I should reply carefully, but I think it makes sense to consider using W lines to build up a semi-compressive scheme over the haplotype/path set. It seems this is what you are alluding to.
@ekg what exactly are the W-lines you have in mind?