ga4gh-schemas icon indicating copy to clipboard operation
ga4gh-schemas copied to clipboard

graph subsetting operations and serialization format

Open ekg opened this issue 9 years ago • 29 comments

I'd like to query a part of the graph using coordinates against an arbitrary reference path encoded in it.

I have a implementation of this kind of feature that can serve as a conceptual starting point. In vg you'd do this like vg find -n 314159 -c 10 x.vg >q.vg, and q.vg would then contain a subgraph with all nodes and edges reached in a breadth first search of 10 steps starting at node 314159. I use a context size defined in nodes rather than edges because it is simpler in many situations, and in practice I construct graphs with a maximum node size of 50bp using vg construct ... -m 50, but it seems possible to do this using an actual base distance rather than a number of steps in a BFS. You can also do vg find -p chr19:100-200 x.vg to get the graph overlapping a particular range in any named path that has been stored in the graph (in this case chr19). vg find can also take a specific kmer and return the subgraph associated with that kmer (which may not be fully-connected).

There are other typical operations that I've found useful when generating subsets of variation graphs, which are perhaps best summarized by the interface to vg find:

usage: vg find [options] <graph.vg> >sub.vg
options:
    -n, --node ID          find node, return 1-hop context as graph
    -f, --edges-from ID    return edges from node with ID
    -t, --edges-to ID      return edges from node with ID
    -k, --kmer STR         return a graph of edges and nodes matching this kmer
    -T, --table            instead of a graph, return a table of kmers
    -c, --context STEPS    expand the context of the kmer hit subgraphs
    -s, --sequence STR     search for sequence STR using --kmer-size kmers
    -j, --kmer-stride N    step distance between succesive kmers in sequence (default 1)
    -z, --kmer-size N      split up --sequence into kmers of size N
    -C, --kmer-count       report approximate count of kmer (-k) in db
    -p, --path TARGET      find the node(s) in the specified path range TARGET=path[:pos1[-pos2]]
    -P, --position-in PATH find the position of the node (specified by -n) in the given path
    -r, --node-range N:M   get nodes from N to M
    -d, --db-name DIR      use this db (defaults to <graph>.index/)

My point is that the natural way to interrogate the graph involves queries over kmers, sequences, path positions, node ids, edge ids, and node id ranges. The return from such a query should be a subgraph, and ideally one with a 1:1 mapping to components in the source graph.

Things are pretty straightforward in variation graphs (as in vg). These are collections of three types of objects: (1) nodes with sequences, (2) edges between node ends representing allowed linkages, and (3) paths over nodes and edges which define coordinate systems and names. Making subsets of this structure is trivial. You just collect the nodes, connected edges, and path annotations to create a subgraph that is a strict subset of the components of the original graph.

But how do we generate subgraphs of a side graph?

There seem to be some conceptual hurdles to generating subsets from the side graph model. Useful side graph subsets are not directly equivalent to subsets of the components original side graph. If nodes are long (and some in the side graph are effectively as long as whole chromosomes) then we need to generate new synthetic nodes that start and end within those from the graph we are querying.

  1. As joins are defined with node-relative coordinates, then the entire coordinate system within the subgraph needs to be modified relative to the original graph in order to maintain a consistent structure.
  2. Alternatively, we could maintain offsets relative to the original nodes in our subgraph representation to allow mapping back into the original coordinate space.
  3. Yet another possibility would be to define the subset as a side graph built on the original graph. This would seem to introduce new problems by making it impossible to relate sides within this directly back to the original graph.

These are not insignificant issues. We can pick a solution among these and it will work. However, it will add nontrivial complexity to the model. Subsetting the graph is just about the most basic thing you can do, and it should not present a conceptual hurdle to users and developers.

This is not just a theoretical critique. Some months ago I began to implement (2) but stopped as it required the addition of additional fields defining relative offsets to "real" nodes to each component of the graph. The semantics of the graph would then change whether or not a node was in "offset" space. The coordination of these offsets seemed extremely likely to generate bugs. (As described above, I resolved the problem of graph subsets by breaking up the graph into pieces of a consistent maximum size.)

The side graph makes great sense to me as a way to define a coordinate system, but very little sense as the basis for the data model that's used to store, query, and transmit the graph.

There is a reasonable alternative. We can maintain a side graph coordinate system to provide distinct names for all the sequences and paths in the graph. However, this can be a purely virtual thing, in that we can allow any underlying graph structure to be returned from queries. Provided the nodes and edges in a sequence/variation graph are annotated with their coordinates in the side graph, we will get all the benefits of the side graph without the awkwardness of defining standard algorithms to manipulate it.

ekg avatar Apr 02 '15 13:04 ekg

I think you're missing an option (4) in your list, which is to just work in the coordinate space of the original graph, even when looking only at a subgraph. The subgraph isn't made out of the same types as the full graph (since e.g. two regions of the same Sequence can't be different Sequences in the subgraph), but do regions and reference sets really need to support the same operations?

My proposed subgraphRadius() method in #269 just returns the Sequences and Joins that are touched by the subgraph you are asking about. So if part of a Sequence is involved in your subgraph, you get the full Sequence record (which isn't any larger than a Segment would be).

Since that makes it a bit difficult to know what part of the Sequences you get back was actually in the range you were asking about (without walking the returned graph again yourself), we might want a type that defines a subgraph, consisting of multiple Segments (in the coordinate space of the underlying full graph) and some Joins.

adamnovak avatar Apr 02 '15 18:04 adamnovak

Conceptually, I more like to think a side graph as a graph with each base as a node. Each node/base in this graph has a unique and persistent label such as chr1:1012 (perhaps plus orientation). When we do a vg-like traversal, the command line would be sg find -s chr1:1012:5 -c 10. This means we start from the 5'-end of base chr1:1012 and extend over at most 10 bifurcations with BFS. Or we may replace -c 10 with -l 5 to extend at most 5bp. Suppose we have a SNP at chr1:1014. With (2), the output should be:

S={"chr1_1012_1017":AGCCG,"snp1":T}
J={<"snp1:0:5","chr1_1012_1017:1:3">,<"snp1:0:3","chr1_1012_1017:3:5">}

Here as this 5bp sequence is named as chr1_1012_1017, each base on it has the same label as the reference graph.

@adamnovak's solution (4) may be better (his comment just popped up). We can change the definition of S a little and get such a subgraph:

S={(chr1,1012,AGCCG),(snp1,0,T)}
J={<"snp1:0:5","chr1:1013:3">,<"snp1:0:3","chr1:1015:5">}

Now S is a set of 3-tuples (name,offset,sequence). (S,J) describes the local topology.

EDIT: simplified the example a little bit.

lh3 avatar Apr 02 '15 19:04 lh3

In addition, @ekg, how can we know node 314159 in vg is a reference or a named allele? How should we specify the reference coordinate to search its local topology? When we add new variants or merge vg graphs, will node 314159 be split? i.e., is node immutable with incremental graph construction?

lh3 avatar Apr 02 '15 19:04 lh3

On Thu, Apr 2, 2015 at 12:21 PM, Heng Li [email protected] wrote:

Conceptually, I more like to think a side graph as a graph with each base as a node. Each node/base in this graph has a unique and persistent label such as chr1:1012 (perhaps plus orientation). When we do a vg-like traversal, the command line would be sg find -s chr1:1012:5 -c 10. This means we start from the 5'-end of base chr1:1012 and extend over at most 10 bifurcations with BFS. Or we may replace -c 10 with -l 1000 to extend at most 1000bp. Suppose we have a SNP at chr1:1512. With (2), the output should be:

S={"chr1_1012_2012":1kb-sequence,"snp1":T} J={<"snp1:0:5","chr1_1012_2012:499:3">,<"snp1:0:3","chr1_1012_2012:501:5">}

Here as this 1kb-sequence is named as chr1_1012_2012, each base on it has the same label as the reference graph.

@adamnovak https://github.com/adamnovak's solution (4) may be better (his comment just popped up). We can change the definition of S a little and get such a subgraph:

S={(chr1,1012,1kb-sequence),(snp1,0,T)} J={<"snp1:0:5","chr1:1511:3">,<"snp1:0:3","chr1:1513:5">}

Now S is a set of 3-tuples (name,offset,sequence). (S,J) describes the local topology.

Yes, I like Adam's suggestion too - a subgraph in a sidegraph is a set of segments and a set of joins, such that all the joins connect positions within the set of segments (no edges to nowhere). Adam, I suggest changing your pull request to reflect that.

— Reply to this email directly or view it on GitHub https://github.com/ga4gh/schemas/issues/275#issuecomment-89015383.

benedictpaten avatar Apr 03 '15 01:04 benedictpaten

This sounds really sensible.

On Thu, Apr 2, 2015 at 6:01 PM, Benedict Paten [email protected] wrote:

On Thu, Apr 2, 2015 at 12:21 PM, Heng Li [email protected] wrote:

Conceptually, I more like to think a side graph as a graph with each base as a node. Each node/base in this graph has a unique and persistent label such as chr1:1012 (perhaps plus orientation). When we do a vg-like traversal, the command line would be sg find -s chr1:1012:5 -c 10. This means we start from the 5'-end of base chr1:1012 and extend over at most 10 bifurcations with BFS. Or we may replace -c 10 with -l 1000 to extend at most 1000bp. Suppose we have a SNP at chr1:1512. With (2), the output should be:

S={"chr1_1012_2012":1kb-sequence,"snp1":T}

J={<"snp1:0:5","chr1_1012_2012:499:3">,<"snp1:0:3","chr1_1012_2012:501:5">}

Here as this 1kb-sequence is named as chr1_1012_2012, each base on it has the same label as the reference graph.

@adamnovak https://github.com/adamnovak's solution (4) may be better (his comment just popped up). We can change the definition of S a little and get such a subgraph:

S={(chr1,1012,1kb-sequence),(snp1,0,T)} J={<"snp1:0:5","chr1:1511:3">,<"snp1:0:3","chr1:1513:5">}

Now S is a set of 3-tuples (name,offset,sequence). (S,J) describes the local topology.

Yes, I like Adam's suggestion too - a subgraph in a sidegraph is a set of segments and a set of joins, such that all the joins connect positions within the set of segments (no edges to nowhere). Adam, I suggest changing your pull request to reflect that.

— Reply to this email directly or view it on GitHub https://github.com/ga4gh/schemas/issues/275#issuecomment-89015383.

— Reply to this email directly or view it on GitHub https://github.com/ga4gh/schemas/issues/275#issuecomment-89100838.

haussler avatar Apr 03 '15 03:04 haussler

I think you're missing an option (4) in your list, which is to just work in the coordinate space of the original graph, even when looking only at a subgraph.

@adamnovak, it seems this is the same as my:

(2) Alternatively, we could maintain offsets relative to the original nodes in our subgraph representation to allow mapping back into the original coordinate space.

I'm starting from the same premise, but focusing on the fact that we need a way to maintain the same coordinate space even if we are working on a tiny fragment which is not strictly identical to any piece of the larger graph.

This introduces some practical problems, such as when I align something to the subgraph, I'll need to transform its mapping coordinates into the global space to put it in the context of the entire graph in my particular version of the reference. Also, it makes merging subgraphs from the same system cumbersome. If I only ever work with whole nodes in my subsets, then I can just take the union of nodes and edges (using internal ids as identifiers) in a set of graphs and make a larger one that is strictly part of the source graph.

@lh3 answered this concern by noting that nodes could be single bases. If so, then we could say we are working with a true subset. There are some practical reasons not to force nodes to be single bases, but in the long term they may tend to become smaller. If we collapse successive bases into single nodes when there are no bifurcations, we see about 11bp per node in 1000Gp3 + GRCh37. In other species with large effective population sizes (such as Anopheles), we already have higher rates of variation and consequently smaller nodes in the graph.

The subgraph isn't made out of the same types as the full graph (since e.g. two regions of the same Sequence can't be different Sequences in the subgraph), but do regions and reference sets really need to support the same operations?

Yes, they should support the same operations, and they should have the same serialization format. Having different models will limit the kinds of synergies that are possible. For example, if the two are the same, then I can extract a portion of a graph, write it to another index, and then treat it like a new reference without any problem. If they are different, then we need conversions between the two models for this to be possible.

ekg avatar Apr 03 '15 10:04 ekg

@lh3 asked:

how can we know node 314159 in vg is a reference or a named allele?

(In vg) we can know if it's a named allele if it resides in any paths. A path is just a series of nodes and edges. Each node can have many (or no) names. Named paths are effectively annotations on nodes.

As I understand, paths of this form can be equivalent to sides. I can encode a side graph coordinate system in them without any change to the underlying graph in order to maintain a common coordinate space with other graphs.

When we add new variants or merge vg graphs, will node 314159 be split? i.e., is node immutable with incremental graph construction?

A node id is not meant to be something stable or canonical. It's just an internal thing that may be useful for working on a particular version of the reference system. If new things are added, then it is likely to change. I apologize that this might have made things confusing in this example.

Although it is not required, in my implementation the node id is exploited as a partially-ordered coordinate. This is extremely helpful in a number of situations and has a natural relationship to the coordinate systems that have been discussed. But, the rank in the partial order can't be a stable identifier if the graph changes.

ekg avatar Apr 03 '15 11:04 ekg

What I like about side graphs is that the components are stable as you add more variation. We can all refer to well-known sequences which we call References. In Erik's vg graphs his nodes are not stable, they can be split. He gets stability in his paths. i.e. for him References are paths. As he says, the nodes are internal. If Erik restricts himself to non-overlapping Reference paths then the models are very similar. Erik can keep an index for each of his nodes as to which path it is in, with which offset, and hence provide a side graph interface by returning these not his nodes, and can read in side graphs and split them into his internal nodes keeping the longer side sequences as paths.

I don't see Erik's problems with subgraph operations on side graphs. There are two solutions, either (1) having offsets as Adam and Heng propose, or (2) just returning a true subgraph including whole chromosome objects.

(1) I do quite like the suggestion of sequences in the side graphs having an offset, i.e. being a segment. I think this corresponds to what Heng is saying. If we go this way, why not extend this to all sequences? From memory Adam added something like this to the definition of Reference, so he could refer to a segment of a chromosome in a local ReferenceSet. This allows coordinates to be stable within subgraph primary sequence objects that do not extend the length of the whole chromosome.

(2) The other alternative is to return as a subgraph of a side graph a true subset of sequences and joins. This would mean that you get a sequence object for the whole primary reference chromosome when you ask for a local subgraph around a locus. But so what? It is just a structure, the same size as a sequence object for a SNP. You don't need to get the entire 200 million base sequence corresponding to the object. In this world all radius/diameter operations would be in base units. I would not allow the -c option, which would get all the SNPs all the way along the whole chromosome. I think this is natural and correct - what you care about is base distance context/size, not graph complexity. Why care about how many graph steps you are away? This way combining subgraphs is natural - you just take the union of the set of sequences and the union of the set of joins. When you want to realise the base strings you could either be lazy and get whole chromosomes, or be a bit more sophisticated and keep track of the window of interest and retrieve just the segment of interest and keep track of coordinates as in (1), but this would be a client choice. Erik's client working internally in vg would support the local representation.

I slightly prefer (2) as conceptually simpler. Seen another way, (1) exposes some internal workings that are more complex and not strictly necessary.

Finally, Erik's point about the nodes having a natural order is important. For him this gives the partial order on his graph. For the side graph we also wanted this to give unique natural paths back to primary references. For this reason, I would require sequences in a graph to have a positive integer identifier, to enforce this order.

On 3 Apr 2015, at 12:03, Erik Garrison [email protected] wrote:

@lh3 https://github.com/lh3 asked:

how can we know node 314159 in vg is a reference or a named allele?

(In vg) we can know if it's a named allele if it resides in any paths. A path is just a series of nodes and edges. Each node can have many (or no) names. Named paths are effectively annotations on nodes.

As I understand, paths of this form can be equivalent to sides. I can encode a side graph coordinate system in them without any change to the underlying graph in order to maintain a common coordinate space with other graphs.

When we add new variants or merge vg graphs, will node 314159 be split? i.e., is node immutable with incremental graph construction?

A node id is not meant to be something stable or canonical. It's just an internal thing that may be useful for working on a particular version of the reference system. If new things are added, then it is likely to change. I apologize that this might have made things confusing in this example.

Although it is not required, in my implementation the node id is exploited as a partially-ordered coordinate. This is extremely helpful in a number of situations and has a natural relationship to the coordinate systems that have been discussed. But, the rank in the partial order can't be a stable identifier if the graph changes.

— Reply to this email directly or view it on GitHub https://github.com/ga4gh/schemas/issues/275#issuecomment-89254418.

The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

richarddurbin avatar Apr 03 '15 12:04 richarddurbin

@ekg your (2) and @adamnovak's (4) are different. In my understanding, you offset everything, including sequences, joins and the coordinate system. You need to apply offset everywhere when we translate between the whole graph and the subgraph. @adamnovak just marks a sequence in the subgraph as a subsequence of the whole graph. We don't need to offset joins and most importantly we use the same coordinate system. @richarddurbin wants to go a step further by even not specifying the offset.

@richarddurbin When we retrieve a subgraph, we would usually like to retrieve subsequences included in this subgraph. Having an offset field will make this easier. Of course you can retrieve subsequence later with your proposal but getting the coordinates needs extra work.

lh3 avatar Apr 03 '15 13:04 lh3

@lh3 I think there is a misunderstanding. I'm saying that if you have a serialized version of a subset the side graph, you'll want to specify the source offset in the original graph, or it won't be possible to project things from the subset back into the general graph coordinate space. This offset effectively propagates across all the entities in the subgraph, whether you record it everywhere or not. I can't use the original coordinates on a subset of the graph any more than we can use the whole-genome relative coordinates when working with substrings from the reference genome. So as far as I can tell, this is in principle exactly the same as what @adamnovak is suggesting.

The difference is one of perspective. In practice "just marking the sequence as a subsequence" must be accounted for within each method that works with the graph entity. This adds complexity without obvious benefit. Now every element would need to have an optional offset field, or we need to have two different types of elements, one with offsets and one without.

ekg avatar Apr 03 '15 14:04 ekg

@richarddurbin observes that

If Erik restricts himself to non-overlapping Reference paths then the models are very similar.

It seems there is a conflict here because I would like to encode many different overlapping coordinate systems in the graph. A simple case would be two canonical reference sequences. Encoding both in the same context is beneficial because it means that everything on the graph can be surjected into either reference. A more complex situation would be an overlapping set of references relative to exons, gene starts, and other genomic features. Both of these needs can be met provided we have named paths with annotations. I've implemented this in vg, and I've been discussing the correct way to extend GFA format to support it with @lh3 and @jts.

How can I encode this in a side graph? It does not seem possible without adding another annotation mechanism. As far as I can tell, such a mechanism would be equivalent to vg paths, which are sufficient to define any number of virtual side graphs on a sequence graph.

The current approach mixes the coordinate system and the graph representation. The result is less flexible than if we decoupled them.

ekg avatar Apr 03 '15 14:04 ekg

@lh3 The only sequences I can see as a problem for obtaining the full sequence with my proposal are the well-known primary chromosome references, and I am happy to cache them locally. Then surely my proposal is simplest, with no requirement for coordinate offsets or path mapping.

@ekg You are right that to map between ReferenceSets on the same graph you need to remap to a new set of paths. You have implemented path mapping, which is powerful, but I suggest that this is not simpler than having an offset on sequence segments as @lh3 and @adamnovak propose. In any case, I still think that given you have this, mapping between your representation and side graphs is natural.

On 3 Apr 2015, at 14:57, Heng Li [email protected] wrote:

@ekg https://github.com/ekg your (2) and @adamnovak https://github.com/adamnovak's (4) are different. In my understanding, you offset everything, including sequences, joins and the coordinate system. You need to apply offset everywhere when we translate between the whole graph and the subgraph. @adamnovak https://github.com/adamnovak just marks a sequence in the subgraph as a subsequence of the whole graph. We don't need to offset joins and most importantly we use the same coordinate system. @richarddurbin https://github.com/richarddurbin wants to go a step further by even not specifying the offset.

@richarddurbin https://github.com/richarddurbin When we retrieve a subgraph, we would usually like to retrieve subsequences included in this subgraph. Having an offset field will make this easier. Of course you can retrieve subsequence later with your proposal but getting the coordinates needs extra work.

— Reply to this email directly or view it on GitHub https://github.com/ga4gh/schemas/issues/275#issuecomment-89297152.

The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

richarddurbin avatar Apr 06 '15 07:04 richarddurbin

I would agree with @richarddurbin https://github.com/richarddurbin that because of its simplicity, the "simple subgraph" of a side graph that is merely a subset of the sequences and a subset of the joins will be very popular. As he points out, you can combine these by taking a simple union. Sure they include some extra bases due to the fact that you always have whole sequences, not just the parts you need, but this small drawback is outweighed by their simplicity. I think we should definitely support these, because it is inevitable that they will be popular.

@ekg https://github.com/ekg is bringing up other very important issues that we should discuss. These revolve around operations that go back and forth between paths and subgraphs. He is right that another important type of subgraph, commonly used in graph manipulations, is the the minimal subgraph supporting a given set of paths. This could also be given as a simple subgraph (i.e. set of sequences and joins), allowing as above this subgraph structure to also contain extra bases as needed so as not to break sequences.

However, @ekg https://github.com/ekg points out another issue, which is the need for an operation to transform from one set of reference sequences to another, either virtually or literally. He mentions conversion from one reference version to another. I also ran into this need when working with canonical forms and isomorphism. Starting with a given side graph, and a set of disjoint paths on that side graph, one would like to be able to define a new side graph in which the roles of sequences and paths were swapped: what originally were paths become sequences, and what originally were sequences become paths. Note: here we must be able to convert explicit joins to implicit joins and vice-versa. I think somebody will inevitably want to support this too, and efficiency will perhaps be an issue.

The deeper issue, I think also broached by @ekg https://github.com/ekg, is what to do when the paths are not disjoint, but rather share a base (i.e. node). If you require that what was originally a path become a sequence, this forces a split. When a node is split in a graph, all its adjacent edges are copied, and when two nodes are merged, any pair of repeated edges that are produced are merged as well. These operations are not recursive. We could imagine somebody might be supporting them on side graphs at some time in the future, and we might want to make their job easier. When is the time to discuss this?

-D

On Mon, Apr 6, 2015 at 12:39 AM, Richard Durbin [email protected] wrote:

@lh3 The only sequences I can see as a problem for obtaining the full sequence with my proposal are the well-known primary chromosome references, and I am happy to cache them locally. Then surely my proposal is simplest, with no requirement for coordinate offsets or path mapping.

@ekg You are right that to map between ReferenceSets on the same graph you need to remap to a new set of paths. You have implemented path mapping, which is powerful, but I suggest that this is not simpler than having an offset on sequence segments as @lh3 and @adamnovak propose. In any case, I still think that given you have this, mapping between your representation and side graphs is natural.

On 3 Apr 2015, at 14:57, Heng Li [email protected] wrote:

@ekg https://github.com/ekg your (2) and @adamnovak < https://github.com/adamnovak>'s (4) are different. In my understanding, you offset everything, including sequences, joins and the coordinate system. You need to apply offset everywhere when we translate between the whole graph and the subgraph. @adamnovak https://github.com/adamnovak just marks a sequence in the subgraph as a subsequence of the whole graph. We don't need to offset joins and most importantly we use the same coordinate system. @richarddurbin https://github.com/richarddurbin wants to go a step further by even not specifying the offset.

@richarddurbin https://github.com/richarddurbin When we retrieve a subgraph, we would usually like to retrieve subsequences included in this subgraph. Having an offset field will make this easier. Of course you can retrieve subsequence later with your proposal but getting the coordinates needs extra work.

— Reply to this email directly or view it on GitHub < https://github.com/ga4gh/schemas/issues/275#issuecomment-89297152>.

The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

— Reply to this email directly or view it on GitHub https://github.com/ga4gh/schemas/issues/275#issuecomment-89960344.

haussler avatar Apr 06 '15 16:04 haussler

I guess I am still confused about the design decisions forced by side graphs. If a labeled sequence graph (nodes have sequences/labels, edges are linkages) with named paths is enough to provide a general, stable coordinate system, then what are we gaining by introducing the novel side graph concept?

The former is extremely easy to explain and has an existing open source implementation which satisfies the basic requirements driving this group's work. It also has a deep relationship with a large body of work on assembly graphs and multiple sequence alignments. The latter has no existing implementation and it remains to be seen if it is a sensible way to arrange things.

Side graphs seem to complicate the process of adding multiple coordinate systems to the same sequence graph. They also appear to make it more complex to extract regions from the graph in that the resulting subgraphs are not semantically equivalent to subsets of the original side graph. However, they do seem to be an ideal way to generate the names of paths in a labeled graph, which is something that has not been done before.

In summary, I'm skeptical that we should use side graphs as a literal interchange format.

Perhaps we should table this discussion and come back to it when there is a working side graph implementation that we can use to align genomes. I suspect the process of implementing one will explain a lot of my concerns in a more convincing way than I can here. If it can be shown that they are a more effective way to work with pan-genome representations, I will be happy to change what I am doing to match. We have to convince a community to follow this effort, and I don't think it will if we aren't proposing something sensible.

ekg avatar Apr 07 '15 18:04 ekg

@richarddurbin / @lh3 / @anovak / @haussler - I think it would be better to formulate a sub-graph of a side graph in practice as a set of segments and joins, rather than a set of sequences and joins.

The pros of this are:

(1) For a given position + radius query the client gets back exactly the subsequences within the radius - there is no need to recompute this on the client.

(2) We can use segment+join subgraphs to describe annotations, such as exons or introns, that include the variation that's present in the population. The sequence+join subgraph, which does not bound the subsequences, does not make it easy to express this.

The cons (as I see it) are: A small increase in the amount of data that needs to be transferred between server and client, because you need to specify the additional subsequence data.

Did I miss anything compelling as to why we wouldn't make the job of the client a little easier?

On Mon, Apr 6, 2015 at 12:38 AM, Richard Durbin [email protected] wrote:

@lh3 The only sequences I can see as a problem for obtaining the full sequence with my proposal are the well-known primary chromosome references, and I am happy to cache them locally. Then surely my proposal is simplest, with no requirement for coordinate offsets or path mapping.

@ekg You are right that to map between ReferenceSets on the same graph you need to remap to a new set of paths. You have implemented path mapping, which is powerful, but I suggest that this is not simpler than having an offset on sequence segments as @lh3 and @adamnovak propose. In any case, I still think that given you have this, mapping between your representation and side graphs is natural.

On 3 Apr 2015, at 14:57, Heng Li [email protected] wrote:

@ekg https://github.com/ekg your (2) and @adamnovak < https://github.com/adamnovak>'s (4) are different. In my understanding, you offset everything, including sequences, joins and the coordinate system. You need to apply offset everywhere when we translate between the whole graph and the subgraph. @adamnovak https://github.com/adamnovak just marks a sequence in the subgraph as a subsequence of the whole graph. We don't need to offset joins and most importantly we use the same coordinate system. @richarddurbin https://github.com/richarddurbin wants to go a step further by even not specifying the offset.

@richarddurbin https://github.com/richarddurbin When we retrieve a subgraph, we would usually like to retrieve subsequences included in this subgraph. Having an offset field will make this easier. Of course you can retrieve subsequence later with your proposal but getting the coordinates needs extra work.

— Reply to this email directly or view it on GitHub < https://github.com/ga4gh/schemas/issues/275#issuecomment-89297152>.

The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

— Reply to this email directly or view it on GitHub https://github.com/ga4gh/schemas/issues/275#issuecomment-89960344.

benedictpaten avatar Apr 07 '15 18:04 benedictpaten

@benedictpaten What are the benefits of using side graphs as a serialization format?

What can we do with a side graph that we can't with a labeled graph where labels may be sequences, names, or other annotations?

The labeled graph does not have the restriction of being hierarchical. This is a fine property for a coordinate system but I think we should be cautious about making the side graph the basic currency of the system because this will limit applications. It may also add complexity to the processing of the reference system and subsets of it.

ekg avatar Apr 07 '15 20:04 ekg

Hi Gents,

I seem to be getting looped in as anovak (Anthony Novak), so Adam Novak may not be getting these notifications. Cheers

anovak avatar Apr 07 '15 20:04 anovak

So sorry - will use correct handle in future, which is @adamnovak

On Apr 7, 2015 1:58 PM, "anovak" [email protected] wrote:

Hi Gents,

I seem to be getting looped in as anovak (Anthony Novak), so Adam Novak may not be getting these notifications. Cheers

— Reply to this email directly or view it on GitHub https://github.com/ga4gh/schemas/issues/275#issuecomment-90730650.

benedictpaten avatar Apr 08 '15 00:04 benedictpaten

On Tue, Apr 7, 2015 at 1:48 PM, Erik Garrison [email protected] wrote:

@benedictpaten https://github.com/benedictpaten What are the benefits of using side graphs as a serialization format?

I think:

(1) Sufficiency. Ability to properly represent connections between the ends of molecules. This is not a unique property of side graphs. String graphs, side graphs, insert graphs, break point graphs, bidirected graphs - all these graphs can be used to represent the fact that DNA strings have two ends (5' and 3'), and distinguish between connections to each end. I think @lh3 made this point earlier in the thread, but graphs in which nodes are labeled with sequences and edges are simple pairs or pair-sets do not have that property. I interpret "labeled graph" to fall into this latter category.

(2) Compactness. If we assume that we ultimately want a graph with all "common" (I appreciate this is a nebulous term) snps, indels and structural variations then, assuming snps outnumber everything else, we end up with a graph with around 50 million sequences, and 100 million joins. Not too bad. If we break every time we have a SNP, as in a string graph, then we have 4x more sequences, roughly speaking. Perhaps I'm way off, not sure.

(3) Natural extension. The existing reference is a special case of the side graph, containing just the primary reference sequences. This gives us a nice backwards compatible path from where we are today to a richer graph world tomorrow.

(4) Translatable. There are some simple algorithms to go from a string graph to a side graph and back again, resulting in the same original string graph. @lh3 made this point, and exposed the issue with zero length sequences (which are not allowed in side graphs), that were present for insert graphs and which made this problematic.

(5) Equivalence testing. @haussler has pointed out that ranked side graphs can be tested for equivalence quickly - which is a very nice property to have. No general subgraph isomorphism problem in that case.

Disagree? Did I miss some?

What can we do with a side graph that we can't with a labeled graph where

labels may be sequences, names, or other annotations?

The labeled graph does not have the restriction of being hierarchical. This is a fine property for a coordinate system but I think we should be cautious about making the side graph the basic currency of the system because this will limit applications. It may also add complexity to the processing of the reference system and subsets of it.

— Reply to this email directly or view it on GitHub https://github.com/ga4gh/schemas/issues/275#issuecomment-90728524.

benedictpaten avatar Apr 08 '15 01:04 benedictpaten

@ekg I have not thought of multiple coordinate systems. It is a good catch. However, I am not sure about the use case of that. It would be overcomplicated if we encode multiple versions of the genome. If you are thinking of the coexistence of genomic and transcriptomic coordinates, I think it is equally nature to annotate a side graph.

Side graph and properly path-labeled string graph are conceptually equivalent and can be easily converted to each other. A major difference is that side graph has one coordinate system throughout, but labeled string graph has two, the native unstable node coordinate and the added stable path coordinate. The two coordinate systems lead to two concerns. Firstly, most of time we don't care about the unstable node coordinates. We want to query with "chr1:12345" and get results back in a similar way. Having the node coordinate in the way complicates some operations. We don't have this problem with side graph. Secondly, because the path coordinates are not native, we can generate illegal paths in various ways (e.g. looped paths, disconnected paths, overlapping paths and nodes not on any paths) and end up with an illegal coordinate system. In contrast, side graph disallows the many ways to shoot ourselves in the foot. This is to me a critical advantage. I always believe the right representation should naturally impose the constraints we want to apply.

I can now read a side graph into C objects (see lh3/sdg). The implementation is more complex than an string/assembly graph but probably as complex as path-labeled string graph. In addition, you can implement a side graph on top of your vg data structure once vg also supports orientation. You just hide the internal node IDs away from endusers.

lh3 avatar Apr 08 '15 02:04 lh3

The question boils down to: do we allow a base to be identified with two or more labels at the same time (e.g. chr1v1:12345 and chr1v2:23456)? If the answer is no, side graph is clearly the right representation because it is naturally imposes this constraint. If the answer is yes, path+string graph shows its strength because the asymmetry of sequence and path in a side graph is problematic.

lh3 avatar Apr 08 '15 03:04 lh3

@lh3 Surely the right thing to do with respect to alternate reference systems is to have a function that maps a side graph plus a set of disjoint paths on that side graph representing a new coordinate system to a new side graph representing the same topology but based on the new coordinate system. I assert that the complexity in doing this is comparable to the complexity of handling path coordinates on an @ekg style sequence graph.

@ekg It would be interesting if you could represent the GRCh38 reference chromosomes as paths in your current 1000 Genomes vg graph, and hence map all the 1000 Genomes phase 3 variation into GRCh38? What are the issues in doing that? My guess is that a lot of it will work well, but that there will be complications at places where the chromosomes diverge.

Richard

On 8 Apr 2015, at 04:52, Heng Li [email protected] wrote:

The question boils down to: do we allow a base to be identified with two or more labels at the same time (e.g. chr1v1:12345 and chr1v2:23456)? If the answer is no, side graph is clearly the right representation because it is naturally imposes this constraint. If the answer is yes, path+string graph shows its strength because the asymmetry of sequence and path in a side graph is problematic.

— Reply to this email directly or view it on GitHub https://github.com/ga4gh/schemas/issues/275#issuecomment-90795876.

The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

richarddurbin avatar Apr 08 '15 09:04 richarddurbin

A side graph has a native coordinate system. Accessing with this coordinate system is much easier than alternative systems. This asymmetry seems awkward. Each path coordinate systems in a string graph is equivalent. It is a more natural way to describe parallel coordinates. @haussler talked about transforming one side graph coordinate to another, but in my understanding, the transformed side graph still has one coordinate system. We don't have two parallel systems.

It is worth noting that liftover between genome builds is nontrivial. UCSC and NCBI are providing different mappings. Parallel coordinate systems may sound easier with path+string graph in theory, but in practice, I doubt any solutions so far work.

lh3 avatar Apr 08 '15 11:04 lh3

I agree with @lh3 that path+string graphs will not solve the difficulties of the liftover problem. However, they will at least enable us to represent particular mappings between different assemblies in one system.

ekg avatar Apr 08 '15 12:04 ekg

Yes - the side graphs come with a single reference coordinate system. I think this is OK/good.
We may want to transform coordinate systems, but as all agree this is a complex issue and I think it is fine for it to be a separate operation that generates a new side graph.

We still need paths and the ability to do things with them.

Richard

On 8 Apr 2015, at 12:57, Heng Li [email protected] wrote:

A side graph has a native coordinate system. Accessing with this coordinate system is much easier than alternative systems. This asymmetry seems awkward. Each path coordinate systems in a string graph is equivalent. It is a more natural way to describe parallel coordinates. @haussler https://github.com/haussler talked about transforming one side graph coordinate to another, but in my understanding, the transformed side graph still has one coordinate system. We don't have two parallel systems.

It is worth noting that liftover between genome builds is nontrivial. UCSC and NCBI are providing different mappings. Parallel coordinate systems may sound easier with path+string graph in theory, but in practice, I doubt any solutions so far work.

— Reply to this email directly or view it on GitHub https://github.com/ga4gh/schemas/issues/275#issuecomment-90891156.

The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

richarddurbin avatar Apr 08 '15 16:04 richarddurbin

As I understand it, the controversy of

subgraph = set of sequences and joins

vs

subgraph = set of segments of sequences and joins

revolves around where and how the coordinate transforms are done. I'm afraid if we specify subgraph = set of segments of sequences and joins we start down a slippery slope. It will bake in the coordinate transform required when going from having a sequence establish the basic coordinate system to having a segment of that sequence establish the basic coordinate system. However, almost as frequently we will want to concatenate two sequences and make a single coordinate system (i.e. sequence) for that concatenation. Most generally, we will want to concatenate a series of segments and make a single coordinate system for that object. I think we should make all these coordinate transforms, including sequence -> segment, one and the same, for simplicity. That means having (just) two distinct notions:

  1. (simple) subgraph = set of sequences and joins taken from an existing side graph. Defining a simple subgraph is an operation that does not create a new side graph, it only references part of an existing side graph.
  2. subgraph transform = creation of a new side graph that has its own new coordinate system but is demonstrably homomorphic to a subgraph of (the most granular form of) an existing side graph.

Here by "most granular form" I mean the side graph obtained by making every sequence have length 1. By "demonstrably homomorphic to" I mean that there is a defined procedure for going from a side s in the new side graph S to the corresponding side s' in the original side graph S' that satisfies the conditions of a graph homomorphism, namely that if there is a join between sides s and t in S, then there is a join between the corresponding sides s' and t' in S'. (Also, obviously, we require that the two sides of a base in S are the two sides of a corresponding base in S' with the same label.) A homomorphism will essentially be a coordinate transform. Note that a homomorphism can map two sides in S to a single side in S'. If it does not, then it is a (subgraph) isomorphism.

A subgraph transform is a very useful thing. As a special case, one could transform the whole side graph, not just a subgraph of it. So this includes the operation of "defining a new set of primary paths on a reference side graph", which is something like a liftover-style mapping from one version of the reference to another. It also includes the operation of taking the two paths in a reference side graph S' that represent the two copies of chromosome 1 in an individual and making a new side graph S representing the diploid genome of that individual that has two separate sequences, one for each of these paths. Here we expect the two paths to share many nodes in the reference S', so this is a homomorphism, not a subgraph isomorphism.

I like the subgraph transform much better than defining an explicit "split" operation on nodes of a reference side graph. If you are going to split, this is messy enough that you should be doing your work on an entirely new side graph, and keeping track of a homomorphism between the new side graph you are creating and the original side graph.

-D

On Tue, Apr 7, 2015 at 11:28 AM, Benedict Paten [email protected] wrote:

@richarddurbin / @lh3 / @anovak / @haussler - I think it would be better to formulate a sub-graph of a side graph in practice as a set of segments and joins, rather than a set of sequences and joins.

The pros of this are:

(1) For a given position + radius query the client gets back exactly the subsequences within the radius - there is no need to recompute this on the client.

(2) We can use segment+join subgraphs to describe annotations, such as exons or introns, that include the variation that's present in the population. The sequence+join subgraph, which does not bound the subsequences, does not make it easy to express this.

The cons (as I see it) are: A small increase in the amount of data that needs to be transferred between server and client, because you need to specify the additional subsequence data.

Did I miss anything compelling as to why we wouldn't make the job of the client a little easier?

On Mon, Apr 6, 2015 at 12:38 AM, Richard Durbin [email protected]

wrote:

@lh3 The only sequences I can see as a problem for obtaining the full sequence with my proposal are the well-known primary chromosome references, and I am happy to cache them locally. Then surely my proposal is simplest, with no requirement for coordinate offsets or path mapping.

@ekg You are right that to map between ReferenceSets on the same graph you need to remap to a new set of paths. You have implemented path mapping, which is powerful, but I suggest that this is not simpler than having an offset on sequence segments as @lh3 and @adamnovak propose. In any case, I still think that given you have this, mapping between your representation and side graphs is natural.

On 3 Apr 2015, at 14:57, Heng Li [email protected] wrote:

@ekg https://github.com/ekg your (2) and @adamnovak < https://github.com/adamnovak>'s (4) are different. In my understanding, you offset everything, including sequences, joins and the coordinate system. You need to apply offset everywhere when we translate between the whole graph and the subgraph. @adamnovak https://github.com/adamnovak just marks a sequence in the subgraph as a subsequence of the whole graph. We don't need to offset joins and most importantly we use the same coordinate system. @richarddurbin https://github.com/richarddurbin wants to go a step further by even not specifying the offset.

@richarddurbin https://github.com/richarddurbin When we retrieve a subgraph, we would usually like to retrieve subsequences included in this subgraph. Having an offset field will make this easier. Of course you can retrieve subsequence later with your proposal but getting the coordinates needs extra work.

— Reply to this email directly or view it on GitHub < https://github.com/ga4gh/schemas/issues/275#issuecomment-89297152>.

The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

— Reply to this email directly or view it on GitHub https://github.com/ga4gh/schemas/issues/275#issuecomment-89960344.

— Reply to this email directly or view it on GitHub https://github.com/ga4gh/schemas/issues/275#issuecomment-90690409.

haussler avatar Apr 08 '15 18:04 haussler

I am happy with this with respect to the simple subgraph.

With respect to the more complex thing, I am happy with the isomorphism. You would specify that as an ordered list of non-overlapping paths. Then one takes all the joins linking bases in these paths. This defines a new side graph. The simple case is segments. But this also handles for you the establishment of a new reference system as we were discussing earlier today. The fact that the paths are non-overlapping makes it an isomorphism, and allows you to keep track of the mapping and its inverse, which allows you to merge subgraphs.

I am not so happy with the homomorphism where one base in the original side graph is mapped to two (or more) separate bases in the new graph. I am not sure I need the individual side graph containing two separated haplotypes that you describe. But maybe I don't see valuable things that you see. At any rate, I would make that a separate step. For the purposes we were discussing previously, the isomorphism is sufficient, and it would just be made of segments of the original.

Richard

On 8 Apr 2015, at 19:25, haussler [email protected] wrote:

As I understand it, the controversy of

subgraph = set of sequences and joins

vs

subgraph = set of segments of sequences and joins

revolves around where and how the coordinate transforms are done. I'm afraid if we specify subgraph = set of segments of sequences and joins we start down a slippery slope. It will bake in the coordinate transform required when going from having a sequence establish the basic coordinate system to having a segment of that sequence establish the basic coordinate system. However, almost as frequently we will want to concatenate two sequences and make a single coordinate system (i.e. sequence) for that concatenation. Most generally, we will want to concatenate a series of segments and make a single coordinate system for that object. I think we should make all these coordinate transforms, including sequence -> segment, one and the same, for simplicity. That means having (just) two distinct notions:

  1. (simple) subgraph = set of sequences and joins taken from an existing side graph. Defining a simple subgraph is an operation that does not create a new side graph, it only references part of an existing side graph.
  2. subgraph transform = creation of a new side graph that has its own new coordinate system but is demonstrably homomorphic to a subgraph of (the most granular form of) an existing side graph.

Here by "most granular form" I mean the side graph obtained by making every sequence have length 1. By "demonstrably homomorphic to" I mean that there is a defined procedure for going from a side s in the new side graph S to the corresponding side s' in the original side graph S' that satisfies the conditions of a graph homomorphism, namely that if there is a join between sides s and t in S, then there is a join between the corresponding sides s' and t' in S'. (Also, obviously, we require that the two sides of a base in S are the two sides of a corresponding base in S' with the same label.) A homomorphism will essentially be a coordinate transform. Note that a homomorphism can map two sides in S to a single side in S'. If it does not, then it is a (subgraph) isomorphism.

A subgraph transform is a very useful thing. As a special case, one could transform the whole side graph, not just a subgraph of it. So this includes the operation of "defining a new set of primary paths on a reference side graph", which is something like a liftover-style mapping from one version of the reference to another. It also includes the operation of taking the two paths in a reference side graph S' that represent the two copies of chromosome 1 in an individual and making a new side graph S representing the diploid genome of that individual that has two separate sequences, one for each of these paths. Here we expect the two paths to share many nodes in the reference S', so this is a homomorphism, not a subgraph isomorphism.

I like the subgraph transform much better than defining an explicit "split" operation on nodes of a reference side graph. If you are going to split, this is messy enough that you should be doing your work on an entirely new side graph, and keeping track of a homomorphism between the new side graph you are creating and the original side graph.

-D

On Tue, Apr 7, 2015 at 11:28 AM, Benedict Paten [email protected] wrote:

@richarddurbin / @lh3 / @anovak / @haussler - I think it would be better to formulate a sub-graph of a side graph in practice as a set of segments and joins, rather than a set of sequences and joins.

The pros of this are:

(1) For a given position + radius query the client gets back exactly the subsequences within the radius - there is no need to recompute this on the client.

(2) We can use segment+join subgraphs to describe annotations, such as exons or introns, that include the variation that's present in the population. The sequence+join subgraph, which does not bound the subsequences, does not make it easy to express this.

The cons (as I see it) are: A small increase in the amount of data that needs to be transferred between server and client, because you need to specify the additional subsequence data.

Did I miss anything compelling as to why we wouldn't make the job of the client a little easier?

On Mon, Apr 6, 2015 at 12:38 AM, Richard Durbin [email protected]

wrote:

@lh3 The only sequences I can see as a problem for obtaining the full sequence with my proposal are the well-known primary chromosome references, and I am happy to cache them locally. Then surely my proposal is simplest, with no requirement for coordinate offsets or path mapping.

@ekg You are right that to map between ReferenceSets on the same graph you need to remap to a new set of paths. You have implemented path mapping, which is powerful, but I suggest that this is not simpler than having an offset on sequence segments as @lh3 and @adamnovak propose. In any case, I still think that given you have this, mapping between your representation and side graphs is natural.

On 3 Apr 2015, at 14:57, Heng Li [email protected] wrote:

@ekg https://github.com/ekg your (2) and @adamnovak < https://github.com/adamnovak>'s (4) are different. In my understanding, you offset everything, including sequences, joins and the coordinate system. You need to apply offset everywhere when we translate between the whole graph and the subgraph. @adamnovak https://github.com/adamnovak just marks a sequence in the subgraph as a subsequence of the whole graph. We don't need to offset joins and most importantly we use the same coordinate system. @richarddurbin https://github.com/richarddurbin wants to go a step further by even not specifying the offset.

@richarddurbin https://github.com/richarddurbin When we retrieve a subgraph, we would usually like to retrieve subsequences included in this subgraph. Having an offset field will make this easier. Of course you can retrieve subsequence later with your proposal but getting the coordinates needs extra work.

— Reply to this email directly or view it on GitHub < https://github.com/ga4gh/schemas/issues/275#issuecomment-89297152>.

The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

— Reply to this email directly or view it on GitHub https://github.com/ga4gh/schemas/issues/275#issuecomment-89960344.

— Reply to this email directly or view it on GitHub https://github.com/ga4gh/schemas/issues/275#issuecomment-90690409.

— Reply to this email directly or view it on GitHub https://github.com/ga4gh/schemas/issues/275#issuecomment-90995341.

The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

richarddurbin avatar Apr 08 '15 23:04 richarddurbin

This issue is also coming up an another thread Updating methods to work with side graphs. (#269)

Can we make a connection here? Over there is discussion of string graph - to - side graph transformation by specifying a disjoint set of paths in a string graph that then become sequences in the side graph. The way I see this in general is that there are two kinds of operations:

  1. Declaring a subgraph of an existing graph, which does not demand the actual creation of or separate existence of a second derived graph to represent this subgraph
  2. Establishing a relationship between two different graphs, call one the master (assumed already to exist) and the other the derived graph (which may already exist or may be being defined). Unlike (1), this demands the separate existence of the derived graph, and thus if necessary, its creation.

With regard to the master graph, the most fundamental objects are sequences, joins and paths. We want everything to be expressed in terms of these as much as possible. The simplest form of operation (1) is to use as Richard suggested only the 2 most fundamental objects: sequences and joins, which is what I think we have now agreed on.

In operation (2), establishing a relationship between a derived graph and a master graph, as Erik points out, the most fundamental thing is to identify a path in the master graph with a sequence in the derived graph. When the consecutive bases of a sequence in the derived graph are mapped to identical consecutive bases of a path in the master graph we get our simplest and most fundamental relationship between the master and the derived graph. We definitely want to have an operation that establishes this. By mapping oriented bases, this operation establishes a mapping M from each side s in the derived graph to a corresponding side M(s) in the master graph. Mapping joins from the derived graph to the master graph then follows: a join (s,t) in the derived graph must map to the join (M(s), M(t)) in the master graph. We should be able to add any such joins we want to the derived graph.

Any mapping M from the derived graph to the master graph in which the consecutive bases in each sequence of the derived graph map to consecutive identical bases in a path on the master graph, and any (explicit) join (s,t) in the derived graph maps to the (implicit or explicit) join (M(s), M(t)), is a graph homomorphism from the derived graph into the master graph. Conversely, every graph homomorphism from a derived graph to the master graph has this form. Since graph homomorphism is the most fundamental mapping in graph theory, this strongly argues that we should support it.

Isomorphism is just two homomorphisms, one from the derived graph to the master, and another from the master to the derived graph. So we can get isomorphism easily if we have homomorphism. However, we can't get homomorphism from isomorphism.

For these reasons I suggest that there be an operation that creates a derived side graph from a master side graph by homomorphism. The input would be (1) a set P of paths on the master graph (which will define the homomorphism M) and (2) a set J of joins between sides in the path set P such that if sides s and t are joined, then the corresponding sides M(s) and M(t) in the master graph are joined. The output would be a new derived side graph G(P) and a homomorphism M from G(P) into the master graph such that (1) each path in P is a separate sequence in G(P), even when these paths overlap in the master graph, and (2) for each (explicit) join in G(P) between sides s and t in the paths of P, there is a join in the master graph between M(s) and M(t).

A homomorphism M is a mapping, so formally it is a set of ordered pairs of the form (x, M(x)) where x is an object in the derived graph G(P) and M(x) is the corresponding object in the master graph. To be represented efficiently, we should probably represent M as a set of pairs (x, M(x)) where x is a segment or a join. These are enough.

Again, isomorphism is just a special case of this relatively clean and general setup. I gave an example in my previous post where a homomorphism that is not an isomorphism will be useful.

-D

On Wed, Apr 8, 2015 at 5:11 PM, David Haussler [email protected] wrote:

On Wed, Apr 8, 2015 at 4:12 PM, Richard Durbin [email protected] wrote:

I am happy with this with respect to the simple subgraph.

With respect to the more complex thing, I am happy with the isomorphism. You would specify that as an ordered list of non-overlapping paths. Then one takes all the joins linking bases in these paths. This defines a new side graph.

Yes, with the recognition that we allow, in an extreme case, a path to be a single join with no bases, which connects to bases that appear in other paths that do contain bases. This case is needed, e.g. when the new side graph you are building consists of two base-containing paths from the original graph with a join between them somewhere in the middle. The join is considered to be a third path. The definition of overlap is of course sharing a base. So this third path does not overlap with the first two paths.

The simple case is segments. But this also handles for you the establishment of a new reference system as we were discussing earlier today. The fact that the paths are non-overlapping makes it an isomorphism, and allows you to keep track of the mapping and its inverse,

yes

which allows you to merge subgraphs.

yes but you can just as easily merge graphs defined by homomorphisms as isomorphisms, it is just that you have a choice as to whether to keep two nodes that map the same place separate or merge them together. Your rule here can be tailored to the app.

I am not so happy with the homomorphism where one base in the original side graph is mapped to two (or more) separate bases in the new graph.

Let's talk of the mapping going the other way, because then it is a mapping. In a homomorphism, each base in the new graph maps to one and only one place in the original graph. So it is a legitimate mapping, just not always one-to-one.

I am not

sure I need the individual side graph containing two separated haplotypes that you describe. But maybe I don't see valuable things that you see.

My guess is the most popular use of the not-one-to-one homomorphism will to be when people want to make customized reference structures that are conveniently defined as side graphs, and use them to map their raw reads, read pairs or read sets to, or to count things or do other various analysis. These would be used when you have partial phasing information for the sample you are working with that allows you to split some nodes of the standard reference structure. For example, in the standard reference structure there may be a sequence abcde with a variant b' and a variant d' coming off as 1 base bubbles. But if you know that the primes are phased together in the sample you are dealing with, you may prefer to map your raw data to a custom version of the reference with c split into c and c', so that there is a primary path abcde with a bubble b'c'd'. You would keep track of your work via a homomorphism from the new graph to the original reference structure in which c and c' in the new graph both map to c in the original reference graph.

At any

rate, I would make that a separate step. For the purposes we were discussing previously, the isomorphism is sufficient, and it would just be made of segments of the original.

But I think when you work it out, the homomorphism requires almost no additional machinery beyond the isomorphism, and is more flexible. -D

Richard

On 8 Apr 2015, at 19:25, haussler [email protected] wrote:

As I understand it, the controversy of

subgraph = set of sequences and joins

vs

subgraph = set of segments of sequences and joins

revolves around where and how the coordinate transforms are done. I'm afraid if we specify subgraph = set of segments of sequences and joins we start down a slippery slope. It will bake in the coordinate transform required when going from having a sequence establish the basic coordinate system to having a segment of that sequence establish the basic coordinate system. However, almost as frequently we will want to concatenate two sequences and make a single coordinate system (i.e. sequence) for that concatenation. Most generally, we will want to concatenate a series of segments and make a single coordinate system for that object. I think we should make all these coordinate transforms, including sequence -> segment, one and the same, for simplicity. That means having (just) two distinct notions:

  1. (simple) subgraph = set of sequences and joins taken from an existing side graph. Defining a simple subgraph is an operation that does not create a new side graph, it only references part of an existing side graph.
  2. subgraph transform = creation of a new side graph that has its own new coordinate system but is demonstrably homomorphic to a subgraph of (the most granular form of) an existing side graph.

Here by "most granular form" I mean the side graph obtained by making every sequence have length 1. By "demonstrably homomorphic to" I mean that there is a defined procedure for going from a side s in the new side graph S to the corresponding side s' in the original side graph S' that satisfies the conditions of a graph homomorphism, namely that if there is a join between sides s and t in S, then there is a join between the corresponding sides s' and t' in S'. (Also, obviously, we require that the two sides of a base in S are the two sides of a corresponding base in S' with the same label.) A homomorphism will essentially be a coordinate transform. Note that a homomorphism can map two sides in S to a single side in S'. If it does not, then it is a (subgraph) isomorphism.

A subgraph transform is a very useful thing. As a special case, one could transform the whole side graph, not just a subgraph of it. So this includes the operation of "defining a new set of primary paths on a reference side graph", which is something like a liftover-style mapping from one version of the reference to another. It also includes the operation of taking the two paths in a reference side graph S' that represent the two copies of chromosome 1 in an individual and making a new side graph S representing the diploid genome of that individual that has two separate sequences, one for each of these paths. Here we expect the two paths to share many nodes in the reference S', so this is a homomorphism, not a subgraph isomorphism.

I like the subgraph transform much better than defining an explicit "split" operation on nodes of a reference side graph. If you are going to split, this is messy enough that you should be doing your work on an entirely new side graph, and keeping track of a homomorphism between the new side graph you are creating and the original side graph.

-D

On Tue, Apr 7, 2015 at 11:28 AM, Benedict Paten < [email protected]> wrote:

@richarddurbin / @lh3 / @anovak / @haussler - I think it would be better to formulate a sub-graph of a side graph in practice as a set of segments and joins, rather than a set of sequences and joins.

The pros of this are:

(1) For a given position + radius query the client gets back exactly the subsequences within the radius - there is no need to recompute this on the client.

(2) We can use segment+join subgraphs to describe annotations, such as exons or introns, that include the variation that's present in the population. The sequence+join subgraph, which does not bound the subsequences, does not make it easy to express this.

The cons (as I see it) are: A small increase in the amount of data that needs to be transferred between server and client, because you need to specify the additional subsequence data.

Did I miss anything compelling as to why we wouldn't make the job of the client a little easier?

On Mon, Apr 6, 2015 at 12:38 AM, Richard Durbin < [email protected]>

wrote:

@lh3 The only sequences I can see as a problem for obtaining the full sequence with my proposal are the well-known primary chromosome references, and I am happy to cache them locally. Then surely my proposal is simplest, with no requirement for coordinate offsets or path mapping.

@ekg You are right that to map between ReferenceSets on the same graph you need to remap to a new set of paths. You have implemented path mapping, which is powerful, but I suggest that this is not simpler than having an offset on sequence segments as @lh3 and @adamnovak propose. In any case, I still think that given you have this, mapping between your representation and side graphs is natural.

On 3 Apr 2015, at 14:57, Heng Li [email protected] wrote:

@ekg https://github.com/ekg your (2) and @adamnovak < https://github.com/adamnovak>'s (4) are different. In my understanding, you offset everything, including sequences, joins and the coordinate system. You need to apply offset everywhere when we translate between the whole graph and the subgraph. @adamnovak < https://github.com/adamnovak> just marks a sequence in the subgraph as a subsequence of the whole graph. We don't need to offset joins and most importantly we use the same coordinate system. @richarddurbin <https://github.com/richarddurbin

wants to go a step further by even not specifying the offset.

@richarddurbin https://github.com/richarddurbin When we retrieve a subgraph, we would usually like to retrieve subsequences included in this subgraph. Having an offset field will make this easier. Of course you can retrieve subsequence later with your proposal but getting the coordinates needs extra work.

— Reply to this email directly or view it on GitHub < https://github.com/ga4gh/schemas/issues/275#issuecomment-89297152>.

The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

— Reply to this email directly or view it on GitHub <https://github.com/ga4gh/schemas/issues/275#issuecomment-89960344 .

— Reply to this email directly or view it on GitHub https://github.com/ga4gh/schemas/issues/275#issuecomment-90690409.

— Reply to this email directly or view it on GitHub < https://github.com/ga4gh/schemas/issues/275#issuecomment-90995341>.

The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

— Reply to this email directly or view it on GitHub https://github.com/ga4gh/schemas/issues/275#issuecomment-91064230.

haussler avatar Apr 10 '15 22:04 haussler

It's great to see some mathematical analysis :)

~p

pgrosu avatar Apr 15 '15 17:04 pgrosu