
Where are sequences stored in the graph?

Open ekg opened this issue 9 years ago • 14 comments

Reading common.avdl (https://github.com/ga4gh/schemas/blob/master/src/main/resources/avro/common.avdl#L8), I'm curious where the association between actual DNA sequences and segments is made. Why should it not be possible to directly access the sequences associated with a particular segment via the API?

ekg avatar Feb 25 '15 14:02 ekg

It's actually a bit more confusing than this, because it is possible to ask questions about actual sequences via the Beacon API. It would be nice if we could keep everything consistent without bloating the simplest possible Beacon implementation.

awz avatar Feb 25 '15 15:02 awz

I meant to cancel a comment rather than close/reopen the issue.

@awz So I'm not wrong to wonder how I could examine the internal structures of the database backing the graph via the API? And does the system provide any concept of nodes and edges with identifiers?

ekg avatar Feb 25 '15 15:02 ekg

@ekg - I expect it is possible (maybe?), but for the time being the "how" seems to be extremely opaque. It would be great to have some documentation / code examples for this. Even if it is possible, this is so important that we might want to change things around so the code is much simpler and more obvious.

awz avatar Feb 25 '15 15:02 awz

@awz It's expedient to be able to say "there is a node with id=N, get me that node and its context." In vg this is a very core concept: vg find -n <id> -c <steps> x.vg >node_context.vg. This matters a lot when working with the reference system, and I'm having some trouble getting my head around how tools are supposed to operate, or at the very least how people are supposed to debug what's going on with the backend of the API, if this isn't available.
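
For readers who have not used this kind of query, here is a minimal sketch of what a "node plus context" lookup does: a breadth-first expansion out to a fixed number of steps. It is a generic illustration over a plain adjacency map; the function name, arguments, and graph representation are invented for the example and are not vg's actual implementation.

```python
from collections import deque

def node_context(adjacency, start_id, steps):
    """Return all node ids within `steps` hops of `start_id`.

    `adjacency` maps node id -> iterable of neighbouring node ids.
    A generic sketch; vg's real index and traversal are more involved.
    """
    seen = {start_id}
    frontier = deque([(start_id, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == steps:
            continue  # do not expand past the requested context radius
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return seen

# Example: node_context({1: [2], 2: [1, 3], 3: [2]}, start_id=2, steps=1)
# returns {1, 2, 3}.
```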

ekg avatar Feb 25 '15 15:02 ekg

Anything is possible. Here are a few standard algorithms (BFS/DFS) for traversing the graph:

http://en.wikipedia.org/wiki/Breadth-first_search
http://en.wikipedia.org/wiki/Depth-first_search

If you want, you can even do topological sorting to speed some things up:

http://en.wikipedia.org/wiki/Topological_sorting

If you want, you can hash the nodes. You can even build a B-tree, which will simplify some searches:

http://en.wikipedia.org/wiki/B-tree

Again, many possibilities, but a careful discussion and a diagram of the algorithm(s) and implementation - preferably including pseudo-code - will make things easier to analyze, especially in terms of O() complexity.
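
To make the topological-sorting suggestion concrete, here is a minimal Kahn's-algorithm sketch over a directed acyclic graph given as an adjacency map; the representation is invented for the example and is not tied to any GA4GH or vg structure.

```python
from collections import defaultdict, deque

def topological_order(successors):
    """Kahn's algorithm over a DAG given as {node: [successor, ...]}.

    Assumes the input is acyclic; nodes on a cycle never reach indegree
    zero and are simply left out of the returned order.
    """
    indegree = defaultdict(int)
    for node, succs in successors.items():
        indegree.setdefault(node, 0)
        for s in succs:
            indegree[s] += 1
    ready = deque(n for n, d in indegree.items() if d == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for s in successors.get(node, ()):
            indegree[s] -= 1
            if indegree[s] == 0:
                ready.append(s)
    return order

# Example: topological_order({"a": ["b", "c"], "b": ["c"], "c": []})
# returns ["a", "b", "c"].
```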

~p

pgrosu avatar Feb 25 '15 16:02 pgrosu

@ekg In the GA4GH graph, sequences are simply strings, which is not so obvious from the schema. You can query a sequence/subsequence via Segment.start.sequenceId (PS: I am not so sure, though) and get the topology from Segment.startJoin and Segment.endJoin. Is this what you want?

BTW, in a GA4GH graph the end of a sequence more often joins to the middle of another sequence. Each segment end only has one join. This is very different from the vg graph (in my understanding) or any assembly graph.
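
To make that reading of the schema concrete, here is a rough sketch of how a client might represent segments and resolve their bases, assuming sequences are held as plain strings. The class and field names (Position, Segment, sequence_id, start_join, end_join) paraphrase the discussion above and are not the normative Avro definitions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Position:
    sequence_id: str               # which underlying sequence string
    offset: int                    # 0-based offset into that sequence
    reverse_strand: bool = False

@dataclass
class Segment:
    start: Position                # where the segment begins
    length: int                    # how many bases it covers
    start_join: Optional[Position] = None  # what the segment's start attaches to
    end_join: Optional[Position] = None    # what the segment's end attaches to

def segment_bases(sequences: dict, seg: Segment) -> str:
    """Resolve a segment to its bases, with sequences stored as plain strings."""
    seq = sequences[seg.start.sequence_id]
    return seq[seg.start.offset : seg.start.offset + seg.length]

# Example:
# sequences = {"chr_toy": "ACGTACGT"}
# seg = Segment(start=Position("chr_toy", 2), length=3)
# segment_bases(sequences, seg) == "GTA"
```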

lh3 avatar Feb 25 '15 17:02 lh3

@lh3 If this is how things work, then I see a way for me to integrate vg with this part of the API.

I don't maintain strong identities associated with nodes in the graph, as it might be convenient to change the identifiers used internally. vg::Paths have names and these directly correspond to the concept of sequences in GA4GH. As such, it's just necessary for me to generate join information, which is natively available from the path annotations in the graph index.
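
A rough sketch of what that translation could look like, using an invented, simplified path representation (an ordered list of (node_id, is_reverse) visits) rather than vg's actual index structures:

```python
def joins_from_path(path_visits):
    """Derive join records from an ordered path of node visits.

    `path_visits` is a list of (node_id, is_reverse) pairs, a simplified
    stand-in for path annotations; each consecutive pair of visits implies
    a join between the outgoing end of one segment and the incoming end
    of the next.
    """
    joins = []
    for (a, a_rev), (b, b_rev) in zip(path_visits, path_visits[1:]):
        joins.append({
            "from_segment": a,
            "from_end": "start" if a_rev else "end",
            "to_segment": b,
            "to_end": "end" if b_rev else "start",
        })
    return joins

# Example: joins_from_path([(1, False), (2, False), (3, True)]) yields two
# joins: end of 1 -> start of 2, then end of 2 -> end of 3.
```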

@pgrosu vg find implements a BFS-based context expansion. If you're interested, you can try it with example data in the vg repo.

As an aside, this reference technology won't be adopted unless it can be communicated easily, and it is at risk of becoming very hard for anyone outside of this group to grok. I think some serious thought needs to be made about what is kept here and what is not. Adding lots of standard algorithms and clever things like execution environments to the API seems like a way to ensure that very few people ever manage to fully implement it.

I say this because I feel very reluctant to implement much beyond the basic semantic structure of the graph, which I see as:

  • Segments (aka vg::Paths) defined by the "joins" of their start and end onto other Segments,
  • annotations on these implied sequences,
  • and ways to find strings and values in this system.

For this to be adopted widely, hundreds of people would need to do the same, just as they have for formats like FASTA, BED, and SAM, all of which are simple enough to define on a single presentation slide.

ekg avatar Feb 26 '15 10:02 ekg

As an aside, this reference technology won't be adopted unless it can be communicated easily, and it is at risk of becoming very hard for anyone outside of this group to grok. I think some serious thought needs to be made about what is kept here and what is not. Adding lots of standard algorithms and clever things like execution environments to the API seems like a way to ensure that very few people ever manage to fully implement it.

+1; I would actually go as far as to say that it's even becoming challenging for people who are in this group to grok.

That said, the execution environment stuff (from the perspective of the containers/workflow team) will be kept separate from the read/reference/variation schemas.

fnothaft avatar Feb 26 '15 13:02 fnothaft

I'm catching up on this thread with interest. I think we need to add documentation, alongside the API schema, that describes the API. I think most people have learnt to understand SAM/BAM and VCF through equivalent high-level descriptions, e.g.:

http://samtools.github.io/hts-specs/SAMv1.pdf

We will need to do the same. I think these can be part of a module-level documentation folder. This will be an easier entry point than simply picking up one of the source files and working from there.

I do not think that the design is inherently "complex", or that there is some magically simpler definition, just that we need better ways of communicating it. That said - I think Heng (see Heng's thread) and others are asking good questions about simplification. Actually, as pointed out by Erik, the very basic graph definition, as based upon Richard's original schema, is very simple, and is an elegant fit to our problem.

benedictpaten avatar Feb 26 '15 16:02 benedictpaten

the very basic graph definition, as based upon Richard's original schema, is very simple, and is an elegant fit to our problem.

Exactly. To get pan-genomes up and running, we need a coordinate system. This shouldn't say much of anything about the underlying implementation. I support @richarddurbin's model because it is pretty solid in this regard, and has the nice property of providing stability even as the variation graph grows.

In the absence of practical experience with these structures, I'm concerned that a lot of effort is going into potential problems that won't end up being very serious in practice. These, when enforced via the API spec, can become impediments to adoption.

I propose putting the API development on hold until implementations, tests, and code catch up. We could even use the collaboration/competition as an opportunity to qualitatively evaluate different approaches to working with this kind of reference system.

ekg avatar Feb 26 '15 17:02 ekg

In the absence of practical experience with these structures, I'm concerned that a lot of effort is going into potential problems that won't end up being very serious in practice. These, when enforced via the API spec, can become impediments to adoption.

I propose putting the API development on hold until implementations, tests, and code catch up. We could even use the collaboration/competition as an opportunity to qualitatively evaluate different approaches to working with this kind of reference system.

Strong +1

fnothaft avatar Feb 26 '15 17:02 fnothaft

See also discussions in ga4gh/server#183. I also strongly agree with @ekg.

lh3 avatar Feb 26 '15 20:02 lh3

Thank you @ekg for the examples and code.

Definitely a strong +1 on more documentation. I'm also doing some catching up.

Thanks, Paul

pgrosu avatar Feb 27 '15 12:02 pgrosu

I agree with @ekg; we need practical experience with @richarddurbin's model. This is why I propose that we cut a release of the API soon (say 0.6.0), and concentrate some of our efforts on implementing this in the reference server. This will give us a stable platform to aim at, and those working on the higher-level API design are free to continue working in git.

I think we need to be clear about what an 'implementation' means. To me, an implementation literally means a client and server working in graph mode, nothing more. In particular, how we generate the graphs that are used as input to the server is entirely unspecified. We can use whatever methods and models we like to generate the graphs. We just need to output these graphs in a format that is compatible with Richard's model, so that the reference server can use it. We're hoping to define this format precisely soon, so it should be a relatively straightforward process to adapt, say, vg, so that it can write files in this format.

However, it is important to note that we are not defining an official graph interchange format. This is simply a pragmatic, short term solution to allow us to start working with the API properly on real data. This will allow us to gain experience, and iterate on the API design based on practical, real world knowledge and data.

jeromekelleher avatar Feb 27 '15 15:02 jeromekelleher