
Readgroups Endpoint

Open david4096 opened this issue 7 years ago • 15 comments

The access patterns for the reads API would be improved by adding a read groups endpoint (#650). Currently, the readgroupset endpoint returns a repeated list of readgroups matching the search criteria, which is confusing because the readgroupset record is modified at query time.

This PR adds the /readgroups/search endpoint, which exposes the available read groups using familiar strict matching on name, bio_sample_id, dataset_id, and read_group_set_id. Read group set messages no longer contain the list of readgroups since they can be retrieved using readgroups/<read_group_id>. This simplifies the read group set endpoint by moving the bio_sample_id filter to the appropriate entity, ReadGroup, as a single read group set might contain reads for many samples.
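The strict-matching semantics described above can be sketched as follows. This is an illustrative model over in-memory records, not the API implementation; only the field names (name, bio_sample_id, dataset_id, read_group_set_id) come from this PR.

```python
# Hypothetical sketch of strict matching for /readgroups/search over
# in-memory records. The field names follow the PR; the function itself
# is illustrative, not part of the GA4GH API.

def search_read_groups(read_groups, name=None, bio_sample_id=None,
                       dataset_id=None, read_group_set_id=None):
    """Return the read groups whose fields exactly match every given filter."""
    filters = {key: value for key, value in {
        "name": name,
        "bio_sample_id": bio_sample_id,
        "dataset_id": dataset_id,
        "read_group_set_id": read_group_set_id,
    }.items() if value is not None}
    return [rg for rg in read_groups
            if all(rg.get(key) == value for key, value in filters.items())]
```

Note that every filter is optional, so omitting read_group_set_id searches read groups across all read group sets, which is the access pattern this PR is after.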

david4096 avatar Jul 07 '16 20:07 david4096

I would be very careful when using required when defining a protobuf message - please take a look at the following note, as required is forever:

Required Is Forever: You should be very careful about marking fields as required. If at some point you wish to stop writing or sending a required field, it will be problematic to change the field to an optional field – old readers will consider messages without this field to be incomplete and may reject or drop them unintentionally. You should consider writing application-specific custom validation routines for your buffers instead. Some engineers at Google have come to the conclusion that using required does more harm than good; they prefer to use only optional and repeated. However, this view is not universal.

As long as we stay with proto3 message types, that should not be a problem but if we mix them as follows, this is still relevant (this is also from the same page):

Using proto3 Message Types

It's possible to import proto3 message types and use them in your proto2 messages, and vice versa. However, proto2 enums cannot be used in proto3 syntax.

pgrosu avatar Jul 08 '16 00:07 pgrosu

btw, I am +1 on adding the endpoint; it's only the search criteria that need to be worked through.

diekhans avatar Jul 08 '16 12:07 diekhans

Given the current data model, bio_sample_id, name, and dataset_id are all named fields in the ReadGroup message. This means they can be searched using these fields without restricting to a read group set. I believe the original schema designers were clever regarding this and left a path for accessing read groups at this granularity.

We could move ahead with presenting a more sensible offering based on the existing message fields as the current metadata is hampered without this change. If other search request fields become desired, we can add them to the request as needed.

The use case is specifically filtering read groups by their bioSampleId, although it seems sensible to provide name filtering as well. @jeromekelleher: for performing a search-reads request, the read group IDs are still included in the ReadGroupSet message itself, so that common access pattern is unchanged.

david4096 avatar Jul 08 '16 17:07 david4096

The point is to allow restricting to a read group set, not to require it. Read group sets are the most common data grouping for analysis (based on the largest cancer genomics research project to date).


diekhans avatar Jul 10 '16 08:07 diekhans

@diekhans So do you think making the dataset_id required and the read_group_set_id optional makes more sense?

david4096 avatar Jul 11 '16 22:07 david4096

Hi David,

Think of the bigger picture: what are we really interested in? We want to analyze clean, quality-controlled groupings of data (reads, variants, etc.) for clinical purposes (i.e. disease identification, personalized medicine, etc.). We want to search by metadata and not drill too closely into the bottom level unless really necessary; usually that is performed by reaching back to the sequencing center that submitted the data, in case there are significant issues with the cloud-uploaded data. We have worked too long with (g)VCFs and BAMs, which slow us down and are sub-optimal for the type of large-scale analysis that we want to target. We want to query a system through macros like this (yes, I know it looks rough, but I have omitted the other details for clarity to drive home the point):

prog GermlineAndCancer {
  qc.reads <- using.all.reads(qc="validated", filter="*carcinogenic*")
  variants.germline <- germline.pipeline.with.qc(qc.reads)
  variants.cancer   <- cancer.pipeline.with.qc(qc.reads)
  report.%date% <- saveAsHTML(getOverlapsAndRare(variants.germline, variants.cancer), globalStagingRepositoryByUser="[email protected]")
}

run.daily( "GermlineAndCancer" )  // Daily-streaming data will eventually trigger new key, rare variants.

So the key question is what type of information and data structures would most easily and correctly get us to perform such queries in the most intuitive way. That is really the key goal of GA4GH that matters the most.

Hope it helps, Paul

pgrosu avatar Jul 12 '16 00:07 pgrosu

Thanks Paul, that's very interesting. The decision on whether to include the read_group_set_id in the request seems like the last point of contention. Based on @diekhans's observation, there is a good case for including it as an optional field of the request, which is reflected in the most recent commit. Can I get your +1?

david4096 avatar Jul 12 '16 17:07 david4096

+1!

diekhans avatar Jul 12 '16 18:07 diekhans

Hi David,

The issue is not that it is optional, but that a ReadGroup can belong to multiple ReadGroupSets. If it belongs to multiple ones, does that mean we have to search all the ReadGroupSets that this ReadGroup belongs to?

A ReadGroup can be a sample, but I may want to combine it with other high-quality samples to perform a specific analysis. Then I want to find the other ReadGroupSets in which this QC'ed sample was analyzed, so that I do not duplicate results that I could use for comparison. How natural would such a query be to perform?
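A reverse lookup like that can be sketched under two assumptions that are illustrative only: each read group set lists its member read group ids, and each read group records a bio_sample_id. None of these names are drawn from the schema as it stands.

```python
# Illustrative sketch only: assumes each read group set lists its member
# read group ids and each read group records a bio_sample_id. This models
# the lookup being discussed, not the GA4GH API.

def sets_containing_sample(read_group_sets, read_groups_by_id, bio_sample_id):
    """Return the ids of every read group set that includes the given sample."""
    return [rgs["id"] for rgs in read_group_sets
            if any(read_groups_by_id[rg_id]["bio_sample_id"] == bio_sample_id
                   for rg_id in rgs["read_group_ids"])]
```

The point of the sketch is that the query needs a link from read groups back to the sets containing them, which is exactly the relationship under debate.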

Think about this scenario. A baby is born today with a predisposition to a specific cancer subtype. I want to sequence that baby every year for the rest of his/her life, including the parents at the time of birth. More refined samples of the cancer's molecular evolution will be biopsied on a more frequent basis that reflects the aggressiveness of the stage of the cancer. These might be many ReadGroupSets made up of multiple ReadGroups, which I might - or might not - want to average out for my comparison.

We are doing deep sequencing these days because we are limited in the capacity to quickly determine the active site in the form of a 3D conformation of protein sequences, and their interactions. Thus we are left with what we can inspect most quickly, which are the gene variations that become most probabilistically significant. Next-Generation Sequencing has always been an indirect analysis for understanding some types of biological processes.

So in order to correlate with diseases and possible drug targets - and their downstream interactions - we want to perform queryable analysis on a large scale.

Do you see my dilemma? We can add a ReadGroupSet id to a ReadGroup as optional, but then the data structure is back to the form of a BAM, which we are trying to get away from. We have had the same mindset towards NGS for the past 7 years - and have reverted to it periodically over the past two years here in GA4GH - but the goal is to make the wire protocol fit large-scale, global multi-repository analysis.

If we add it now and large datasets get loaded this way at different repositories, and several years down the line we change the design to be more inter-connected to suit new analysis techniques, does that mean we have to reload the data, or would we be able to perform schema evolution with minimal updates to the data to fix it?

We have to start with the analysis goals, and tailor the data-structures to fit it.

Hope you can see where I am coming from.

Thank you, Paul

pgrosu avatar Jul 13 '16 00:07 pgrosu

Complex queries for readgroups should be performed using the sequencing metadata.


diekhans avatar Jul 13 '16 05:07 diekhans

@david4096 will create a demo program to show this approach before merging.

kozbo avatar Jul 14 '16 16:07 kozbo

I greatly welcome the idea of a demo, as it is my preferred approach. It will give us the opportunity to hash out and test ideas together for integrating the best ideas from everyone.

pgrosu avatar Jul 14 '16 16:07 pgrosu

Hi David,

I'm assuming the demos will be presented via GitHub. While you are building the demo, try to see if generating temporary federated tables might help you in joining multiple data sources together. These would work as short-lived dynamic databases, where the transitory tables optimize memoization, similar to a dynamic programming approach. In time we could have a cache scheduler that makes tables more permanent if they become highly accessed, with the possibility of the tables talking to each other. It is basically a way of balancing the wire-protocol load against these temporary data hubs' throughput.

These federated tables should be auto-generated by the queries performed; thus they can live ephemerally while speeding up dynamic queries that access multiple sites and/or data sources.
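One minimal way to model the caching behaviour described here (ignoring the federation itself) is a memoized query table whose entries are promoted once they are hit often enough. Every name below is illustrative, a sketch of the idea rather than a design.

```python
# Minimal model of the "temporary federated table" idea: results of
# expensive cross-repository queries are memoized, and entries accessed
# often enough are flagged as candidates for longer-lived tables.
# All names are illustrative; federation itself is not modelled.

class QueryCache:
    def __init__(self, promote_after=3):
        self.tables = {}              # query key -> (hit count, result)
        self.promote_after = promote_after
        self.permanent = set()        # keys promoted to "permanent" tables

    def run(self, key, fetch):
        hits, result = self.tables.get(key, (0, None))
        if hits == 0:
            result = fetch()          # the expensive multi-repository query
        hits += 1
        self.tables[key] = (hits, result)
        if hits >= self.promote_after:
            self.permanent.add(key)
        return result
```

A real cache scheduler would also need eviction of cold entries and invalidation when upstream repositories change, which this sketch leaves out.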

Hope it helps, Paul

pgrosu avatar Jul 15 '16 00:07 pgrosu

@pgrosu That's very interesting, I agree! I currently have enough resources to approach this in a way that verifies the basic access patterns; however, I am interested in whether you have some scaffolding to work with in approaching the federation and optimization problems.

I'll let you know if I run into any issues putting together a minimal demonstration!

david4096 avatar Jul 18 '16 22:07 david4096

@david4096 I have ways to test things on a multi-node cluster, where the distributed repositories would be simulated by throttling the bandwidth among running processes. There are tools like Pipe Viewer or RequestsThrottler with which one can tweak the bandwidth throughput among parallel processes representing subsets of repositories.

I would not rush it, as it is better to take the time to do it right. Let's all work together to find the optimal approach, as there will be multiple iterations we will have to go through. It'll be fun :)

Feel free to ping me anytime if you need anything. It would be nice to work together to come up with an elegant solution for this, which I know would benefit everyone as well.

Cheers, ~p

pgrosu avatar Jul 18 '16 23:07 pgrosu