ga4gh-schemas Relating variant sets and readgroupsets

Relating variant sets and readgroupsets

Open david4096 opened this issue 8 years ago • 8 comments

The relationship between variant sets and read group sets should be explicit. The question "which BAM was used for which callset?" is currently answered only implicitly, through the variant set metadata. The GA4GH has the opportunity to improve the situation by making explicit the relationship between the Read Group Sets used in an analysis pipeline and the resulting Variant Set.

This could be carried out in a few ways. Currently, one can construct a request to determine whether a callset came from the same sample as a readgroup via their bio_sample_ids. However, the same sample can appear in multiple readgroups. This leaves the question "which BAM or set of RG tags was used for constructing this callset?" difficult to answer directly.

One might consider adding a read_group_ids field to a CallSet record, making that relationship explicit. This would allow individual RG tags across BAMs to be reassembled as needed to provide the underlying data for each call. It would then be trivial to construct a query that asks which BAMs were used when making this call.
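To make the proposal concrete, here is a minimal Python sketch of a CallSet carrying a read_group_ids list, and the "which BAMs were used" lookup it enables. The classes, the read_group_set_id field standing in for a BAM, and all IDs are hypothetical illustrations, not the GA4GH schema definitions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ReadGroup:
    id: str
    read_group_set_id: str  # assumption: stands in for the BAM holding this RG tag
    bio_sample_id: str


@dataclass
class CallSet:
    id: str
    bio_sample_id: str
    # Proposed field (not in the schema at the time of this thread).
    read_group_ids: List[str] = field(default_factory=list)


def read_group_sets_for(call_set: CallSet,
                        read_groups_by_id: Dict[str, ReadGroup]) -> List[str]:
    """Return the sorted IDs of the read group sets (BAMs) behind a callset."""
    return sorted({read_groups_by_id[rg_id].read_group_set_id
                   for rg_id in call_set.read_group_ids})


# Hypothetical data: one sample sequenced into three BAMs.
read_groups = {rg.id: rg for rg in [
    ReadGroup("rg1", "bamA", "sample1"),
    ReadGroup("rg2", "bamA", "sample1"),
    ReadGroup("rg3", "bamB", "sample1"),
    ReadGroup("rg4", "bamC", "sample1"),
]}
call_set = CallSet("cs1", "sample1", read_group_ids=["rg1", "rg3"])

bams = read_group_sets_for(call_set, read_groups)
print(bams)
```

Note that all four read groups share sample1, so the bio_sample_id alone could not have distinguished bamC from the BAMs actually used.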

An alternative would be to provide a map of call_set_id:read_group_id in the variant set metadata.
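As a rough sketch of this alternative, the mapping could live under a single metadata key and be inverted on the client when the question runs the other way. The key name call_set_read_groups and the plain dict standing in for variant set metadata are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical variant set metadata: call_set_id -> read_group_ids.
variant_set_metadata = {
    "call_set_read_groups": {
        "cs1": ["rg1", "rg3"],
        "cs2": ["rg2"],
        "cs3": ["rg3"],
    }
}


def call_sets_by_read_group(metadata: dict) -> Dict[str, List[str]]:
    """Invert the map so each read group lists the callsets built from it."""
    inverted = defaultdict(list)
    for call_set_id, read_group_ids in metadata["call_set_read_groups"].items():
        for read_group_id in read_group_ids:
            inverted[read_group_id].append(call_set_id)
    return {rg: sorted(cs) for rg, cs in inverted.items()}


inverted = call_sets_by_read_group(variant_set_metadata)
print(inverted["rg3"])
```

Compared with a field on each CallSet, this keeps the CallSet message unchanged but requires clients to fetch and parse the variant set metadata before they can answer either question.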

I understand that for phased calling these relationships can become complicated. Any insight into the other basic requirements is valued!

Jul 12 '16 21:07 david4096

Under the SRA/ENA model, this is addressed by the experiment/run/analysis metadata. It is only implicit in the GA4GH API because the metadata model is incomplete.

Both samples and readgroups can have variants called multiple times and in multiple ways. One has to have the provenance to correctly answer these questions.

Suggest reading: SRA handbook: http://www.ncbi.nlm.nih.gov/books/NBK47528/ GDC data model: https://gdc.nci.nih.gov/developers/gdc-data-model.

Jul 22 '16 14:07 diekhans

But please be aware that this type of metadata isn't treated by the MTT ("everything but the sequence"); though we'll jump in upon specific requests ...

Jul 22 '16 21:07 mbaudis

See also #390 for earlier thoughts on a data model for tracking object provenance.

Jul 24 '16 14:07 dglazer

Well, the good news is that there is already a way that has been kindly prepared for us by Google. At the beginning of July they published a very nice paper called Goods: Organizing Google’s Datasets. We can implement such a system - as our datasets would also become quite varied and large - and they have already worked out the issues with tracking the provenance, and inferred metadata.

Since our metadata would be provided and stored with different types of objects - through the nice work done by the MTT team - the first part of the search capability is basically a given. Then comes the second part of collecting and building inferred metadata through metadata inference, which can be performed via the creation of transitive closure graphs on the sets loaded in the different repositories. This automatically provides us with the provenance information as well. Thus the ability to inspect connected BAMs with associated CallSets becomes a trivial query.
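To illustrate the transitive closure idea on a small scale, the sketch below computes everything a callset was derived from by walking "derived from" edges. This is plain graph reachability over made-up IDs, offered as an assumption about how such a lookup might look, and is not a description of how the Goods system is implemented.

```python
from typing import Dict, List, Set

# Hypothetical provenance edges: object -> objects it was derived from.
derived_from: Dict[str, List[str]] = {
    "cs1": ["rg1", "rg3"],
    "rg1": ["bamA"],
    "rg3": ["bamB"],
    "bamA": ["sample1"],
    "bamB": ["sample1"],
}


def ancestors(node: str, edges: Dict[str, List[str]]) -> Set[str]:
    """Return every object reachable from node via derived-from edges."""
    seen: Set[str] = set()
    stack = list(edges.get(node, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited; also guards against cycles
        seen.add(current)
        stack.extend(edges.get(current, []))
    return seen


provenance = ancestors("cs1", derived_from)
print(sorted(provenance))
```

With the closure in hand, "which BAMs are connected to this CallSet" reduces to filtering the result by object type.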

They store the datasets in a catalog, which is searchable via their Google Dataset Search (Goods) API. For our version we can figure out a name that works naturally for our implementation. Below is a figure illustrating how the catalog is organized:

[Figure: goods-catalog-with-text]

In order to connect datasets together, we can even provide a query system similar to the Google Knowledge Graph, in order to determine connected sets of data, and/or those that are processed as results through similar pipelines with overlapping functional context. Below are two links to give you an idea of how knowledge graphs would work:

For example, we can utilize semi-lattices representing sets of connected data, by collapsing using subsets of annotated information to work with groups of equivalent versions, and/or other matching criteria as follows:

[Figure: semi-lattice]

We can even propagate metadata shared among sets of data in order to consolidate information, allowing us to formulate more direct queries:

[Figure: propagate-complete]

I hope these ideas help and spark further ones. That is the drive behind great projects like FireCloud and ISB-CGC, which are paving paths towards handling collections of millions and billions of data and results.

Paul

Jul 25 '16 00:07 pgrosu

Thanks for all of the input! I am trying to get to putative changes to the schemas that allow for this tracking because I believe it is not in place.

Currently, it is not possible to state which variant set came from which read group sets. My suggestion is to represent these data by making the following change:

Add a read group ID list to the Call Set message. Each Call Set comes from a specific set of read groups. That way, one can go back and view the pileup at a given position, for example.

For the data I have observed, one can derive these relationships; however, this change would make them explicit.

Aug 08 '16 23:08 david4096

@diekhans

Both samples and readgroups can have variants called multiple times and in multiple ways. One has to have the provenance to correctly answer these questions.

In the GA4GH metadata model a callset is given a biosample ID. However, if I understand correctly, this is not enough to identify the specific readgroups a callset came from. Would adding a list of Read Group Ids to a callset message cover this case?

Oct 26 '16 21:10 david4096

I believe associating with readgroups is the missing piece

Oct 26 '16 21:10 diekhans

I'd like to close this by adding a list of read_group_ids to the CallSet message. The problem is that we have tagged callsets with a biosample ID, which, if I understand this thread, is not entirely correct.

A ReadGroup is always from a single biosample ID, but a callset can be made from multiple read groups. That means that it is possible to construct a callset that covers multiple samples. For 1kgenomes, this may seem like an odd case, but I think we may have made an incorrect assumption in tagging callsets with a single biosample ID.

It seems to me the correct access pattern is to provide filtering of readgroups by their biosample ID, and then filtering callsets by their read group IDs. This avoids the scenario of improperly labeling a callset as being from a single sample, when in fact it is from multiple.
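A minimal sketch of that two-step access pattern, using plain dicts with made-up IDs (the field names mirror the ones discussed in this thread, but the helper functions are hypothetical and not part of any GA4GH API):

```python
from typing import Dict, List, Set

read_groups: List[Dict] = [
    {"id": "rg1", "bio_sample_id": "sample1"},
    {"id": "rg2", "bio_sample_id": "sample2"},
    {"id": "rg3", "bio_sample_id": "sample1"},
]
call_sets: List[Dict] = [
    {"id": "cs1", "read_group_ids": ["rg1", "rg3"]},
    {"id": "cs2", "read_group_ids": ["rg2"]},
    {"id": "cs_joint", "read_group_ids": ["rg1", "rg2"]},  # built from two samples
]


def read_group_ids_for_sample(groups: List[Dict], bio_sample_id: str) -> Set[str]:
    """Step 1: filter read groups by their biosample ID."""
    return {rg["id"] for rg in groups if rg["bio_sample_id"] == bio_sample_id}


def call_sets_for_read_groups(sets: List[Dict], wanted: Set[str]) -> List[str]:
    """Step 2: keep callsets that used at least one of the wanted read groups."""
    return [cs["id"] for cs in sets if wanted & set(cs["read_group_ids"])]


sample1_read_groups = read_group_ids_for_sample(read_groups, "sample1")
sample1_call_sets = call_sets_for_read_groups(call_sets, sample1_read_groups)
print(sample1_call_sets)
```

The multi-sample callset cs_joint is found through its read groups without ever being labelled as belonging to a single sample.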

The problem is that, in practice, much useful interchange and analysis can be done without the BAM. That means that we need to provide an access pattern for when someone has metadata about samples, but no alignment data.

If the biosample ID were a repeated field in CallSets then we could support the case when metadata is available about a call, but no read alignment. In the case where multiple samples were used to make a call, both biosample IDs would be provided.

To close this issue I suggest we do the following:

Make biosample_ids on CallSets a repeated field
Add a repeated field for read_group_ids on the CallSet message

It would be nice to have a search method for callsets to return any callsets coming from a list of provided read group IDs.
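One way the two repeated fields and the search could fit together is sketched below. The CallSet class and the search_call_sets function are hypothetical stand-ins, not the schema or server API; the matching rule (a callset matches if it shares at least one ID with each supplied filter) is also an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CallSet:
    id: str
    bio_sample_ids: List[str] = field(default_factory=list)  # proposed: repeated
    read_group_ids: List[str] = field(default_factory=list)  # proposed: repeated


def search_call_sets(call_sets: List[CallSet],
                     read_group_ids: Optional[List[str]] = None,
                     bio_sample_ids: Optional[List[str]] = None) -> List[str]:
    """Return IDs of callsets sharing at least one ID with each given filter."""
    wanted_rgs = set(read_group_ids or [])
    wanted_samples = set(bio_sample_ids or [])
    results = []
    for cs in call_sets:
        if wanted_rgs and not wanted_rgs & set(cs.read_group_ids):
            continue
        if wanted_samples and not wanted_samples & set(cs.bio_sample_ids):
            continue
        results.append(cs.id)
    return results


call_sets = [
    CallSet("cs1", ["sample1"], ["rg1", "rg3"]),
    CallSet("cs_joint", ["sample1", "sample2"], ["rg1", "rg2"]),
    CallSet("cs_vcf_only", ["sample2"]),  # sample metadata, but no alignment data
]

by_read_group = search_call_sets(call_sets, read_group_ids=["rg2"])
by_sample = search_call_sets(call_sets, bio_sample_ids=["sample2"])
print(by_read_group, by_sample)
```

The cs_vcf_only entry covers the case raised above: a callset with sample metadata but no BAM is still reachable through bio_sample_ids.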

And I might as well ask, @mbaudis @diekhans: can a read group come from multiple samples? If so, I imagine we should make the biosample_id on readgroups a repeated field as well. This would allow us to model spiking a sample for sequencing nicely.

Feb 23 '17 20:02 david4096
