Calls search endpoint

Open david4096 opened this issue 8 years ago • 8 comments

This PR enables variant call data to be queried directly. Currently, call data are only made available in the variant message itself, which makes large sample sets difficult to work with. By enabling a separate endpoint for querying call data we can greatly improve common access patterns without disturbing the existing method.

The endpoint, calls/search, receives a SearchCallsRequest in which one provides a variant_set_id, call_set_id, or variant_id. This allows pages of Calls matching the search request to be returned directly, rather than embedded in a variant message. To carry out these requests and to support the GetCallSet request, variant_id and id have been added to the Call message.
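
To make the access pattern concrete, here is a rough client-side sketch in Python. Only calls/search, variant_set_id, call_set_id, and variant_id come from this PR; the `page_token` and `next_page_token` fields, the `calls` key in the response, and the `post` transport callable are assumptions made for illustration and may not match the final schema.

```python
from typing import Callable, Dict, Iterator, Optional

# Hypothetical transport: takes an endpoint path and a request body,
# returns the decoded response body as a dict.
Post = Callable[[str, Dict[str, object]], Dict[str, object]]


def build_search_calls_request(
    variant_set_id: Optional[str] = None,
    call_set_id: Optional[str] = None,
    variant_id: Optional[str] = None,
) -> Dict[str, object]:
    """Build a SearchCallsRequest body from the filters named in this PR."""
    filters = {
        "variant_set_id": variant_set_id,
        "call_set_id": call_set_id,
        "variant_id": variant_id,
    }
    # Leave out filters that were not supplied.
    return {name: value for name, value in filters.items() if value is not None}


def iter_calls(post: Post, request: Dict[str, object]) -> Iterator[Dict[str, object]]:
    """Yield Calls page by page until the server stops returning a token."""
    token = None
    while True:
        body = dict(request)  # copy so the caller's request is not mutated
        if token:
            body["page_token"] = token  # assumed paging field
        response = post("calls/search", body)
        for call in response.get("calls", []):  # assumed response key
            yield call
        token = response.get("next_page_token")  # assumed paging field
        if not token:
            break
```

A caller would pass whatever HTTP helper it already uses as `post`, so the paging loop stays independent of the transport.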

The existing method of accessing call data is left in place. However, I believe the general practice of modifying messages at query time is problematic: it leads to a situation where one identifier can describe two different returned messages. If this access pattern proves to serve the existing use cases, we ought to deprecate placing calls in the variant message. For previous discussion on callsets see https://github.com/ga4gh/schemas/issues/583.

david4096 avatar Jul 07 '16 23:07 david4096

+1 this seems like a great improvement after writing some programs that search the calls using the current scheme.

@mbaudis @jeromekelleher @jacmarjorie @diekhans Any thoughts?

kozbo avatar Jul 12 '16 17:07 kozbo

This seems well worth trying out (as in +1).

Is there any reason to leave the array of calls in a Variant object once this endpoint exists?

diekhans avatar Jul 12 '16 18:07 diekhans

@diekhans the calls array is left as is to improve the likelihood of acceptance by easing migration to the new access pattern.

david4096 avatar Jul 12 '16 18:07 david4096

@david4096 - this is a far more intuitive way to query genotype calls. Is variant_set_id to be mandatory, or does this allow non-standard querying across sets? If variant_id is not mandatory, is there any expectation that the calls will be ordered by genomic location? I assume you are aware that @reece and the VMC are looking at this area too.

sarahhunt avatar Jul 13 '16 16:07 sarahhunt

Thanks @sarahhunt. I've added comments stating that variant_set_id is required and the other fields are optional. I have not set any expectation that calls will be returned in order of genomic position.
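
As a rough illustration of that rule, a server or client could check a request before running it. This is only a sketch: the function name and the choice of ValueError are hypothetical, and the request is assumed to be a plain dict keyed by the field names above.

```python
from typing import Mapping


def validate_search_calls_request(request: Mapping[str, object]) -> None:
    """Reject a SearchCallsRequest that lacks the required variant_set_id.

    call_set_id and variant_id are optional filters, so they are not checked.
    """
    if not request.get("variant_set_id"):
        raise ValueError("variant_set_id is required")
```

Because no ordering is promised, callers that need calls in positional order would have to sort the results themselves.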

I am excited to see what the VMC comes up with for GA4GH! I see this PR as a simple iterative change from our current access patterns. I believe that adding some notion of Sequence and/or Allele and making the variant model unary will further clarify the domain. We need to provide a simple bridge from VCF, and I believe this PR is a step in that direction.

david4096 avatar Jul 13 '16 16:07 david4096

@david4096 will demo this approach in a test program before merging

kozbo avatar Jul 14 '16 16:07 kozbo

A very sensible and welcome recommendation!

Thank you and looking forward to the demo, Paul

pgrosu avatar Jul 14 '16 16:07 pgrosu

As per @diekhans's comment, I think removing the callSetIds parameter of variants/search is correct.

david4096 avatar Mar 08 '17 21:03 david4096