ga4gh-schemas
Calls search endpoint
This PR enables variant call data to be queried directly. Currently, call data are only made available in the variant message itself, which makes large sample sets difficult to work with. By enabling a separate endpoint for querying call data we can greatly improve common access patterns without disturbing the existing method.
The endpoint, `calls/search`, receives a `SearchCallsRequest`, in which one provides a `variant_set_id`, `call_set_id`, or `variant_id`. This allows pages of Calls matching the search request to be returned directly, rather than embedded in a variant message. To carry out these requests and to support the GetCallSet request, `variant_id` and `id` have been added to the `Call` message.
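As a rough illustration of the request described above, here is a minimal Python sketch that builds a JSON body for `calls/search`. The helper name `build_search_calls_request`, the example identifiers, and the `page_size`/`page_token` fields are assumptions (the paging fields mirror the other GA4GH search requests); only `variant_set_id`, `call_set_id`, and `variant_id` come from this PR.

```python
import json


def build_search_calls_request(variant_set_id=None, call_set_id=None,
                               variant_id=None, page_size=100,
                               page_token=None):
    """Build a JSON-serializable body for POST calls/search.

    Hypothetical helper: the three *_id filters are the fields named in
    this PR; page_size and page_token are assumed paging fields.
    """
    body = {
        "variant_set_id": variant_set_id,
        "call_set_id": call_set_id,
        "variant_id": variant_id,
        "page_size": page_size,
        "page_token": page_token,
    }
    # Drop unset fields so only the supplied filters are sent.
    return {key: value for key, value in body.items() if value is not None}


# Example identifiers are made up for illustration.
request = build_search_calls_request(variant_set_id="vs-1",
                                     call_set_id="cs-42")
print(json.dumps(request, sort_keys=True))
```

A client would POST this body to `calls/search` and keep re-sending it with the returned page token until no token comes back.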
The existing method of accessing call data is left in place. However, I believe that, in general, modifying messages at query time is problematic: it leads to a situation where one identifier can describe two different returned messages. If this access pattern proves to serve the existing use cases, we ought to deprecate placing calls in a variant message. For previous discussion on callsets see https://github.com/ga4gh/schemas/issues/583.
+1 this seems like a great improvement after writing some programs that search the calls using the current scheme.
@mbaudis @jeromekelleher @jacmarjorie @diekhans Any thoughts?
This seems well worth trying out (as in +1).
Is there any reason to leave the array of calls in a Variant object with this endpoint?
@diekhans the calls array is left as is to improve the likelihood of acceptance by easing migration to the new access pattern.
@david4096 - this is a far more intuitive way to query genotype calls. Is variant_set_id to be mandatory, or does this allow non-standard querying across sets? If variant_id is not mandatory, is there any expectation that the calls will be ordered by genomic location? I assume you are aware that @reece and the VMC are looking at this area too.
Thanks @sarahhunt. I've added comments stating that `variant_set_id` is required and the other fields are optional. I have not set any expectation that calls will be returned in order of genomic position.
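To make those semantics concrete, here is a small in-memory Python sketch of how a server might treat the fields: `variant_set_id` is required, `call_set_id` and `variant_id` are optional filters, and results come back in storage order with no genomic sorting. The function `search_calls`, the dict-based store keyed by variant set, and the offset-style page token are all assumptions for illustration, not the reference server's implementation.

```python
def search_calls(calls_by_variant_set, variant_set_id, call_set_id=None,
                 variant_id=None, page_size=2, page_token=None):
    """Return one page of calls matching the request (hypothetical sketch)."""
    if not variant_set_id:
        raise ValueError("variant_set_id is required")

    # Optional filters only narrow the result when they are provided.
    matches = [
        call for call in calls_by_variant_set.get(variant_set_id, [])
        if (call_set_id is None or call["call_set_id"] == call_set_id)
        and (variant_id is None or call["variant_id"] == variant_id)
    ]

    # Assumed paging scheme: the token is simply the next start offset.
    start = int(page_token) if page_token else 0
    end = start + page_size
    next_token = str(end) if end < len(matches) else None
    return {"calls": matches[start:end], "next_page_token": next_token}


# Made-up data: three calls in one variant set, stored in arbitrary order.
store = {
    "vs-1": [
        {"id": "c1", "variant_id": "v1", "call_set_id": "cs-1"},
        {"id": "c2", "variant_id": "v1", "call_set_id": "cs-2"},
        {"id": "c3", "variant_id": "v2", "call_set_id": "cs-1"},
    ]
}

first_page = search_calls(store, "vs-1", call_set_id="cs-1", page_size=1)
second_page = search_calls(store, "vs-1", call_set_id="cs-1", page_size=1,
                           page_token=first_page["next_page_token"])
```

Because no ordering is promised, a client that needs calls by genomic position would have to sort them itself after fetching.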
I am excited to see what the VMC comes up with for GA4GH! I see this PR as a simple iterative change from our current access patterns. I believe that adding some notion of `Sequence` and/or `Allele` and making the variant model unary will further clarify the domain. We need to provide a simple bridge from VCF, and I believe this PR is a step in that direction.
@david4096 will demo this approach in a test program before merging
A very sensible and welcome recommendation!
Thank you and looking forward to the demo, Paul
As per @diekhans's comment, I think removing the callSetIds parameter of variants/search is correct.