htsget VCF fields specification
As briefly discussed in https://github.com/samtools/hts-specs/pull/385, there seems to be a need to specify which VCF fields to include in the htsget spec, since only BAM fields are mentioned under the /reads API endpoint.
A prior PR on htsget VCF support didn't cover the issue, which is why I'm raising it here, as prompted by @jmarshall. As part of the OpenAPI3 htsget 1.1.1 spec, I've tentatively included these:
enum:
- "INFO"
- "SAMPLE"
- "FILTER"
- "FORMAT"
- "ALT"
But I'm pretty sure this attempt falls short and/or doesn't satisfy common or future use cases. Feedback welcome.
/cc @amilamanoj @jeromekelleher @ohofmann
Wearing my opinionated hacker hat: I've personally been against htsget servers being responsible for filtering these individual fields, because it leads to a need for complex scale-out backends in order to service large workloads. With just range slicing, it's possible to implement an htsget server that does not actually have to process or transport any of the real data in-band, if the files are reachable via HTTPS. This is what htsnexus tries to illustrate, although in doing so it takes the idea to an extreme.
However, as a spec maintainer of course I will approach discussion on this neutrally to the best of my ability :smile:
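To make the range-slicing point concrete, here is a minimal sketch of a server-side helper that answers a request with a ticket of byte ranges instead of the data itself. The `htsget`/`format`/`urls`/`headers` layout follows the htsget ticket response; the function name, the file URL and the byte ranges are made up for illustration (in practice the ranges would come from an index lookup, and a usable BAM slice would need more than this, such as the header block).

```python
import json

def make_ticket(file_url, byte_ranges, fmt="BAM"):
    """Build an htsget-style ticket that points the client at byte ranges
    of a file reachable over HTTPS, so the server never touches the data."""
    urls = [
        {"url": file_url, "headers": {"Range": f"bytes={start}-{end}"}}
        for start, end in byte_ranges
    ]
    return {"htsget": {"format": fmt, "urls": urls}}

# Hypothetical URL and byte offsets, standing in for an index lookup.
ticket = make_ticket(
    "https://example.org/data/sample.bam",
    [(0, 4095), (812000, 934999)],
)
print(json.dumps(ticket, indent=2))
```

Nothing in this path decodes records, which is why per-field filtering changes the picture: it would force the server to parse and re-encode the stream.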
+1 on your opinionated hat :)
The upside of removing those filtering bits (perhaps also the BAM ones?) from the spec is that it would all be more like a plain, simple HTTP client/server, so less custom code on both ends.
How `fields` is supposed to work is pretty underspecified in the spec. I assume it's supposed to be an interface to CRAM implementations' ability to avoid decoding specified column blocks and/or to database-based servers' ability to pick and choose which fields to supply.
When returning SAM/BAM, this would look like the non-requested fields had decayed to `0`/`*`, and when returning CRAM perhaps those blocks could be omitted from the returned stream. Servers that didn't support it would just return the full SAM/BAM/CRAM, and clients might see that the non-requested columns were not `0`/`*`, but that would be okay because the client is ignoring them anyway.
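As a rough sketch of what "decaying" could mean for a single SAM record: non-requested mandatory columns become `0` (integer columns) or `*` (string columns). The function name and the choice to leave optional tags alone are my assumptions, not something the spec states. Using `0` for MAPQ follows the wording above, although SAM itself uses 255 for an unavailable mapping quality.

```python
# The eleven mandatory SAM columns, in order.
SAM_COLUMNS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
               "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INTEGER_COLUMNS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}

def blank_unrequested(sam_line, requested):
    """Replace every mandatory column not in `requested` with 0 or *.
    Optional tags after column 11 are passed through untouched."""
    cols = sam_line.rstrip("\n").split("\t")
    for i, name in enumerate(SAM_COLUMNS):
        if name not in requested:
            cols[i] = "0" if name in INTEGER_COLUMNS else "*"
    return "\t".join(cols)

record = "r001\t99\tchr1\t7\t30\t8M\t=\t37\t39\tTTAGATAA\t*\tNM:i:0"
print(blank_unrequested(record, {"QNAME", "RNAME", "POS"}))
```

A server that ignores `fields` would simply return `record` unchanged, and a client that only reads QNAME, RNAME and POS would not notice the difference.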
For VCF we could say that `fields` should not be specified for the variants endpoint. Or we could list all the fields as possibilities to be dropped (to `.`), so that database-based servers could implement this functionality, on the understanding that file-based servers are just going to ignore it and fill in those fields anyway.
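A sketch of the second option for one VCF data line, assuming only the fixed columns that accept `.` as a missing value (ID, ALT, QUAL, FILTER, INFO) can be dropped, and that CHROM, POS and REF are always kept. That split, the function name, and leaving FORMAT and the sample columns untouched are my assumptions for illustration; nothing in the spec defines this yet.

```python
# The eight fixed VCF columns, in order.
VCF_FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]
# Assumption: only columns where "." is a legal missing value may be dropped.
DROPPABLE = {"ID", "ALT", "QUAL", "FILTER", "INFO"}

def drop_unrequested(vcf_line, requested):
    """Set droppable fixed columns not in `requested` to the missing value '.'.
    FORMAT and sample columns, if present, are passed through untouched."""
    cols = vcf_line.rstrip("\n").split("\t")
    for i, name in enumerate(VCF_FIXED_COLUMNS):
        if name in DROPPABLE and name not in requested:
            cols[i] = "."
    return "\t".join(cols)

line = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14"
print(drop_unrequested(line, {"ALT", "FILTER"}))
```

How FORMAT and SAMPLE should behave when dropped is the harder question, since the sample columns only make sense alongside FORMAT and the `#CHROM` header line.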