hts-specs icon indicating copy to clipboard operation
hts-specs copied to clipboard

htsget should require that returned SAM/VCF headers include `@SQ`/`##contig`

Open brainstorm opened this issue 3 years ago • 14 comments

See https://github.com/igvteam/igv.js/issues/1187 for context.

TL;DR:

The basic problem is there is no way to discover what referenceNames are present in a dataset, and calling the service with a wrong reference name is a 400 error! This is a very nasty thing to do to clients. As everyone knows there are 2 common reference naming conventions, "chr1" and "1", and using the "wrong" one WRT any given dataset will throw an error. Ditto, I assume, querying for a genomic range that is outside of the dataset. This is not an error IMO, there's simply nothing "there" so an object noting an empty result should be returned.

/cc @mlin

brainstorm avatar Jun 12 '21 11:06 brainstorm

The basic problem is there is no way to discover what referenceNames are present in a dataset

As noted in https://github.com/samtools/hts-specs/pull/311#issue-187673445 which motivated their addition, class=header requests provide this referenceName discovery mechanism.

jmarshall avatar Jun 12 '21 12:06 jmarshall

@jmarshall The "class-header" option works great for reads, thanks for that, but how does it work for variants? AFAIK the only required line in a VCF header is the format directive, and I don't recall seeing a VCF with referenceNames listed in the header.

jrobinso avatar Jun 12 '21 15:06 jrobinso

The relevant VCF header is ##contig (§1.4.7 in the v4.3 spec), which “is recommended for VCF, and required for BCF”. In my experience, it's pretty commonly supplied in the wild (though perhaps with no more additional subfields than just ##contig=<ID=chr1>).

I think there has been talk of making ##contig headers required in future VCF versions, but I can't find any reference to that in hts-specs issues or PRs.

You are correct that it would behoove htsget to specify that the VCF headers returned by a class=header request (or perhaps any variant request) SHOULD/MUST include ##contig headers, even if VCF itself does not require that in general.

IMHO that would be a more practical approach than inventing another endpoint and requiring servers to divine this information from the underlying data in some other way than looking at these headers. (~~Note also that something similar applies for reads: in SAM (only) it's valid for @SQ headers to be omitted, though it's very uncommon as e.g. samtools barely supports it.~~ Htsget returns only in BAM or CRAM, so no issue due to lack of @SQ text headers arises.)

jmarshall avatar Jun 12 '21 15:06 jmarshall

@jmarshall Thanks, that would work for me (##contig). I just need the names, so even the short form would work. I will build in workarounds if they are missing, which obviously I have already done for alignments.

jrobinso avatar Jun 12 '21 16:06 jrobinso

@jmarshall BTW servers must already be "divining" the reference names in a dataset since they are supposed to return an error if a request is made for a reference name that is not present. This was at the root of my requests from the beginning, if you are going to return an error shouldn't there be a way to determine what is legal in advance? In any event I have long implemented workarounds, even without the "header" option, so you don't need to keep this open for me. Thanks for your responses.

jrobinso avatar Jun 12 '21 16:06 jrobinso

BTW servers must already be "divining" the reference names in a dataset since they are supposed to return an error if a request is made for a reference name that is not present.

File-based servers will typically be using a bai/csi/tbi index to return results for requested referenceNames and coordinates; doing such an individual lookup by name is a typical operation on such an index. Enumerating all the available referenceNames is not a typical operation on these indexes (and is not possible for some indexes), so would require additional implementation. This is what I mean by additional divination being required.

Anyway, for the 1 vs chr1 problem, the real solution is for VCF to gain an equivalent of SAM's @SQ-AN field and for both to become widely deployed…

This was at the root of my requests from the beginning, if you are going to return an error shouldn't there be a way to determine what is legal in advance?

As you know, there has been a way to determine that for a little over two years.

jmarshall avatar Jun 15 '21 22:06 jmarshall

@jmarshall It's been more than 2 years since I last worked on this until recently, so I'm getting back up to speed. Yes there is a way for BAM files, but not for VCF, the optional ##contig header not withstanding. Currently I'm handling it this way, if there's a mismatch on a VCF file that lacks the ##contig header too bad, its just going to be an error.

jrobinso avatar Jun 15 '21 23:06 jrobinso

There's a way for BAM files, but not for VCF

Please be explicit. What do you consider is lacking for VCF? (Other than the minor “my VCF file does not have ##contig headers” issue that this issue now represents. I believe such VCF files are unusual; the common case is for these headers to be present.)

jmarshall avatar Jun 15 '21 23:06 jmarshall

@jmarshall Nothing, that's it, the optional ##contig header. I'm sure you're right, they are unusual, unusual files are way over represented in my world because they generally end up as IGV help tickets. In this case I'll wait for such a ticket and just assume ##contig will always be there.

jrobinso avatar Jun 15 '21 23:06 jrobinso

Totally confused about what issue I'm responding to via email, sorry for confusion. All is good for now, if htsget is extended to other formats this might need revisited.

jrobinso avatar Jun 15 '21 23:06 jrobinso

Header requests work whether or not the files contain @SQ/##contig headers; they just may not supply the information the client is looking for, but that can be considered to be not htsget's fault. So IMHO htsget should not make hard requirements around the files it serves, but it would be useful to clarify all this by adding some (non-normative) text expanding on what header requests can be used for — e.g. something like

Requests for headers return a data stream containing read or variant headers in the format specified by format, if present, or in the respective default format. They can, for example, be used to discover the list of referenceName values applicable for this data stream, provided the data stream contains @SQ or ##contig headers.

jmarshall avatar Jul 07 '21 13:07 jmarshall

Just as a random context, in non-htsget with tabix'ed VCF files, we used the names inside the VCF tabix file for discovery of reference sequence names. Indexes would not be available in htsget of course, but it would be nice to have some expectation (if not guarantee) that the refnames would be in a header request

cmdcolin avatar Jul 07 '21 14:07 cmdcolin

It would be possible to invent another form of request (e.g. class=referenceNames?) that would return the names as divined by some other method such as from tabix indexes. But htsget is parsimonious and tries to avoid invention, hence used the already-present data stream headers to enable referenceName discovery to be done, as that was considered to be the minimal design.

The reasons for adding text like that are

  1. Clarify to authors of clients that the response is in format not just textual SAM (but see also #580)
  2. Jog readers' memory about ##contig headers providing this information in VCF
  3. Strongly hint to the operators of htsget servers that the variant data streams presented by their services ought to include ##contig headers.

Item 3 represents the expectation you mention; YMMV how best to express that in spec text.

jmarshall avatar Jul 07 '21 14:07 jmarshall

I'm imagining two forms of 400 error that this could be applied to:

  • InvalidHeaderFormat: header does not include information about contigs or references
  • InvalidReferenceName: requested reference name is not included in <pysam.VariantFile.VariantHeader.contigs or similar for alignments>

I can't find any indication that there is a easy way to access just the reference_names from an AlignmentHeader, but it would presumably be easy to implement in any htsget server.

daisieh avatar Sep 16 '21 05:09 daisieh