hts-specs
hts-specs copied to clipboard
Define Local Alleles in VCF to allow for sparser format
This is the local-alleles part of #420
Hi all, I was motivated to come back here after another collaborator just run aground on the htslib int overflow when it tries to allocate a 2GB+ PL vector :smile: :sob:
It seems like we are not far; the open items are
- Must the LAA allele indices for each sample be strictly increasing? I advocate yes, to make reading LPL less tricky, but the trickiness burden is shifted to the writer.
- Precise language for specifying the order of LPL values, if anyone agrees with me it's worthwhile to leave no doubt.
- Move the item describing LAA, LAD, and LPL into alphabetical position within the itemized list. (+ misc wording and punctuation edits)
I think this is a nice new feature, but it took me many minutes of very close reading to figure out just what the feature is! If I understand it correctly, it could perhaps be better described as:
-
The table of reserved keys is just a brief taster / memory jogger of what each field is about, so: LAA List of relevant Local Alternate Alleles / LAD Read depth for each local allele / LPL Phred-scaled local genotype likelihoods rounded to the closest integer
-
The text in the bulleted paragraph motivating this doesn't mention that the point of all this is to only list “interesting” alleles on a per-sample basis because each sample has a different small subset of alt alleles that are present for that sample. The text suggested in https://github.com/samtools/hts-specs/pull/434#discussion_r307968349 is rather better.
-
It needs a complete example so you can see what is happening. Something like
… REF ALT … FORMAT SAMPLE1 SAMPLE2
… G A,C,T,<*> … GT:LAA:LAD:LPL 1/1:2,4:20,30,10:1,2,3,4,5,6 1/1:3:15,25:1,2,3
thus demonstrating the different subsets per sample.
- The numbers are all represented as
.
, but the paragraph text should explain the relationships between them. So:
LAA is a list of n integers, where 1 <= n <= the number of ALT alleles, giving the (1-based) indices within ALT of the alleles that are observed in the sample. [And a sentence about the ordering if so decided.] (Uh — can n be 0?)
LAD is a list of n+1 integers giving read depths (as per AD) for the REF allele and each of the local alleles as listed in LAA.
LPL is a list of n+1 choose P (or whatever the combinatoric calculation is!) integers giving phred-scaled genotype likelihoods (rounded to the closest integer; as per PL) for all possible genotypes given the set of alleles defined in the REF and LAA local alleles. [And refer to the precise ordering defined in the GL paragraph.]
(Writing out all that for each field might mean they would be better as separate paragraphs (as per https://github.com/samtools/hts-specs/pull/434#discussion_r308164979) or it might be better to keep them together to collaborate on the same example better…)
Similarly, in rare sites, which can be the bulk of the sites, the vast majority of the samples are reference.
This is a good point; we should probably clarify how one should write LAA & LPL for samples reporting no evidence for any particular alternate allele.
Some misc things I've been thinking about while implementing this:
- ./. calls are tricky to convert to local alleles in a joined gvcf because we don't know which subset of alleles is relevant, in worst case you may have to include every allele in LAA or analyze the PL's to see which ones are not actually meaningful.
- Requiring LAA to be sorted is annoying when merging files.
- Would it be a good idea to allow leaving out LAA in when the local alleles include all of the alleles at the site? Benefits: potentially space saving, makes single sample files in local allele format cleaner. Cons: easier to make mistakes when merging files.
Re (3) Would it be a good idea to allow leaving out LAA in when the local alleles include all of the alleles at the site?
Wouldn't sites where all samples include all the alleles just use GT/AD/PL rather than LAA/LGT/LAD/LPL? (And this would obviously be allowed as there is no text saying a file has to have all or none of its variant lines using LAA, and the default assumption would be that a mixture is fine.)
Or do you mean: when some samples include all the alleles and others don't, instead of writing 1,2,3,…,N
for LAA for the all-alleles samples perhaps allow some special value to be an abbreviation for that? e.g. .
but frankly that would be an unexpected meaning for .
…