hts-specs Define Local Alleles in VCF to allow for sparser format

This is the local-alleles part of #420

Jul 24 '19 18:07 yfarjoun

Changed PDFs as of 970a87b: VCFv4.3 (diff).

Jul 24 '19 18:07 hts-specs-bot

Changed PDFs as of 4bc4299: VCFv4.3 (diff).

Jul 24 '19 20:07 hts-specs-bot

Hi all, I was motivated to come back here after another collaborator just run aground on the htslib int overflow when it tries to allocate a 2GB+ PL vector :smile: :sob:

It seems like we are not far; the open items are

Must the LAA allele indices for each sample be strictly increasing? I advocate yes, to make reading LPL less tricky, but the trickiness burden is shifted to the writer.
Precise language for specifying the order of LPL values, if anyone agrees with me it's worthwhile to leave no doubt.
Move the item describing LAA, LAD, and LPL into alphabetical position within the itemized list. (+ misc wording and punctuation edits)

Sep 08 '19 08:09 mlin

I think this is a nice new feature, but it took me many minutes of very close reading to figure out just what the feature is! If I understand it correctly, it could perhaps be better described as:

The table of reserved keys is just a brief taster / memory jogger of what each field is about, so: LAA List of relevant Local Alternate Alleles / LAD Read depth for each local allele / LPL Phred-scaled local genotype likelihoods rounded to the closest integer
The text in the bulleted paragraph motivating this doesn't mention that the point of all this is to only list “interesting” alleles on a per-sample basis because each sample has a different small subset of alt alleles that are present for that sample. The text suggested in https://github.com/samtools/hts-specs/pull/434#discussion_r307968349 is rather better.
It needs a complete example so you can see what is happening. Something like

… REF  ALT        …  FORMAT          SAMPLE1                       SAMPLE2
… G    A,C,T,<*>  …  GT:LAA:LAD:LPL  1/1:2,4:20,30,10:1,2,3,4,5,6  1/1:3:15,25:1,2,3

thus demonstrating the different subsets per sample.

The numbers are all represented as ., but the paragraph text should explain the relationships between them. So:

LAA is a list of n integers, where 1 <= n <= the number of ALT alleles, giving the (1-based) indices within ALT of the alleles that are observed in the sample. [And a sentence about the ordering if so decided.] (Uh — can n be 0?)

LAD is a list of n+1 integers giving read depths (as per AD) for the REF allele and each of the local alleles as listed in LAA.

LPL is a list of n+1 choose P (or whatever the combinatoric calculation is!) integers giving phred-scaled genotype likelihoods (rounded to the closest integer; as per PL) for all possible genotypes given the set of alleles defined in the REF and LAA local alleles. [And refer to the precise ordering defined in the GL paragraph.]

(Writing out all that for each field might mean they would be better as separate paragraphs (as per https://github.com/samtools/hts-specs/pull/434#discussion_r308164979) or it might be better to keep them together to collaborate on the same example better…)

Sep 09 '19 15:09 jmarshall

Changed PDFs as of 4f131de: VCFv4.3 (diff).

Sep 16 '19 15:09 hts-specs-bot

Changed PDFs as of 0c65b48: VCFv4.3 (diff).

Oct 07 '19 16:10 hts-specs-bot

Similarly, in rare sites, which can be the bulk of the sites, the vast majority of the samples are reference.

This is a good point; we should probably clarify how one should write LAA & LPL for samples reporting no evidence for any particular alternate allele.

Oct 07 '19 16:10 mlin

Changed PDFs as of 605e8ec: VCFv4.3 (diff).

Oct 07 '19 17:10 hts-specs-bot

Changed PDFs as of fbf87d3: VCFv4.3 (diff).

Oct 07 '19 18:10 hts-specs-bot

Some misc things I've been thinking about while implementing this:

./. calls are tricky to convert to local alleles in a joined gvcf because we don't know which subset of alleles is relevant, in worst case you may have to include every allele in LAA or analyze the PL's to see which ones are not actually meaningful.
Requiring LAA to be sorted is annoying when merging files.
Would it be a good idea to allow leaving out LAA in when the local alleles include all of the alleles at the site? Benefits: potentially space saving, makes single sample files in local allele format cleaner. Cons: easier to make mistakes when merging files.

Dec 02 '19 20:12 lbergelson

Re (3) Would it be a good idea to allow leaving out LAA in when the local alleles include all of the alleles at the site?

Wouldn't sites where all samples include all the alleles just use GT/AD/PL rather than LAA/LGT/LAD/LPL? (And this would obviously be allowed as there is no text saying a file has to have all or none of its variant lines using LAA, and the default assumption would be that a mixture is fine.)

Or do you mean: when some samples include all the alleles and others don't, instead of writing 1,2,3,…,N for LAA for the all-alleles samples perhaps allow some special value to be an abbreviation for that? e.g. . but frankly that would be an unexpected meaning for .…

Dec 05 '19 10:12 jmarshall

Changed PDFs as of a5d41dc: VCFv4.3 (diff).

Aug 11 '20 17:08 hts-specs-bot

hts-specs hts-specs copied to clipboard

Define Local Alleles in VCF to allow for sparser format

hts-specs
hts-specs copied to clipboard