tskit
tskit copied to clipboard
Variant method to get all alleles as a string
There are often times when we'd like to get the genotypes for a given site as the actual alleles rather than the indexes into the alleles list. (This is what #2168 was about, I assume.)
We can add this as a method to the Variant class easily enough. We can just raise an error if there are any non-1 length alleles in there.
What do we call it?
This is the replacement also for the old as_bytes option, which basically did the same thing.
It might look something like
def genotypes_as_alleles(self): -> str
"""
Returns a string s in which s[j] is the value of ``var.alleles[var.genotypes[j]]``. Raises an error if
all alleles are not of length 1.
"""
Note that some of the changes made to the test suite for #2172 can be made simpler using this functionality.
Might it be nice to have a general method:
tskit.genotypes_as_alleles(alleles, genotypes), tskit.genotypes_as_ragged_alleles(alleles, genotypes) then
Variant.genotypes_as_alleles would use this?
Good idea - I guess the distinction is whether you want to returned value to be a numpy array or a string. There's definitely value in getting a numpy array back too.
@hyanwong - any thoughts on names here? I feel like this is quite a handy feature and we should support "encoding" the variant data in string or numpy format. We definitely shouldn't call the it "encode" though, as we already have the "decode" method which does something quite different.
The reason I wanted this was to be able to compare sites from different encodings of the same tree sequence (which therefore could have alleles in a different order. I "hacked" around this in https://github.com/tskit-dev/tsinfer/issues/652#issuecomment-1084693017 by using ts.genotype_matrix(alleles=tskit.ALLELES_ACGT), but it isn't an obvious thing to do, and I can imagine it catching out other people who want to compare the sites-by-samples matrix, or do sitewise comparisons. So I think this is handy, yes. Another possibility for my use case would be the ability to compare the variant genotypes with each other, so that you could do (say) the equivalent of ts1.variant(4).identical_genotypes(ts2.variant(4)) or similar notation.
Re naming: genotypes_as_alleles does what it says on the tin, so that seems OK. My only reservation is that it might be a bit cryptic to those not well versed in the tskit distinction between "genotype" (an index value) and "alleles" (the list being indexed). Also note that we are using "genotype" in a rather nonstandard way here, because that's usually considered an unphased property in a diploid (your genotype can be homozygous A, heterozygous, or homozygous B).
By the way, you say "encode", but I wonder if most people would think of this as a decoding of the indexing scheme? I can't think of any brilliant alternatives which avoid "encode/decode", but here are some ideas: tskit.translate_genotypes(a, g), tskit.true_genotypes(a, g), tskit.site_variation(a, g), tskit.allelic_variation(a, g), tskit.allele_states(a, g), tskit.allele_values(a, g), tskit.genotype_values(a, g).
p.s. a numpy array would be great: then you wouldn't need to bomb out on non-single-character alleles, I guess?
This is what https://github.com/tskit-dev/tskit/pull/2617 is about