tskit icon indicating copy to clipboard operation
tskit copied to clipboard

Variant method to get all alleles as a string

Open jeromekelleher opened this issue 3 years ago • 6 comments
trafficstars

There are often times when we'd like to get the genotypes for a given site as the actual alleles rather than the indexes into the alleles list. (This is what #2168 was about, I assume.)

We can add this as a method to the Variant class easily enough. We can just raise an error if there are any non-1 length alleles in there.

What do we call it?

This is the replacement also for the old as_bytes option, which basically did the same thing.

It might look something like

def genotypes_as_alleles(self): -> str
     """
     Returns a string s in which s[j] is the value of ``var.alleles[var.genotypes[j]]``. Raises an error if 
     all alleles are not of length 1.
     """

jeromekelleher avatar Apr 05 '22 12:04 jeromekelleher

Note that some of the changes made to the test suite for #2172 can be made simpler using this functionality.

jeromekelleher avatar Apr 05 '22 12:04 jeromekelleher

Might it be nice to have a general method: tskit.genotypes_as_alleles(alleles, genotypes), tskit.genotypes_as_ragged_alleles(alleles, genotypes) then Variant.genotypes_as_alleles would use this?

benjeffery avatar Apr 05 '22 13:04 benjeffery

Good idea - I guess the distinction is whether you want to returned value to be a numpy array or a string. There's definitely value in getting a numpy array back too.

jeromekelleher avatar Apr 05 '22 13:04 jeromekelleher

@hyanwong - any thoughts on names here? I feel like this is quite a handy feature and we should support "encoding" the variant data in string or numpy format. We definitely shouldn't call the it "encode" though, as we already have the "decode" method which does something quite different.

jeromekelleher avatar Apr 06 '22 08:04 jeromekelleher

The reason I wanted this was to be able to compare sites from different encodings of the same tree sequence (which therefore could have alleles in a different order. I "hacked" around this in https://github.com/tskit-dev/tsinfer/issues/652#issuecomment-1084693017 by using ts.genotype_matrix(alleles=tskit.ALLELES_ACGT), but it isn't an obvious thing to do, and I can imagine it catching out other people who want to compare the sites-by-samples matrix, or do sitewise comparisons. So I think this is handy, yes. Another possibility for my use case would be the ability to compare the variant genotypes with each other, so that you could do (say) the equivalent of ts1.variant(4).identical_genotypes(ts2.variant(4)) or similar notation.

Re naming: genotypes_as_alleles does what it says on the tin, so that seems OK. My only reservation is that it might be a bit cryptic to those not well versed in the tskit distinction between "genotype" (an index value) and "alleles" (the list being indexed). Also note that we are using "genotype" in a rather nonstandard way here, because that's usually considered an unphased property in a diploid (your genotype can be homozygous A, heterozygous, or homozygous B).

By the way, you say "encode", but I wonder if most people would think of this as a decoding of the indexing scheme? I can't think of any brilliant alternatives which avoid "encode/decode", but here are some ideas: tskit.translate_genotypes(a, g), tskit.true_genotypes(a, g), tskit.site_variation(a, g), tskit.allelic_variation(a, g), tskit.allele_states(a, g), tskit.allele_values(a, g), tskit.genotype_values(a, g).

hyanwong avatar Apr 06 '22 09:04 hyanwong

p.s. a numpy array would be great: then you wouldn't need to bomb out on non-single-character alleles, I guess?

hyanwong avatar Apr 06 '22 09:04 hyanwong

This is what https://github.com/tskit-dev/tskit/pull/2617 is about

hyanwong avatar Feb 21 '23 10:02 hyanwong