tsinfer
Variant method to return actual states
@szhan and others are finding it pretty inconvenient to deal with mismatches between the underlying integers in the genotypes arrays returned by sample_data.variants() and ts.variants(). How about we provide a method on the two variant classes that returns the encoded variation as a numpy string array? It would be inefficient for large-scale work, but I think it could prevent many errors in smaller-scale testing and the like. Something like the following would probably work for SampleData instances, and an equivalent function could be created for tskit variants. Hopefully, making it a method rather than an attribute would make it clear to the user that a potentially inefficient calculation is going on under the hood.
```python
import attr
import numpy as np


@attr.s
class Variant:
    """
    A single variant. Mirrors the definition in tskit.
    """

    site = attr.ib()
    genotypes = attr.ib()
    alleles = attr.ib()

    def genotypes_as_strings(self):
        """
        Returns the genotypes at this site as an array of allele strings. Note,
        however, that it is much more efficient to work with the underlying
        integer representation as returned by the ``.genotypes`` property.
        """
        # Index the allele strings by the integer genotypes (NumPy fancy indexing).
        return np.array(self.alleles)[self.genotypes]
```
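To illustrate why comparing allele strings is less error-prone than comparing raw integers, here is a minimal sketch using plain NumPy. The standalone helper `genotypes_as_strings(alleles, genotypes)` and the example allele tuples are made up for illustration; they are not taken from tsinfer or tskit, and the sketch simply assumes that the two sources list the same alleles in a different order.

```python
import numpy as np


def genotypes_as_strings(alleles, genotypes):
    """Hypothetical standalone helper: map integer genotypes to allele strings."""
    return np.array(alleles)[np.asarray(genotypes)]


# Illustrative data only: the same states at one site, encoded against
# two different allele orderings (e.g. one per data source).
sd_alleles, sd_genotypes = ("A", "T"), np.array([0, 1, 1, 0], dtype=np.int8)
ts_alleles, ts_genotypes = ("T", "A"), np.array([1, 0, 0, 1], dtype=np.int8)

# Comparing the raw integers reports a mismatch...
ints_equal = np.array_equal(sd_genotypes, ts_genotypes)

# ...whereas comparing the decoded states shows the data agree.
states_equal = np.array_equal(
    genotypes_as_strings(sd_alleles, sd_genotypes),
    genotypes_as_strings(ts_alleles, ts_genotypes),
)
print(ints_equal, states_equal)
```

Note that a negative genotype value would index from the end of the allele list under this scheme, so any special values would need handling according to how the allele list is laid out.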
This is useful all right, we use something like this in a bunch of places.
Adding to 0.3.1, as this is a trivial but useful addition.
+1 on this as I got very confused writing the sgkit ancestral allele tests.
Over in https://github.com/tskit-dev/tskit/pull/2617, @jeromekelleher suggested we call this method .states()