Variant method to return actual states

Open hyanwong opened this issue 3 years ago • 4 comments

@szhan and others are finding it pretty inconvenient to deal with mismatches between the underlying integers in the genotypes arrays returned by sample_data.variants() and ts.variants(). How about providing a method on the two variant classes that returns the encoded variation as a numpy string array? It would be inefficient for large-scale work, but I think it might prevent many errors in smaller-scale testing, etc. Something like the following would probably work for SampleData instances, and an equivalent function could be created for tskit variants. Hopefully making it a function (rather than a property) would make it clear to the user that a potentially inefficient calculation is going on under the hood.

import attr
import numpy as np


@attr.s
class Variant:
    """
    A single variant. Mirrors the definition in tskit.
    """
    site = attr.ib()
    genotypes = attr.ib()
    alleles = attr.ib()

    def genotypes_as_strings(self):
        """
        Returns the variants at this site as an array of strings. Note, however,
        that it is much more efficient to work with the underlying integer
        representation as returned by the ``.genotypes`` property.
        """
        # Index the allele strings by the integer genotype codes.
        return np.array(self.alleles)[self.genotypes]
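
To illustrate the kind of mismatch described above, here is a minimal sketch using plain numpy arrays rather than real SampleData or tree sequence objects. The allele lists, genotype arrays and the `as_states` helper are all made up for the example; it assumes no missing data is present.

```python
import numpy as np

# Hypothetical data: the same four samples at one site, but the two sources
# list their alleles in a different order, so the integer codes differ.
alleles_a = ("A", "T")
genotypes_a = np.array([0, 1, 1, 0])
alleles_b = ("T", "A")
genotypes_b = np.array([1, 0, 0, 1])


def as_states(alleles, genotypes):
    """Illustrative helper: map integer genotype codes to allele strings."""
    return np.array(alleles)[genotypes]


states_a = as_states(alleles_a, genotypes_a)
states_b = as_states(alleles_b, genotypes_b)

# Comparing the raw integers suggests the two sources disagree...
ints_equal = np.array_equal(genotypes_a, genotypes_b)
# ...whereas comparing the decoded states shows they hold the same data.
states_equal = np.array_equal(states_a, states_b)
print(ints_equal, states_equal)
```

Comparing decoded states instead of raw integers avoids having to remap one allele list onto the other by hand, at the cost of building a string array per site.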

hyanwong, Oct 14 '22 15:10

This is useful all right, we use something like this in a bunch of places.

jeromekelleher, Oct 14 '22 18:10

Adding to 0.3.1, as this is a trivial but useful addition.

hyanwong, Oct 26 '22 10:10

+1 on this as I got very confused writing the sgkit ancestral allele tests.

benjeffery, Jan 24 '23 14:01

Over in https://github.com/tskit-dev/tskit/pull/2617 @jeromekelleher suggested we call this method .states().

hyanwong, Jan 24 '23 14:01