tsinfer icon indicating copy to clipboard operation
tsinfer copied to clipboard

Class method to create simple VariantData files for demos

Open hyanwong opened this issue 1 year ago • 2 comments

It's pretty helpful for teaching purposes to be able to demonstrate small tsinfer input files without needing a VCF. I wonder if we could define a class method like this:

class VariantData:
    ...
    def from_arrays(
        cls,
        path,
        variant_positions,
        variant_matrix_phased,
        alleles,
        ancestral_allele,
        sample_id=None,
    ):
        """
        Create an VariantData instance directly from array data. Only
        useful for small datasets. Larger datasets should use e.g. bio2zarr
        to create a zarr datastore containing the required data and call
        VariantData(path_to_zarr)

        :param path str: The path used to store the data
        :param variant_positions array: a 1D array of variant positions
        :param variant_matrix_phased array: a 3D array of variants X samples x ploidy,
            giving an index into the allele array for each corresponding variant. Values must
            be coercable into 8-bit (np.int8) integers. Data for all samples is assumed to be phased
        :param alleles array: a 2D string array of variants x max_num_alleles at a site.
            Each allele list for a variant must be the same length, equal to the
            maximum value in the `variant_matrix_phased` array. Each allele list can be
            padded with `""` to ensure the correct length
        :param ancestral_allele array: a 1D string array specifying the ancestral allele
            for each variant. For unknown ancestral alleles, any character which is not
            in the allele list can be used.
        :param sample_id: a 1D string array of sample names. If None, samples
            (each corresponding to `ploidy` variant values) will be allocated the IDs
            "0", "1", "2", .. etc.
        """
        call_genotype=np.array(variant_matrix_phased, dtype=np.int8)
        if sample_id is None:
            sample_id = np.arange(call_genotype.shape[1]).astype(str)
        # assume all phased
        call_genotype_phased = np.ones(call_genotype.shape[:2], dtype=bool)
        zarr.save_group(
            path,
            variant_position=np.array(pos),
            call_genotype=call_genotype,
            call_genotype_phased=call_genotype_phased,
            variant_allele=np.array(alleles),
            variant_ancestral_allele=np.array(ancestral_allele),
            sample_id=sample_id,
        )
        return cls(path)

Then we could use it like this:

# For demo only: larger examples could pass `data` as a large numpy array directly
pos, data, alleles, ancestral = list(zip(*[
    (3,  [[0, 1], [0, 0], [0, 0]], ["A", "T", ""], "A"),
    (10, [[0, 1], [1, 1], [0, 0]], ["C", "A", ""], "C"),
    (13, [[0, 1], [1, 0], [0, 0]], ["G", "C", ""], "-"),
    (19, [[0, 0], [0, 1], [1, 0]], ["A", "C", ""], "A"),
    (20, [[0, 1], [2, 0], [0, 0]], ["T", "G", "C"], "T"),
]))
sd = tsinfer.VariantData.from_arrays("test.vcz", pos, data, alleles, ancestral)
ts = tsinfer.infer(sd)

hyanwong avatar May 23 '24 22:05 hyanwong

We're going to need something like this for testing anyway - even better if it was just an in-memory store.

jeromekelleher avatar May 23 '24 22:05 jeromekelleher

This could also be an alternative to https://github.com/sgkit-dev/bio2zarr/issues/232, as a method for getting a tree sequence into VariantData format. It would be useful to have an in-memory way to feed the data from a tree sequence into tsinfer, which I think is part of https://github.com/tskit-dev/tsinfer/issues/783

hyanwong avatar Aug 30 '24 09:08 hyanwong