tsinfer
tsinfer copied to clipboard
Class method to create simple VariantData files for demos
It's pretty helpful for teaching purposes to be able to demonstrate small tsinfer input files without needing a VCF. I wonder if we could define a class method like this:
class VariantData:
...
def from_arrays(
cls,
path,
variant_positions,
variant_matrix_phased,
alleles,
ancestral_allele,
sample_id=None,
):
"""
Create an VariantData instance directly from array data. Only
useful for small datasets. Larger datasets should use e.g. bio2zarr
to create a zarr datastore containing the required data and call
VariantData(path_to_zarr)
:param path str: The path used to store the data
:param variant_positions array: a 1D array of variant positions
:param variant_matrix_phased array: a 3D array of variants X samples x ploidy,
giving an index into the allele array for each corresponding variant. Values must
be coercable into 8-bit (np.int8) integers. Data for all samples is assumed to be phased
:param alleles array: a 2D string array of variants x max_num_alleles at a site.
Each allele list for a variant must be the same length, equal to the
maximum value in the `variant_matrix_phased` array. Each allele list can be
padded with `""` to ensure the correct length
:param ancestral_allele array: a 1D string array specifying the ancestral allele
for each variant. For unknown ancestral alleles, any character which is not
in the allele list can be used.
:param sample_id: a 1D string array of sample names. If None, samples
(each corresponding to `ploidy` variant values) will be allocated the IDs
"0", "1", "2", .. etc.
"""
call_genotype=np.array(variant_matrix_phased, dtype=np.int8)
if sample_id is None:
sample_id = np.arange(call_genotype.shape[1]).astype(str)
# assume all phased
call_genotype_phased = np.ones(call_genotype.shape[:2], dtype=bool)
zarr.save_group(
path,
variant_position=np.array(pos),
call_genotype=call_genotype,
call_genotype_phased=call_genotype_phased,
variant_allele=np.array(alleles),
variant_ancestral_allele=np.array(ancestral_allele),
sample_id=sample_id,
)
return cls(path)
Then we could use it like this:
# For demo only: larger examples could pass `data` as a large numpy array directly
pos, data, alleles, ancestral = list(zip(*[
(3, [[0, 1], [0, 0], [0, 0]], ["A", "T", ""], "A"),
(10, [[0, 1], [1, 1], [0, 0]], ["C", "A", ""], "C"),
(13, [[0, 1], [1, 0], [0, 0]], ["G", "C", ""], "-"),
(19, [[0, 0], [0, 1], [1, 0]], ["A", "C", ""], "A"),
(20, [[0, 1], [2, 0], [0, 0]], ["T", "G", "C"], "T"),
]))
sd = tsinfer.VariantData.from_arrays("test.vcz", pos, data, alleles, ancestral)
ts = tsinfer.infer(sd)
We're going to need something like this for testing anyway - even better if it was just an in-memory store.
This could also be an alternative to https://github.com/sgkit-dev/bio2zarr/issues/232, as a method for getting a tree sequence into VariantData format. It would be useful to have an in-memory way to feed the data from a tree sequence into tsinfer, which I think is part of https://github.com/tskit-dev/tsinfer/issues/783