tsinfer
tsinfer copied to clipboard
Extra sample nodes being created from .vcz file when ploidies are uneven
If we have a mix of ploidies in a .vcz file, we create sample nodes for the maximum ploidy size for each individual, that just attach to the root (because they have no data). I.e. we make a huge polytomy at the root of "fake" samples. I'm not even sure how we identify these samples to delete them.
There are a couple of people at the Oslo workshop who are working on haplodiploids, and others who are working on mixed policy systems in plants (e.g. @situssog), so this would be good to fix.
Here's an example:
import tempfile
import os
import sys
import subprocess
import msprime
import pysam
from Bio import bgzf
# make some haploids, diploids, and tetraploids
ts = msprime.sim_ancestry(
[msprime.SampleSet(2, ploidy=1), msprime.SampleSet(3, ploidy=2), msprime.SampleSet(1, ploidy=4)],
population_size=1e3,
sequence_length=1e4,
recombination_rate=1e-6,
)
ts = msprime.sim_mutations(ts, rate=1e-7)
zarr_file_name = "tmp.vcz"
with tempfile.TemporaryDirectory() as tmpdirname:
vcf_name = os.path.join(tmpdirname, zarr_file_name.replace("/", "") + ".vcf.gz")
with bgzf.open(vcf_name, "wt") as f:
ts.write_vcf(f)
pysam.tabix_index(vcf_name, preset="vcf")
subprocess.run([sys.executable, "-m", "bio2zarr", "vcf2zarr", "convert", "--force", vcf_name, zarr_file_name])
vdata = tsinfer.VariantData("tmp.vcz", ancestral_allele=np.array([s.ancestral_state for s in ts.sites()]))
inferred_ts = tsinfer.infer(vdata)
print("ts.num_samples:", ts.num_samples, "-- inferred_ts.num_samples:", inferred_ts.num_samples)
Yes, mixed ploidy isn't supported for now. I guess detecting the not-present samples and then masking them out is the only way to do this.
Ah yes: https://sgkit-dev.github.io/sgkit/0.5.0/vcf.html#polyploid-and-mixed-ploidy-vcf. It looks like vcf2zarr doesn't have a mixed_ploidy option like sgkit.io.vcf.vcf_to_zarr does. I assume this is something we eventually plan to incorporate? And in the same way, we would set a sentinel of -2 for each genotype call?
I guess detecting the not-present samples and then masking them out is the only way to do this.
Do you mean the only way to do this at the moment, or the only foreseeable way forward in the long term?
Mixed_ploidy isn't a property of the dataset, and was introduced in sgkit as a way of forcing calls to appear as if they are not mixed ploidy (from the output of certain variant callers on human datasets). I think we would need to tell tsinfer explicitly what ploidy to expect for each sample to support this properly.
We could add a ploidy= argument in SampleData, which could be a single number or an array (one value for each sample)?
The .vcz file "knows" the ploidy in each case, though, as it has a -2 in the right place. E.g. here is the call_genotype array for the example above:
array([[[ 1, -2, -2, -2],
[ 1, -2, -2, -2],
[ 0, 0, -2, -2],
[ 0, 1, -2, -2],
[ 0, 0, -2, -2],
[ 0, 1, 0, 0]],
[[ 0, -2, -2, -2],
[ 0, -2, -2, -2],
[ 0, 0, -2, -2],
[ 1, 0, -2, -2],
[ 0, 0, -2, -2],
[ 1, 0, 0, 1]],
[[ 0, -2, -2, -2],
[ 0, -2, -2, -2],
[ 0, 0, -2, -2],
[ 0, 1, -2, -2],
[ 0, 0, -2, -2],
[ 0, 0, 0, 0]],
[[ 0, -2, -2, -2],
[ 0, -2, -2, -2],
[ 1, 1, -2, -2],
[ 0, 0, -2, -2],
[ 0, 1, -2, -2],
[ 0, 0, 0, 1]],
[[ 0, -2, -2, -2],
[ 0, -2, -2, -2],
[ 0, 0, -2, -2],
[ 0, 0, -2, -2],
[ 0, 0, -2, -2],
[ 0, 0, 0, 1]]], dtype=int8)
Is it not just a question of the VariantData object correctly inferring the ploidy of each sample from the -2 values?
That involves a full scan of the data, though, as you have to examine every element of the genotype matrix to find any -2s. That's definitely not an operation we want to be done "implicitly".
It's not that big a chore for a user who already knows the ploidy of their samples to supply it as an argument.
We could have a flag in VariantData to say to "check the first variant for non--2 values and use that as the ploidy"? I don't think -2 is used for other stuff, is it (missing data is -1). The users I was speaking to will probably just be pulling in VCFs from somewhere else, and I suspect it would be a tedious extra step to find the sample IDs and figure out the ploidy, when it's all in the VCF anyway.
Some other options:
- a VariantData function,
check_ploidythat "fixes" the problem on an optional basis - some sort of check when we do go though the variants (e.g. when building ancestors or matching samples) that the ploidies are correct?
Another option, which makes @jeromekelleher's suggestion easier:
- A function which extracts the ploidies for all samples from a single variant, which could then be used as the input to the VariantData constructor, without having to figure out sample IDs etc.
data = tsinfer.VariantData(
"tmp.vcz",
ploidies = tsinfer.ploidies_from_vcz("tmp.vcz", variant_id=0)
)
Option 3 is straightforward enough all right.
It might also be a good idea to warn on creation of the VariantData object if the first variant has any -2 values and a ploidy argument is not specified?
Can't do it on create - it involves reading all values and can take 10s of minutes.