tsinfer icon indicating copy to clipboard operation
tsinfer copied to clipboard

Extra sample nodes being created from .vcz file when ploidies are uneven

Open hyanwong opened this issue 1 year ago • 10 comments

If we have a mix of ploidies in a .vcz file, we create sample nodes for the maximum ploidy size for each individual, that just attach to the root (because they have no data). I.e. we make a huge polytomy at the root of "fake" samples. I'm not even sure how we identify these samples to delete them.

There are a couple of people at the Oslo workshop who are working on haplodiploids, and others who are working on mixed policy systems in plants (e.g. @situssog), so this would be good to fix.

Here's an example:

import tempfile
import os
import sys
import subprocess
import msprime
import pysam
from Bio import bgzf


# make some haploids, diploids, and tetraploids
ts = msprime.sim_ancestry(
    [msprime.SampleSet(2, ploidy=1), msprime.SampleSet(3, ploidy=2), msprime.SampleSet(1, ploidy=4)],
    population_size=1e3,
    sequence_length=1e4,
    recombination_rate=1e-6,
)
ts = msprime.sim_mutations(ts, rate=1e-7)
zarr_file_name = "tmp.vcz"
with tempfile.TemporaryDirectory() as tmpdirname:
    vcf_name = os.path.join(tmpdirname, zarr_file_name.replace("/", "") + ".vcf.gz")
    with bgzf.open(vcf_name, "wt") as f:
        ts.write_vcf(f)
    pysam.tabix_index(vcf_name, preset="vcf")
    subprocess.run([sys.executable, "-m", "bio2zarr", "vcf2zarr", "convert", "--force", vcf_name, zarr_file_name])

vdata = tsinfer.VariantData("tmp.vcz", ancestral_allele=np.array([s.ancestral_state for s in ts.sites()]))
inferred_ts = tsinfer.infer(vdata)
print("ts.num_samples:", ts.num_samples, "-- inferred_ts.num_samples:", inferred_ts.num_samples)

hyanwong avatar Aug 20 '24 15:08 hyanwong

Yes, mixed ploidy isn't supported for now. I guess detecting the not-present samples and then masking them out is the only way to do this.

benjeffery avatar Aug 21 '24 00:08 benjeffery

Ah yes: https://sgkit-dev.github.io/sgkit/0.5.0/vcf.html#polyploid-and-mixed-ploidy-vcf. It looks like vcf2zarr doesn't have a mixed_ploidy option like sgkit.io.vcf.vcf_to_zarr does. I assume this is something we eventually plan to incorporate? And in the same way, we would set a sentinel of -2 for each genotype call?

I guess detecting the not-present samples and then masking them out is the only way to do this.

Do you mean the only way to do this at the moment, or the only foreseeable way forward in the long term?

hyanwong avatar Aug 22 '24 10:08 hyanwong

Mixed_ploidy isn't a property of the dataset, and was introduced in sgkit as a way of forcing calls to appear as if they are not mixed ploidy (from the output of certain variant callers on human datasets). I think we would need to tell tsinfer explicitly what ploidy to expect for each sample to support this properly.

We could add a ploidy= argument in SampleData, which could be a single number or an array (one value for each sample)?

jeromekelleher avatar Aug 27 '24 10:08 jeromekelleher

The .vcz file "knows" the ploidy in each case, though, as it has a -2 in the right place. E.g. here is the call_genotype array for the example above:

array([[[ 1, -2, -2, -2],
        [ 1, -2, -2, -2],
        [ 0,  0, -2, -2],
        [ 0,  1, -2, -2],
        [ 0,  0, -2, -2],
        [ 0,  1,  0,  0]],

       [[ 0, -2, -2, -2],
        [ 0, -2, -2, -2],
        [ 0,  0, -2, -2],
        [ 1,  0, -2, -2],
        [ 0,  0, -2, -2],
        [ 1,  0,  0,  1]],

       [[ 0, -2, -2, -2],
        [ 0, -2, -2, -2],
        [ 0,  0, -2, -2],
        [ 0,  1, -2, -2],
        [ 0,  0, -2, -2],
        [ 0,  0,  0,  0]],

       [[ 0, -2, -2, -2],
        [ 0, -2, -2, -2],
        [ 1,  1, -2, -2],
        [ 0,  0, -2, -2],
        [ 0,  1, -2, -2],
        [ 0,  0,  0,  1]],

       [[ 0, -2, -2, -2],
        [ 0, -2, -2, -2],
        [ 0,  0, -2, -2],
        [ 0,  0, -2, -2],
        [ 0,  0, -2, -2],
        [ 0,  0,  0,  1]]], dtype=int8)

Is it not just a question of the VariantData object correctly inferring the ploidy of each sample from the -2 values?

hyanwong avatar Sep 01 '24 21:09 hyanwong

That involves a full scan of the data, though, as you have to examine every element of the genotype matrix to find any -2s. That's definitely not an operation we want to be done "implicitly".

It's not that big a chore for a user who already knows the ploidy of their samples to supply it as an argument.

jeromekelleher avatar Sep 02 '24 08:09 jeromekelleher

We could have a flag in VariantData to say to "check the first variant for non--2 values and use that as the ploidy"? I don't think -2 is used for other stuff, is it (missing data is -1). The users I was speaking to will probably just be pulling in VCFs from somewhere else, and I suspect it would be a tedious extra step to find the sample IDs and figure out the ploidy, when it's all in the VCF anyway.

Some other options:

  1. a VariantData function, check_ploidy that "fixes" the problem on an optional basis
  2. some sort of check when we do go though the variants (e.g. when building ancestors or matching samples) that the ploidies are correct?

hyanwong avatar Sep 02 '24 09:09 hyanwong

Another option, which makes @jeromekelleher's suggestion easier:

  1. A function which extracts the ploidies for all samples from a single variant, which could then be used as the input to the VariantData constructor, without having to figure out sample IDs etc.
data = tsinfer.VariantData(
    "tmp.vcz",
    ploidies = tsinfer.ploidies_from_vcz("tmp.vcz", variant_id=0)
)

hyanwong avatar Sep 02 '24 09:09 hyanwong

Option 3 is straightforward enough all right.

jeromekelleher avatar Sep 02 '24 10:09 jeromekelleher

It might also be a good idea to warn on creation of the VariantData object if the first variant has any -2 values and a ploidy argument is not specified?

hyanwong avatar Sep 02 '24 11:09 hyanwong

Can't do it on create - it involves reading all values and can take 10s of minutes.

jeromekelleher avatar Sep 02 '24 12:09 jeromekelleher