tsinfer
tsinfer copied to clipboard
SGkit equivalent for tutorial read_vcf
I think the read_vcf function for real data in the tutorial should now go something like this (using SGKit)
from sgkit.io.vcf import vcf_to_zarr
import sgkit as sg
import tsinfer
import numpy as np
from dask.diagnostics import ProgressBar
with ProgressBar(): # will only show a progress bar for long conversions
vcf_to_zarr(
"https://github.com/tskit-dev/tsinfer/raw/main/docs/_static/P_dom_chr24_phased.vcf.gz",
"output.zarr",
fields = ["INFO/*", "FORMAT/*"], # saves the AA field for ancestral alleles into variant_AA
)
# load and specify the ancestral allele for tsinfer
ds = sg.load_dataset("output.zarr")
# Some VCFs have "|" in the Ancestral Allele field: take the bit before the "|"
ancestral = np.char.partition(ds.variant_AA.astype(str), sep="|")
ds.update({'variant_ancestral_allele': ancestral[:,0]})
n_char_rem = np.char.str_len(np.char.strip(ancestral, " |"))[:,2]
if np.any(n_char_rem > 0):
raise ValueError(f"Multiple ancestral alleles at sites {np.where(n_char_rem)[0]}")
# Could also check here that the ancestral alleles are sensible
# Append only the new variables to the existing dataset (yuck)
sg.save_dataset(ds.drop_vars(set(ds.data_vars)- {"variant_ancestral_allele"}), "output.zarr", mode="a")
sd = tsinfer.SgkitSampleData(path="output.zarr")
ts = tsinfer.infer(sd)
The ds.drop_vars(set(ds.data_vars)- {"variant_ancestral_allele"}, mode="a" line is pretty horrible to explain, though. Is there a better way to append an array to an existing zarr file? (edit: issue found at https://github.com/pystatgen/sgkit/discussions/1000 - looks like we need to wait for an SGkit update)
I'm not sure how to get the metadata into the sample data file yet, though
Also need to add checks on allele types, and maybe to enforce uppercase alleles in the variant array (or somehow ensure that if the ancestral allele case ("A") is different to the case in the call_genotype array ("a") is isn't treated as a different allele.
Hopefully this would get picked up when the routine spits out 'couldn't find "A" in alleles list'
We can test on e.g. VCFs from 1000G (https://github.com/tskit-dev/tsinfer/issues/683#issuecomment-1163465893), using e.g.
from dask.diagnostics import ProgressBar
with ProgressBar():
vcf_to_zarr(
"Downloads/ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz",
"output.zarr",
regions=["21:1-10000000"], # it takes a long time to read a big VCF, so only read the first 10Mb
fields = ["INFO/AA", "FORMAT/GT"], # saves the AA field for ancestral alleles into variant_AA
)
If we want to read in the 1000G directly, we need a way to mask out duplicate positions: https://github.com/pystatgen/sgkit/issues/1174
Let't not go overboard with documenting this yet, the VCF code based on Dask for sgkit has been very problematic and I'm looking into replacing it.
Thank, useful to know. It's worth keeping this issue open as a potential example to modify, I guess.
And, just for reference, as long as we mask out the duplicate positions, we can complete a round of tsinfer on the standard 1000G dataset (with ancestral alleles encoded in the VCF), like this. This will need small changes if/when the vcf-reading path in sgkit changes (see @jeromekelleher 's comment above)
from sgkit.io.vcf import vcf_to_zarr
import sgkit as sg
import tsinfer
import numpy as np
from dask.diagnostics import ProgressBar
with ProgressBar():
vcf_to_zarr(
"/Users/yan/Downloads/ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz",
"output.zarr",
regions=["21:1-15000000"],
fields = ["INFO/AA", "FORMAT/GT"], # saves the AA field for ancestral alleles into variant_AA
)
# load and specify the ancestral allele for tsinfer
ds = sg.load_dataset("output.zarr")
# Mask sites which have duplicate or out-of-order site positions
variant_mask = np.diff(ds.variant_position.values, prepend=-1) > 0
if np.any(np.diff(ds.variant_position.values[variant_mask] <= 0)):
raise ValueError("Some variants are out of order in the VCF file")
else:
print(f"Masked {np.sum(variant_mask == False)} variants at duplicate positions")
ds.update({'variant_mask': variant_mask})
ancestral = np.char.partition(ds.variant_AA.astype(str), sep="|")
ds.update({'variant_ancestral_allele': ancestral[:,0]})
# Append only the new variables to the existing dataset
sg.save_dataset(ds.drop_vars(set(ds.data_vars)- {"variant_ancestral_allele", "variant_mask"}), "output.zarr", mode="a")
sd = tsinfer.SgkitSampleData(path="output.zarr")
ts = tsinfer.infer(sd, progress_monitor=True)