tsinfer
How to save an SgkitSampleData instance, e.g. for running the CLI
It appears that it's not possible to save an SgkitSampleData instance to a path. I'm therefore not sure how I might run the CLI on a zarr file if I've specified bespoke masks, ancestral alleles, or whatever via numpy arrays (see #923).
I guess the easy way around this is to have a function that saves all the information such as bespoke masks / ancestral alleles into the zarr file (or make a copy of it if the original zarr is read-only?), and allow the CLI to run directly on that modified zarr:
tsinfer infer demo.vcz -O demo.trees
A more complex possibility for the user is to have the CLI accept the same parameters as tsinfer.SgkitSampleData(...), but then we might want to allow either numpy files or names in the .vcz file, which seems a bit icky, e.g.
tsinfer infer demo.vcz --variant_mask my_vmask_file.npy --ancestral_allele variant_AA -O demo.trees
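To make the "file or array name" ambiguity concrete, here is a minimal sketch of how such options might be told apart. It is not the real tsinfer CLI: the parser name `tsinfer-sketch`, the `classify` helper, and the rule "a value ending in .npy or .npz is a file, anything else is an array name inside the .vcz" are all assumptions made for illustration.

```python
import argparse

# Assumed rule: values with these suffixes are on-disk NumPy files; anything
# else is taken to be the name of an array stored inside the .vcz file.
FILE_SUFFIXES = (".npy", ".npz")


def classify(value):
    """Tag one option value as ("file", value) or ("array", value)."""
    if value.lower().endswith(FILE_SUFFIXES):
        return ("file", value)
    return ("array", value)


def build_parser():
    # Hypothetical parser mirroring the command line proposed above.
    parser = argparse.ArgumentParser(prog="tsinfer-sketch")
    parser.add_argument("vcz")
    parser.add_argument("--variant_mask", type=classify, default=None)
    parser.add_argument("--ancestral_allele", type=classify, default=None)
    parser.add_argument("-O", "--output", default=None)
    return parser


args = build_parser().parse_args(
    [
        "demo.vcz",
        "--variant_mask", "my_vmask_file.npy",
        "--ancestral_allele", "variant_AA",
        "-O", "demo.trees",
    ]
)
print(args.variant_mask, args.ancestral_allele)
```

The suffix rule is exactly the icky part: an array inside the .vcz that happened to be named something ending in ".npy" would be misread as a file, so a real design would probably want explicit, separate options.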
I guess another possibility would be to provide an input CSV file (or Zarr) which is formatted with position, ancestral_allele, variant_age etc, which then populates the arrays appropriately.
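A rough sketch of the CSV idea, using only the standard library. The column names follow the ones mentioned above (position, ancestral_allele, variant_age); the function name `load_site_table`, the use of `None` for missing values, and the alignment by position are assumptions, not anything tsinfer currently provides.

```python
import csv
import io


def load_site_table(handle, positions):
    """Align per-site CSV rows to `positions`, returning two parallel lists.

    Sites absent from the CSV, or with an empty field, get None.
    """
    by_position = {}
    for row in csv.DictReader(handle):
        by_position[int(row["position"])] = row

    ancestral = []
    ages = []
    for pos in positions:
        row = by_position.get(pos)
        if row is None:
            ancestral.append(None)
            ages.append(None)
            continue
        ancestral.append(row["ancestral_allele"] or None)
        age = row["variant_age"]
        ages.append(float(age) if age else None)
    return ancestral, ages


# Hypothetical input: the site at 200 has no row, and the site at 250 has no age.
example = io.StringIO(
    "position,ancestral_allele,variant_age\n"
    "100,A,12.5\n"
    "250,T,\n"
)
ancestral, ages = load_site_table(example, [100, 200, 250])
print(ancestral, ages)
```

Keying on position rather than row order means the CSV does not have to list every site in the .vcz, at the cost of needing a rule for duplicate positions (here the last row wins).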
Good point. This could even be a .npz file with the appropriately named variables. I guess that's basically the same as another zarr file. Probably best if @benjeffery weighs in with what he thinks would work best.
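For the .npz variant, something like the following might work. `np.savez` and `np.load` are standard NumPy calls; the array names `variant_mask` and `ancestral_allele`, the `load_overrides` helper, and the length check are assumptions for illustration only.

```python
import os
import tempfile

import numpy as np

# Hypothetical set of array names the loader would accept.
EXPECTED_NAMES = ("variant_mask", "ancestral_allele")


def load_overrides(path, num_sites):
    """Read named per-site arrays from an .npz file and check their length."""
    overrides = {}
    with np.load(path) as data:
        for name in data.files:
            if name not in EXPECTED_NAMES:
                raise ValueError(f"unexpected array {name!r} in {path}")
            array = data[name]
            if array.shape != (num_sites,):
                raise ValueError(f"{name} has shape {array.shape}, expected ({num_sites},)")
            overrides[name] = array
    return overrides


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "overrides.npz")
    # Write one array per named variable, as suggested above.
    np.savez(
        path,
        variant_mask=np.array([False, True, False]),
        ancestral_allele=np.array(["A", "C", "T"]),
    )
    overrides = load_overrides(path, num_sites=3)

print(sorted(overrides))
```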
Here's a snippet from the docs I am trying to write
vcf2zarr explode demo.vcf.gz /tmp/demo.exploded
vcf2zarr encode /tmp/demo.exploded demo.vcz
# Here: how can I specify the ancestral state via a simple CLI command???
tsinfer generate-ancestors demo.vcz # saves to demo.ancestors
tsinfer match-ancestors demo.vcz # saves to demo.ancestors.trees
tsinfer match-samples demo.vcz # saves to demo.trees
To parallelise the vcf2zarr steps you may wish to explore the --worker-processes option, or even split over partitions. To parallelise the tsinfer steps you may wish to investigate the --progress and --num_threads options.
Thoughts about how to specify on the command-line to use variant_AA as ancestral state, given a demo VCF with ancestral alleles embedded in the AA field, would be most welcome.
A JSON or yaml config file specifying inference parameters?
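A minimal sketch of what reading such a JSON config could look like. The key names are borrowed from the options mentioned earlier in this thread, but the allowed set, the `parse_config` helper, and the choice to reject unknown keys are assumptions, not an existing tsinfer feature.

```python
import json

# Hypothetical set of recognised keys, taken from options named in this thread.
ALLOWED_KEYS = {"variant_mask", "ancestral_allele", "num_threads", "progress"}


def parse_config(text):
    """Parse a JSON config string, rejecting unknown keys so typos fail loudly."""
    config = json.loads(text)
    if not isinstance(config, dict):
        raise ValueError("config must be a JSON object")
    unknown = sorted(set(config) - ALLOWED_KEYS)
    if unknown:
        raise ValueError(f"unknown config keys: {unknown}")
    return config


# Use the array named variant_AA inside the .vcz as the ancestral allele.
example = '{"ancestral_allele": "variant_AA", "num_threads": 4}'
config = parse_config(example)
print(config)
```

A config file keeps the command line short, but the question of whether a value like "variant_AA" names an array in the .vcz or a file on disk still has to be settled by the schema.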