tsinfer
Better error message when "ds" passed directly to VariantData
Savita and I just tried this incorrect incantation, which fails with a cryptic assertion error from assert self.ploidy == self.data["call_genotype"].chunks[2]:
import sgkit
import tsinfer
ds = sgkit.load_dataset("_static/example_data.vcz")
vdata = tsinfer.VariantData(ds, ancestral_state="ancestral_state")
The following do work as intended, however:
vdata = tsinfer.VariantData("_static/example_data.vcz", ancestral_state="ancestral_state")
# or
import zarr
z = zarr.open("_static/example_data.vcz")
vdata = tsinfer.VariantData(z, ancestral_state="ancestral_state")
I think we should either detect that the first parameter passed in is an sgkit dataset object and load it properly, or issue a more useful error message.
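A minimal sketch of the second option, under stated assumptions: `reject_xarray_input` is a hypothetical helper, not part of the tsinfer API, and it recognises xarray objects only by the module their class is defined in, so it needs no xarray import. It raises a TypeError that points at the two forms shown above that do work.

```python
def reject_xarray_input(source):
    """Hypothetical pre-check, not part of the tsinfer API.

    Raises a TypeError with an actionable message when `source` looks like
    an xarray object (e.g. the Dataset returned by sgkit.load_dataset).
    Anything else is returned unchanged.
    """
    # Inspect the defining module instead of importing xarray, so the check
    # works even when xarray is not installed.
    module = getattr(type(source), "__module__", "") or ""
    if module.split(".")[0] == "xarray":
        raise TypeError(
            f"Got an xarray {type(source).__name__}; pass the path to the "
            ".vcz store or a zarr group opened with zarr.open() instead."
        )
    return source
```

Called at the top of a constructor, a check like this would turn the assertion failure into a message that names the problem. Whether tsinfer should reject the object or try to load it is a separate design question.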
Hi guys
I've been getting this error message when I try to pass my xarray dataset to tsinfer.
My data come from malariagen_data (see here for the spec and construction).
I appreciate it's probably quite an isolated use case (e.g. most people are going to convert from VCF -> sgkit zarr -> tsinfer variant_data), but it would be fantastic if we could load our data into tsinfer.
I'll have a poke around and see what the actual issue is / if I can force our data to work with tsinfer also, but if there were any quick pointers it would be much appreciated!
Cheers!
I don't think we're going to support Xarray as an input format ultimately, as it pulls in too much complexity. We should definitely provide better error messages though, as the current situation is not at all helpful.
Can you provide the Zarr directly instead @tristanpwdennis, or is there some postprocessing done on the Xarray dataset that you want reflected in the tsinfer input? Would be great to know your use-case better here so we know how to support it.
The API basically reads data from multiple zarrs and concatenates them depending on user queries (e.g. subsetting or selecting specific cohorts/sample sets), so providing the zarr directly is hard.
All of the data are in the correct format (e.g. I think our data are mostly aligned with the vcf_zarr spec), so it's just a matter of reorganising them slightly.
The main issue from the tsinfer side, I think, is that when I provide the dataset to tsinfer.VariantData, tsinfer throws an error whenever it encounters an xr.DataArray. A way around this could be to check the format and, if it is a DataArray, bring it into the correct format (e.g. usually a numpy ndarray) and check the dtype. Now I see what you mean here about bringing in too much complexity.
A workable solution for the time being is to write our data to a zarr matching the vcf_zarr spec (easy) and then load that again as input. Would be nice to avoid having to write the data out, but it'll definitely do in the meantime.
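A rough sketch of that round trip, with assumptions flagged: `variant_data_via_zarr` is a hypothetical helper name, `ds` is assumed to be an xarray Dataset whose variable names and chunking already match the vcf_zarr spec, and the `ancestral_state` argument simply mirrors the calls earlier in this thread. Whether a store written this way satisfies every check in tsinfer has not been verified here.

```python
def variant_data_via_zarr(ds, path, ancestral_state="ancestral_state"):
    """Hypothetical helper: write `ds` to a zarr store, then load it by path.

    Assumes `ds` already follows the vcf_zarr layout; no renaming or
    rechunking is attempted here.
    """
    # Imported lazily so the helper can be defined without tsinfer installed.
    import tsinfer

    # xarray's Dataset.to_zarr writes the dataset to disk; mode="w"
    # overwrites any existing store at `path`.
    ds.to_zarr(path, mode="w")

    # Passing the path (rather than the Dataset) is the form reported to
    # work at the top of this thread.
    return tsinfer.VariantData(path, ancestral_state=ancestral_state)
```

The cost is the extra write to disk noted above, which may be significant for large cohorts.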
In time I think this is probably something I could try and solve by adding a VariantData constructor into the malariagen_data API, rather than for you guys to spend time on what will probably be quite a niche use case.
Hope this all makes sense. Thanks for the great library!