Add tskit as import format?
Anyone object to us adding tskit as an import format, so we have an sgkit-tskit repo? I'm happy to do the coding here, and I think it'll be a useful way to crystallise our general data import strategy.
We could also do export to tskit, in principle, using tsinfer, but I'm not sure there's much point.
No objections, but would it be possible to have sgkit export in tskit itself? sgkit is a light dependency...
We could, I'm totally open to that. Although I'd need to look carefully at how light it is - we don't currently depend on dask, zarr etc in tskit, which do pull in quite a lot of stuff.
Following the discussion in last week's call (#637), I put together some code to convert from tskit TreeSequences to Zarr files following sgkit conventions: https://github.com/pystatgen/sgkit/compare/main...tomwhite:ts_to_zarr. The code is just in the tests directory for the moment.
This should be useful for the work on scaling VCF, although for the moment it only works in sequential mode. It should be possible to write the genotypes in parallel though, chunked by variant and sample.
Since TreeSequences are immutable, and lightweight in terms of memory usage, each chunk being written could have a copy of the TreeSequence. Chunking in the samples dimension should be straightforward, since the variants() method takes a subset of samples.
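To make the sample-chunking idea concrete, here is a minimal sketch. It assumes that the `samples` argument of `variants()` accepts an array of sample node IDs and that each yielded variant exposes a `genotypes` array. The helper names `sample_chunks` and `genotypes_for_samples` are hypothetical, and `ts` only needs to provide `num_sites`, `samples()` and `variants(samples=...)`.

```python
import numpy as np


def sample_chunks(samples, chunk_size):
    """Yield consecutive slices of the sample node IDs (hypothetical helper)."""
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]


def genotypes_for_samples(ts, sample_ids):
    """Return a (num_sites, len(sample_ids)) int8 block of allele indexes.

    Assumes `ts` behaves like a tskit TreeSequence; each worker could hold
    its own copy of the tree sequence and call this for one sample chunk.
    """
    sample_ids = np.asarray(sample_ids, dtype=np.int32)
    block = np.empty((ts.num_sites, len(sample_ids)), dtype=np.int8)
    # Only the requested samples are decoded for each variant.
    for row, variant in enumerate(ts.variants(samples=sample_ids)):
        block[row] = variant.genotypes
    return block
```

Each block returned for a chunk of `ts.samples()` would then correspond to one column range of the genotype array, so blocks can be written independently of each other.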
Chunking in the variants dimension is more of a challenge, since for Zarr we need to have equal-sized chunks. I think this may be possible by creating equal-sized slices of tables.sites.position and using keep_intervals() to create each variant chunk.
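A rough sketch of how the interval boundaries could be derived from the site positions is below. It uses only numpy; the function name `variant_chunk_intervals` is hypothetical, and it assumes the positions are sorted and unique, as they are in `tables.sites.position`. Every chunk holds exactly `chunk_size` sites except possibly the last one.

```python
import numpy as np


def variant_chunk_intervals(positions, sequence_length, chunk_size):
    """Return an (n_chunks, 2) array of half-open [left, right) intervals,
    each containing chunk_size sites (the final interval may hold fewer)."""
    positions = np.asarray(positions, dtype=np.float64)
    if len(positions) == 0:
        return np.empty((0, 2), dtype=np.float64)
    # The left edge of each chunk is the position of its first site ...
    lefts = positions[::chunk_size].copy()
    # ... except the first chunk, which starts at 0 so the intervals tile the genome.
    lefts[0] = 0.0
    # Each chunk ends where the next one begins; the last ends at sequence_length.
    rights = np.append(lefts[1:], sequence_length)
    return np.column_stack([lefts, rights])


# Assumed usage with tskit (not executed here): one variant chunk per interval,
#   intervals = variant_chunk_intervals(ts.tables.sites.position,
#                                       ts.sequence_length, chunk_size)
#   chunk_ts = ts.keep_intervals(intervals[k:k + 1])
```

It would be worth checking that the default `simplify=True` behaviour of `keep_intervals()` does not drop sites that have no mutations, since that would change the number of variants in a chunk.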
I'm not sure where such a tool should live - any thoughts @jeromekelleher?
Thanks a million for this @tomwhite, it's super helpful. I'll take a good look at the code tomorrow.
In the short term, I think the simplest thing is for me to create a standalone "ga4gh-variant-sim" repo or something, where we just put the code to do this. We'll want to add extra fields and stuff that are extrinsic to the tree sequence, so it's as easy to just put all the code in one place for now. I might make a start on this tomorrow, and maybe we could add your code for doing the translation to sgkit in there? Over time, we can see what a more mature interface might look like, and perhaps add it to tsconvert.
In terms of parallelism by variants, this is definitely a weakness in the tskit API at the moment, we want to add some way of doing this well.
> I think the simplest thing is for me to create a standalone "ga4gh-variant-sim" repo or something, where we just put the code to do this.
Sounds good to me.
> In terms of parallelism by variants, this is definitely a weakness in the tskit API at the moment, we want to add some way of doing this well.
BTW I pushed another commit to https://github.com/pystatgen/sgkit/compare/main...tomwhite:ts_to_zarr for reading genotypes from a TreeSequence in parallel using the approach I sketched out above. It passes tests, but I don't know how well it performs for larger datasets. It's probably worth having both the sequential and parallel versions for the moment.
Resurrecting this issue, as it's something that's come up on a number of fronts recently:
- Converting tskit variant data to sgkit is very useful for benchmarking (doing this in the paper repo at the moment in the roundabout and slow way of tskit->vcf->sgkit)
- People are increasingly interested in doing calculations with simulated pedigree data, coming through the tskit "individual table" (See this discussion: https://github.com/tskit-dev/pyslim/discussions/325)
I think it would open up sgkit to a lot of applications in simulation if we had a simple and performant way of converting the full tskit data model to sgkit.
Here's a comment on converting the pedigree model from @timothymillar: https://github.com/tskit-dev/tskit/discussions/2711#discussioncomment-5069895
In practice, I think we can add a sgkit.io.tskit subpackage. The question then becomes whether we add a hard dependency on tskit, or whether we make it an optional dependency, like the other IO packages. FWIW, tskit is as lightweight as we can make it, currently depending only on:
jsonschema>=3.0.0
numpy>=1.7
svgwrite>=1.1.10
We don't actually do much with svgwrite, and are planning to remove it at some point. It does look like jsonschema pulls in some additional stuff though:
Installing tskit onto a minimal sgkit install gives:
(tmp-venv) jk@holly$ python3 -m pip install tskit
Collecting tskit
Using cached tskit-0.5.5-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Requirement already satisfied: numpy>=1.7 in ./tmp-venv/lib/python3.9/site-packages (from tskit) (1.24.4)
Collecting jsonschema>=3.0.0
Using cached jsonschema-4.19.0-py3-none-any.whl (83 kB)
Collecting svgwrite>=1.1.10
Using cached svgwrite-1.4.3-py3-none-any.whl (67 kB)
Collecting jsonschema-specifications>=2023.03.6
Using cached jsonschema_specifications-2023.7.1-py3-none-any.whl (17 kB)
Collecting rpds-py>=0.7.1
Using cached rpds_py-0.9.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting attrs>=22.2.0
Using cached attrs-23.1.0-py3-none-any.whl (61 kB)
Collecting referencing>=0.28.4
Using cached referencing-0.30.2-py3-none-any.whl (25 kB)
Installing collected packages: rpds-py, attrs, referencing, jsonschema-specifications, svgwrite, jsonschema, tskit
Successfully installed attrs-23.1.0 jsonschema-4.19.0 jsonschema-specifications-2023.7.1 referencing-0.30.2 rpds-py-0.9.2 svgwrite-1.4.3 tskit-0.5.5
So, probably best to make it an optional dependency?
Coming back to this point:
> No objections, but would it be possible to have sgkit export in tskit itself? sgkit is a light dependency...
It's not that light, really - this is the minimal install on a fresh venv:
Successfully installed MarkupSafe-2.1.3 asciitree-0.3.3 click-8.1.6 cloudpickle-2.2.1 dask-2023.8.0 dask-glm-0.2.0 dask-ml-2023.3.24 distributed-2023.8.0 entrypoints-0.4 fasteners-0.18 fsspec-2023.6.0 importlib-metadata-6.8.0 jinja2-3.1.2 joblib-1.3.2 llvmlite-0.40.1 locket-1.0.0 msgpack-1.0.5 multipledispatch-1.0.0 numba-0.57.1 numcodecs-0.11.0 numpy-1.24.4 packaging-23.1 pandas-2.0.3 partd-1.4.0 psutil-5.9.5 python-dateutil-2.8.2 pytz-2023.3 pyyaml-6.0.1 scikit-learn-1.3.0 scipy-1.11.1 sgkit-0.7.0 six-1.16.0 sortedcontainers-2.4.0 tblib-2.0.0 threadpoolctl-3.2.0 toolz-0.12.0 tornado-6.3.3 typing-extensions-4.7.1 tzdata-2023.3 urllib3-2.0.4 xarray-2023.7.0 zarr-2.16.0 zict-3.0.0 zipp-3.16.2
+1 to making tskit an optional dependency, on the same footing as the other IO packages.
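For what it's worth, one common way to wire up an optional dependency is a lazy import that fails with a helpful message. This is only a sketch of the general pattern, not how sgkit's existing IO packages are implemented; the helper name `require_optional` and the `sgkit[tskit]` extra are both hypothetical.

```python
import importlib


def require_optional(module_name, extra):
    """Import an optional dependency, or raise an ImportError that explains
    how to install it (the extra name is an assumption for illustration)."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"{module_name} is required for this feature; "
            f"try: pip install 'sgkit[{extra}]'"
        ) from err


# Assumed usage inside a hypothetical sgkit.io.tskit module:
#   tskit = require_optional("tskit", "tskit")
```

This keeps `import sgkit` working on a minimal install, and only users who call the tskit conversion code need tskit and its jsonschema dependencies.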