tskit
tskit copied to clipboard
Add samples and sites options to ts.genotype_matrix?
As noted in the docs, the genotype_matrix method can be very memory intensive. As we get to larger datasets, I suspect users might want to only extract portions of the matrix, e.g. certain samples, or certain regions. For example one extreme would be to export the haplotype for a single individual. The obvious way to do this is to run simplify to reduce the number of samples, and keep_intervals to reduce the number of sites, but this isn't likely to be obvious to the average user. Should we provide 2 parameters to the genotype_matrix method itself which, if not None, run these two preprocessing steps to reduce the output before returning the genotype matrix?
I think there's a samples argument to variants which we could also easily add to genotype matrix. That would be an easy PR, I think. Going through simplify etc seems excessive.
Sites is a bit less obvious, but I guess list of site IDs to output would be fine. For now, we could just skip any sites that are not in this list as we're iterating over the variants.
But, ah - this is done in C at the moment IIRC. That's probably pointless, to be honest, and we should probably just implement the genotype_matrix method in Python.
I'm interested in this, so might have a go at it soon.
I'm interested in this, so might have a go at it soon.
Great, thanks @mufernando! The first thing to do is to figure out if a Python version of the current genotype_matrix function performs much worse than what we have now. To do this, create a large tree sequence (mininum 100K samples) and do some timing checks.
If we don't lose any real perf by doing it in python, then we can get rid of the current Python-C function, and our lives will be much easier.
To really do this efficiently for sites we'd need a method for efficiently seeking to particular trees, so I'm putting this in the random access project.
In removing SlimTreeSequence over in pyslim, I've run in to the ts.mutations_at( ) method I provided there, that provides more or less this functionality. It'd be nice to point to a replacement in tskit. What's the proposal for this API? Is it a method of Variant? I'm not sure if anyone is using it, so perhaps I won't actually do it right away (maybe, though?), but it'd at least be nice to have a concrete path forward worked out, and perhaps I can re-use my testing code from pyslim at least.
Edit: actually, this method is used in an example recipe in the SLiM manual ("18.13 - Tree-sequence recording and nucleotide-based models") to count up the trinucleotide mutation spectrum (ie the contexts in which mutations occurred), so I will be including the functionality somehow.
We have https://github.com/tskit-dev/tskit/issues/2176 planned (soon) to bring the new C variant methods up to Python. After this, it will be pretty easy to make convenient methods on top. What's your timescale on needing this?
Ah, awesome. I can just move over my python code, so I don't need it at any particular time. Which sort of 'soon' is this one, do you think? That'd help me figure out how to deal with it in the short term.
I would hope before the paper release! So a couple of months?
@benjeffery has suggested doing the following tasks (as separate PRs) to solve this issue.
~0 - Add ts.genotype_matrix to performance benchmark - PR 1~ ~1 - Change current implementation to python similar to ts.variants - no changes to tests - PR 2~ ~1.5 - Check for perf regression~ 2 - Add sites argument to this function and tests for it - PR 3 ~3 - Add samples argument and tests - PR 4~
Samples argument is already done, and can be ticked off.
As agreed in #2494 we will revisit adding sites argument at some time.