tskit icon indicating copy to clipboard operation
tskit copied to clipboard

Add samples and sites options to ts.genotype_matrix?

Open hyanwong opened this issue 5 years ago • 9 comments
trafficstars

As noted in the docs, the genotype_matrix method can be very memory intensive. As we get to larger datasets, I suspect users might want to only extract portions of the matrix, e.g. certain samples, or certain regions. For example one extreme would be to export the haplotype for a single individual. The obvious way to do this is to run simplify to reduce the number of samples, and keep_intervals to reduce the number of sites, but this isn't likely to be obvious to the average user. Should we provide 2 parameters to the genotype_matrix method itself which, if not None, run these two preprocessing steps to reduce the output before returning the genotype matrix?

hyanwong avatar Jun 13 '20 09:06 hyanwong

I think there's a samples argument to variants which we could also easily add to genotype matrix. That would be an easy PR, I think. Going through simplify etc seems excessive.

Sites is a bit less obvious, but I guess list of site IDs to output would be fine. For now, we could just skip any sites that are not in this list as we're iterating over the variants.

But, ah - this is done in C at the moment IIRC. That's probably pointless, to be honest, and we should probably just implement the genotype_matrix method in Python.

jeromekelleher avatar Jun 15 '20 09:06 jeromekelleher

I'm interested in this, so might have a go at it soon.

mufernando avatar Jun 24 '20 00:06 mufernando

I'm interested in this, so might have a go at it soon.

Great, thanks @mufernando! The first thing to do is to figure out if a Python version of the current genotype_matrix function performs much worse than what we have now. To do this, create a large tree sequence (mininum 100K samples) and do some timing checks.

If we don't lose any real perf by doing it in python, then we can get rid of the current Python-C function, and our lives will be much easier.

jeromekelleher avatar Jun 24 '20 08:06 jeromekelleher

To really do this efficiently for sites we'd need a method for efficiently seeking to particular trees, so I'm putting this in the random access project.

jeromekelleher avatar Sep 29 '20 14:09 jeromekelleher

In removing SlimTreeSequence over in pyslim, I've run in to the ts.mutations_at( ) method I provided there, that provides more or less this functionality. It'd be nice to point to a replacement in tskit. What's the proposal for this API? Is it a method of Variant? I'm not sure if anyone is using it, so perhaps I won't actually do it right away (maybe, though?), but it'd at least be nice to have a concrete path forward worked out, and perhaps I can re-use my testing code from pyslim at least.

petrelharp avatar May 22 '22 14:05 petrelharp

Edit: actually, this method is used in an example recipe in the SLiM manual ("18.13 - Tree-sequence recording and nucleotide-based models") to count up the trinucleotide mutation spectrum (ie the contexts in which mutations occurred), so I will be including the functionality somehow.

petrelharp avatar May 22 '22 15:05 petrelharp

We have https://github.com/tskit-dev/tskit/issues/2176 planned (soon) to bring the new C variant methods up to Python. After this, it will be pretty easy to make convenient methods on top. What's your timescale on needing this?

benjeffery avatar May 23 '22 00:05 benjeffery

Ah, awesome. I can just move over my python code, so I don't need it at any particular time. Which sort of 'soon' is this one, do you think? That'd help me figure out how to deal with it in the short term.

petrelharp avatar May 23 '22 15:05 petrelharp

I would hope before the paper release! So a couple of months?

benjeffery avatar May 23 '22 15:05 benjeffery

@benjeffery has suggested doing the following tasks (as separate PRs) to solve this issue.

~0 - Add ts.genotype_matrix to performance benchmark - PR 1~ ~1 - Change current implementation to python similar to ts.variants - no changes to tests - PR 2~ ~1.5 - Check for perf regression~ 2 - Add sites argument to this function and tests for it - PR 3 ~3 - Add samples argument and tests - PR 4~

szhan avatar Aug 16 '22 12:08 szhan

Samples argument is already done, and can be ticked off.

hyanwong avatar Sep 12 '22 16:09 hyanwong

As agreed in #2494 we will revisit adding sites argument at some time.

szhan avatar Sep 16 '22 15:09 szhan