Windows/Site masking for two-site statistics
Implement a method of masking sites for two-site statistics. Since we're interested in slicing a subset of the correlation matrix, we won't need windows for sites. Or, rather, we'll want 1 window.
The overall goal here is to slice out an NxM matrix, not just an NxN matrix. I imagine that windowing will also allow parallelization of the two-site statistics.
In our discussions on #2805, @petrelharp mentioned a site mask, which I find interesting. One issue with providing a window to the two-sites code is that we won't know which site each element of the matrix corresponds to. So, if we provide a list or mask of sites, we can know up-front which sites correspond to which elements of the matrix. Otherwise, I'd thought of returning a list of sites that fall within the window, which is less appealing I think.
We also might want some helper functions that allow us to mask sites within genomic ranges. I'd be curious to know your thoughts @petrelharp, @jeromekelleher, @apragsdale.
An explicit sites argument sounds good to me. Do you have some proposals for what a Python API would look like?
Here's one proposal to get things moving:
ts.r2_matrix(sites=[[0, 1], [2, 3]])
Returns a 2x2 matrix R of r^2 values from the pairs of sites, like where, R[0, 0] = r2 between site 0 and 2, R[0, 1] is r2 between 0 and 3, etc.
This seems like a reasonably flexible thing to get started with?
I've had a couple of competing thoughts...
Explicitly listing the sites
Advantages: we can leave one out easily (i.e. right_sites=[4, 6, 7])
Disadvantages: the list is longer
ts.r2_matrix(left_sites=[0, 1, 2, 3, 4, 5], right_sites=[4, 5, 6, 7])
Site mask
Advantages: potentially being able to utilize numpy masking and indexing facilities. Disadvantages: need one list element per site, potentially too complex.
left_mask = np.zeros(ts.num_sites, dtype=bool)
right_mask = np.zeros(ts.num_sites, dtype=bool)
left_mask[1:20] = True
left_mask[4] = False
right_mask[:] = True
ts.r2_matrix(left_sites=left_mask, right_sites=right_mask)
Range
Advantages: compact, easy to specify Disadvantages: we must process every site, potentially forcing the user to delete sites or stitch together multiple matricies, skipping desired sites.
ts.r2_matrix(left_sites=(0, 10), right_sites=(2, 4))
What you've suggested is similar to the Range option, but you've put everything in nested lists, which can be tricky for the user. Also, I would imagine that if we left out right_sites, the default would be to get the r2 matrix for a subset of left sites compared to every single right site.
Conclusions?
I'm sort of leaning towards just explicitly listing the sites, since it's the most explicit while retaining a compact feel. We could also implement more than one, but I think that can lead to api creep.
I also think it would be useful if we had some utility that grabbed sites within a range so that the user can still think in genomic space (can be adapted to easily work with any of the above proposed ideas).
sites = ts.get_sites_for_range(start, stop)
sites
>>> [1, 2, 3, 4, 5]
What do you think about having a single sites arg @lkirk? I'm not sure left and right are helpful here, since they're really talking about the rows and columns of the output matrix (right?)
@jeromekelleher I think we that it depends on the way we want to define these. If we use ranges like you've suggested, then we can have a single sites arg because the matrix dimensions make sense when converting to a numpy array under the hood.
For example, your suggestion of a 2x2 matrix works, but my suggestion of left and right lists of sites (left=[1, 2, 3], right=[2, 3]) wouldn't work in matrix form (numpy doesn't like ragged arrays much).
Maybe I'm being too general/flexible by advocating for explicit lists of sites? Ranges are appealing and simple conceptually. That said, I think that the user would still benefit from being able to think in genomic space (basepairs), so I think it'd be useful for a user to slice out a list/range of sites for a given basepair range.
I think we're saying the same thing @lkirk, but with different names. So, your proposal of
ts.r2_matrix(left_sites=[0, 1, 2, 3, 4, 5], right_sites=[4, 5, 6, 7])
would be
ts.r2_matrix([[0, 1, 2, 3, 4, 5], [4, 5, 6, 7]])
The terms left and right aren't very helpful I think. We could have two separate args like
ts.r2_matrix(rows=[0, 1, 2, 3, 4, 5], cols=[4, 5, 6, 7])
but that's basically the same thing, and means you can't do user-friendly things like speicifying a symmetric matrix with a 1D array,
ts.r2_matrix(np.arange(ts.num_sites)) == ts.r2_matrix((np.arange(ts.num_sites)), np.arange(ts.num_sites)))
Let's not worry about how that corresponds to a "real" 2D numpy array in the ragged case - that's going to be a tiny factor in the overall runtime.
For the C API, I think we probably should have two arrays rows and columns.
Unless we're talking about aggregating across multiple sites (are we?), then I don't think we should think about windows here at all, and assume instead that there will be higher level APIs to generate the lists of site IDs corresponding to a given window spec. So, we don't think about genome coordinates at all, for the r2_matrix function.
@jeromekelleher I see, I thought you were talking about ranges, not explicitly addressing each site. I agree that left and right are not useful terms to describe arrays. I think what you're saying seems reasonable: Python api should look like this: ts.r2_matrix([[0, 1, 2, 3, 4, 5], [4, 5, 6, 7]]) and the C api will use rows and columns arrays (much more explicit than left/right).
Yes, I think we should delegate the generation of site lists to a higher level API function.
We will eventually be aggregating over multiple sites (LD scores), but I'm not sure how windows would interact with these site lists. I was thinking about having mutually exclusive arguments in both the C and python API where we could specify windows or sites. We will need windows for branch stats as well.
Hmm, that is trickier. but I guess if we just call it sites, then it's reasonably clear that this wouldn't apply to branch mode.
Maybe we should just implement the sites= version first, and get some Python infrastructure done, and worry about the rest later?
Yes, I agree. I think removing windows entirely and layering them back in as needed is probably the way to go. I can add them when we get to LD scores and branch stats.
I created a sketch of the python implementation and just playing around with it makes me realize that having windows and sites means we're going to have a lot of parameters at the interfaces between python and C.
I think the first use case here is to only include sites with MAF above some cutoff; so an explicit list of sites is the easiest way to do that.