tskit Windows/Site masking for two-site statistics

Implement a method of masking sites for two-site statistics. Since we're interested in slicing a subset of the correlation matrix, we won't need windows for sites. Or, rather, we'll want 1 window.

The overall goal here is to slice out an NxM matrix, not just an NxN matrix. I imagine that windowing will also allow parallelization of the two-site statistics.

In our discussions on #2805, @petrelharp mentioned a site mask, which I find interesting. One issue with providing a window to the two-sites code is that we won't know which site each element of the matrix corresponds to. So, if we provide a list or mask of sites, we can know up-front which sites correspond to which elements of the matrix. Otherwise, I'd thought of returning a list of sites that fall within the window, which is less appealing I think.

We also might want some helper functions that allow us to mask sites within genomic ranges. I'd be curious to know your thoughts @petrelharp, @jeromekelleher, @apragsdale.

Aug 30 '23 02:08 lkirk

An explicit sites argument sounds good to me. Do you have some proposals for what a Python API would look like?

Aug 30 '23 08:08 jeromekelleher

Here's one proposal to get things moving:

ts.r2_matrix(sites=[[0, 1], [2, 3]])

Returns a 2x2 matrix R of r^2 values from the pairs of sites, like where, R[0, 0] = r2 between site 0 and 2, R[0, 1] is r2 between 0 and 3, etc.

This seems like a reasonably flexible thing to get started with?

Aug 30 '23 08:08 jeromekelleher

I've had a couple of competing thoughts...

Explicitly listing the sites

Advantages: we can leave one out easily (i.e. right_sites=[4, 6, 7]) Disadvantages: the list is longer

ts.r2_matrix(left_sites=[0, 1, 2, 3, 4, 5], right_sites=[4, 5, 6, 7])

Site mask

Advantages: potentially being able to utilize numpy masking and indexing facilities. Disadvantages: need one list element per site, potentially too complex.

left_mask = np.zeros(ts.num_sites, dtype=bool)
right_mask = np.zeros(ts.num_sites, dtype=bool)
left_mask[1:20] = True
left_mask[4] = False
right_mask[:] = True
ts.r2_matrix(left_sites=left_mask, right_sites=right_mask)

Range

Advantages: compact, easy to specify Disadvantages: we must process every site, potentially forcing the user to delete sites or stitch together multiple matricies, skipping desired sites.

ts.r2_matrix(left_sites=(0, 10), right_sites=(2, 4))

What you've suggested is similar to the Range option, but you've put everything in nested lists, which can be tricky for the user. Also, I would imagine that if we left out right_sites, the default would be to get the r2 matrix for a subset of left sites compared to every single right site.

Conclusions?

I'm sort of leaning towards just explicitly listing the sites, since it's the most explicit while retaining a compact feel. We could also implement more than one, but I think that can lead to api creep.

I also think it would be useful if we had some utility that grabbed sites within a range so that the user can still think in genomic space (can be adapted to easily work with any of the above proposed ideas).

sites = ts.get_sites_for_range(start, stop)
sites
>>> [1, 2, 3, 4, 5]

Aug 31 '23 09:08 lkirk

What do you think about having a single sites arg @lkirk? I'm not sure left and right are helpful here, since they're really talking about the rows and columns of the output matrix (right?)

Aug 31 '23 10:08 jeromekelleher

@jeromekelleher I think we that it depends on the way we want to define these. If we use ranges like you've suggested, then we can have a single sites arg because the matrix dimensions make sense when converting to a numpy array under the hood.

For example, your suggestion of a 2x2 matrix works, but my suggestion of left and right lists of sites (left=[1, 2, 3], right=[2, 3]) wouldn't work in matrix form (numpy doesn't like ragged arrays much).

Maybe I'm being too general/flexible by advocating for explicit lists of sites? Ranges are appealing and simple conceptually. That said, I think that the user would still benefit from being able to think in genomic space (basepairs), so I think it'd be useful for a user to slice out a list/range of sites for a given basepair range.

Sep 04 '23 19:09 lkirk

I think we're saying the same thing @lkirk, but with different names. So, your proposal of

ts.r2_matrix(left_sites=[0, 1, 2, 3, 4, 5], right_sites=[4, 5, 6, 7])

would be

ts.r2_matrix([[0, 1, 2, 3, 4, 5], [4, 5, 6, 7]])

The terms left and right aren't very helpful I think. We could have two separate args like

ts.r2_matrix(rows=[0, 1, 2, 3, 4, 5], cols=[4, 5, 6, 7])

but that's basically the same thing, and means you can't do user-friendly things like speicifying a symmetric matrix with a 1D array,

ts.r2_matrix(np.arange(ts.num_sites)) == ts.r2_matrix((np.arange(ts.num_sites)), np.arange(ts.num_sites)))

Let's not worry about how that corresponds to a "real" 2D numpy array in the ragged case - that's going to be a tiny factor in the overall runtime.

For the C API, I think we probably should have two arrays rows and columns.

Unless we're talking about aggregating across multiple sites (are we?), then I don't think we should think about windows here at all, and assume instead that there will be higher level APIs to generate the lists of site IDs corresponding to a given window spec. So, we don't think about genome coordinates at all, for the r2_matrix function.

Sep 05 '23 08:09 jeromekelleher

@jeromekelleher I see, I thought you were talking about ranges, not explicitly addressing each site. I agree that left and right are not useful terms to describe arrays. I think what you're saying seems reasonable: Python api should look like this: ts.r2_matrix([[0, 1, 2, 3, 4, 5], [4, 5, 6, 7]]) and the C api will use rows and columns arrays (much more explicit than left/right).

Yes, I think we should delegate the generation of site lists to a higher level API function.

We will eventually be aggregating over multiple sites (LD scores), but I'm not sure how windows would interact with these site lists. I was thinking about having mutually exclusive arguments in both the C and python API where we could specify windows or sites. We will need windows for branch stats as well.

Sep 05 '23 13:09 lkirk

Hmm, that is trickier. but I guess if we just call it sites, then it's reasonably clear that this wouldn't apply to branch mode.

Maybe we should just implement the sites= version first, and get some Python infrastructure done, and worry about the rest later?

Sep 05 '23 14:09 jeromekelleher

Yes, I agree. I think removing windows entirely and layering them back in as needed is probably the way to go. I can add them when we get to LD scores and branch stats.

I created a sketch of the python implementation and just playing around with it makes me realize that having windows and sites means we're going to have a lot of parameters at the interfaces between python and C.

Sep 05 '23 15:09 lkirk

I think the first use case here is to only include sites with MAF above some cutoff; so an explicit list of sites is the easiest way to do that.

Sep 06 '23 16:09 petrelharp