randomized svd draft
Description
A draft of randomized principal component analysis (PCA) using `TreeSequence.genetic_relatedness_vector`. The implementation currently uses `scipy.sparse`, which should eventually be removed.
This part of the code is only used when collapsing a #samples × #samples GRM into a #individuals × #individuals matrix.
Therefore, it will not be difficult to replace with pure numpy.
The API was partially taken from scikit-learn.
To add some details: `iterated_power` is the number of power iterations used by the range finder in the randomized algorithm. The error of the SVD decreases exponentially as a function of this number.
The effect of power iteration is most pronounced when the eigenspectrum of the matrix decays slowly, which, in my experience, seems to be the case for tree sequence GRMs.
`indices` specifies the individuals to be included in the PCA, although decreasing the number of individuals does not meaningfully reduce the amount of computation.
@petrelharp Here's the code.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 89.95%. Comparing base (a405996) to head (774ff5f). Report is 1 commit behind head on main.
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main    #3008      +/-   ##
==========================================
+ Coverage   89.92%   89.95%   +0.03%
==========================================
  Files          29       29
  Lines       32362    32477     +115
  Branches     5802     5822      +20
==========================================
+ Hits        29101    29216     +115
  Misses       1859     1859
  Partials     1402     1402
```
| Flag | Coverage Δ | |
|---|---|---|
| c-tests | 86.71% <ø> (ø) | |
| lwt-tests | 80.78% <ø> (ø) | |
| python-c-tests | 89.23% <ø> (ø) | |
| python-tests | 98.99% <100.00%> (+0.01%) | :arrow_up: |
Flags with carried forward coverage won't be shown.
| Files with missing lines | Coverage Δ | |
|---|---|---|
| python/tskit/trees.py | 98.86% <100.00%> (+0.05%) | :arrow_up: |
This looks great! Very elegant. I think probably we ought to include a samples argument, though? For consistency, but also since the tree sequence represents phased data, and so it's actually informative to look at the PCs of maternally- and paternally-inherited chromosomes separately.
So, how about the signature is like
```python
def pca(samples=None, individuals=None, ...)
```
and:
- the default is equivalent to `samples=ts.samples(), individuals=None`
- you can't have both `samples` and `individuals` specified
- if `individuals` is a list of individual IDs then it does as in the code currently
- otherwise, it just skips the "sum over individuals" step
Note that we could be getting PCs for non-sample nodes (since individual's nodes need not be samples); I haven't thought through whether the values you get are correct or informative. My guess is that maybe they are? But we need a "user beware" note for this?
Ah, sorry - one more thing - does this work with windows? (It looks like not?)
I think the way to do the windows would be something like
```python
drop_windows = windows is None
if drop_windows:
    windows = [0, self.sequence_length]
# then do stuff; with these windows, genetic_relatedness will always return
# an array where the first dimension is "window", so you can operate on
# each slice separately
...
if drop_windows:
    # get rid of the first dimension in the output
    ...
```
Basically - get it to work in the case where windows are specified (ie not None) and then we can get it to have the right behavior.
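A self-contained toy version of that pattern (plain numpy, with a made-up per-window sum standing in for the `genetic_relatedness` computation):

```python
import numpy as np

def windowed_stat(values, positions, windows=None, sequence_length=100.0):
    # Pattern from the review: normalise windows=None to a single window,
    # always compute with a leading "window" dimension, then drop that
    # dimension at the end if the caller didn't ask for windows.
    drop_windows = windows is None
    if drop_windows:
        windows = [0, sequence_length]
    out = []
    for left, right in zip(windows[:-1], windows[1:]):
        mask = (positions >= left) & (positions < right)
        out.append(values[mask].sum())  # stand-in per-window computation
    out = np.array(out)
    return out[0] if drop_windows else out
```

With this shape convention, the windowed and unwindowed code paths share all of the real work, and only the final squeeze differs.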
A simple test case for the windows feature.
```python
import msprime
import numpy as np

demography = msprime.Demography()
demography.add_population(name="A", initial_size=5_000)
demography.add_population(name="B", initial_size=5_000)
demography.add_population(name="C", initial_size=1_000)
demography.add_population_split(time=1000, derived=["A", "B"], ancestral="C")
ts = msprime.sim_ancestry(
    samples={"A": 500, "B": 500},
    sequence_length=1e6,
    recombination_rate=3e-8,
    demography=demography,
    random_seed=12,
)
seq_length = ts.sequence_length

individuals = np.asarray([i.id for i in ts.individuals()])
U, _ = ts.pca(individuals=individuals, iterated_power=5, random_seed=1,
              windows=[0, seq_length / 2, seq_length])
U0, _ = ts.pca(individuals=individuals, iterated_power=5, random_seed=1,
               windows=[0, seq_length / 2])
U1, _ = ts.pca(individuals=individuals, iterated_power=5, random_seed=1,
               windows=[seq_length / 2, seq_length])

idx = 0  # the idx-th principal component
# correlation instead of allclose because PCA is rotation symmetric
np.corrcoef(U[0][:, idx], U0[:, idx]), np.corrcoef(U[1][:, idx], U1[:, idx])
```
Because of the randomness of the algorithm, the correlation is not exactly 1, although it's nearly 1, around 0.99995623.
I just noticed that `centre` doesn't work with the `nodes` option. The new commit fixes this problem.
Check results for two windows.
```python
import msprime
import numpy as np

demography = msprime.Demography()
demography.add_population(name="A", initial_size=5_000)
demography.add_population(name="B", initial_size=5_000)
demography.add_population(name="C", initial_size=1_000)
demography.add_population_split(time=1000, derived=["A", "B"], ancestral="C")
seq_length = 1e6
ts = msprime.sim_ancestry(
    samples={"A": 500, "B": 500},
    sequence_length=seq_length,
    recombination_rate=3e-8,
    demography=demography,
    random_seed=12,
)

# for individuals
individuals = np.asarray([i.id for i in ts.individuals()])
U, _ = ts.pca(individuals=individuals, iterated_power=5, random_seed=1,
              windows=[0, seq_length / 2, seq_length])
U0, _ = ts.pca(individuals=individuals, iterated_power=5, random_seed=1,
               windows=[0, seq_length / 2])
U1, _ = ts.pca(individuals=individuals, iterated_power=5, random_seed=1,
               windows=[seq_length / 2, seq_length])
idx = 0  # the idx-th principal component
# correlation instead of allclose because PCA is rotation symmetric
np.corrcoef(U[0][:, idx], U0[0][:, idx]), np.corrcoef(U[1][:, idx], U1[0][:, idx])

# for nodes
U, _ = ts.pca(iterated_power=5, random_seed=1, windows=[0, seq_length / 2, seq_length])
U0, _ = ts.pca(iterated_power=5, random_seed=1, windows=[0, seq_length / 2])
U1, _ = ts.pca(iterated_power=5, random_seed=1, windows=[seq_length / 2, seq_length])
idx = 0
np.corrcoef(U[0][:, idx], U0[0][:, idx]), np.corrcoef(U[1][:, idx], U1[0][:, idx])
```
Okay; here I've made a good start at the tests. I think everything is working fine; the tests are not passing because (I think) of numerical tolerance. I could just set the numerical tolerance to like 1e-4 and they'd pass, but I think this is flagging a bigger issue - how do we tell we're getting good answers? I ask because currently the tests pass (at default tolerances) for small numbers of samples but not for 30 samples; if I increase iterated_power then they do pass; so the current defaults seem okay for 30 samples, but I worry that at 10,000 samples they might be very bad. If u, s is one of the output vector, value pairs, can we compare s * u to W @ u and translate the result to some measure of tolerance? Not very good options here that I can think of are
- add a `tolerance` argument and we iterate until this tolerance is reached
- return an estimate of error; perhaps we should return some kind of structured output that contains information about convergence as well?
We'd like this to be not a can of worms; I think our goal is to have something that is good enough, and forwards-compatible with an improved method in the future.
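One cheap check along the lines suggested above: for a symmetric matrix like the GRM, the relative residual of each returned pair measures how converged it is. A generic numpy sketch (with `W`, `u`, `s` standing in for the matrix and one returned eigenvector/eigenvalue pair):

```python
import numpy as np

def eigpair_relative_residual(W, u, s):
    # An exact eigenpair of a symmetric W satisfies W @ u == s * u,
    # so the size of the residual relative to s measures convergence.
    return np.linalg.norm(W @ u - s * u) / abs(s)
```

This could back either of the options above: iterate until the residual drops below a `tolerance`, or report it alongside the results as a convergence diagnostic.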
Notes:
- For testing, I use `np.linalg.svd`, and (of course) had to mess around so that we'd correctly deal with indeterminacy of sign and also indeterminacy of ordering when singular values are the same.
- Perhaps we should call it `num_components` rather than `n_components` to match elsewhere?
- I did some rearranging of how the `operator` in `_rand_svd` was defined to avoid lint complaining (but was unsuccessful, so inserted the `NOQA`).
- Some of the changes were due to automatic linting; apologies for the noise there.
TODO:
- add tests that include `samples` and `individuals` arguments
- test on some weirder tree sequences (that's easy)
Just a quick note that I'd be very much in favour of returning a dataclass here rather than a tuple so that the option of returning more information about convergence etc is open.
There's an adaptive rangefinder algorithm described in Halko et al. (https://arxiv.org/pdf/0909.4061, Algo 4.2). I don't see it implemented in scikit-learn (https://scikit-learn.org/dev/modules/generated/sklearn.decomposition.TruncatedSVD.html#sklearn.decomposition.TruncatedSVD). Halko et al. also has an error estimation formula with an overhead similar to the main routine.
I like Jerome's idea to return a class instead of the result. There's an intermediate matrix Q (approximate range of the original matrix) that takes up most of the computation, and storing it is useful in many cases. For example, one can add additional power iterations to Q without doing it from scratch to improve precision.
Not a class object yet, but I added `range_sketch` to the input/output.
Re the result object, I'd imagined something like
```python
@dataclasses.dataclass
class PcaResult:
    descriptive_name1: np.ndarray  # Or whatever type hints we can get to work
    descriptive_name2: ...
```
Now, `pca()` returns a dataclass like the following:

```python
@dataclass
class PCAResult:
    U: np.ndarray
    D: np.ndarray
    Q: np.ndarray
    E: np.ndarray
```
U and D are as before. Q is the range sketch matrix used as the approximate orthonormal basis of the GRM; it is also the only expensive part of the algorithm, involving GRM-times-matrix operations. E is the error bound for the singular values. Both Q and E will have different values for each window, if windows are present.
A user can continuously improve their estimate through Q. `pca` now has a `range_sketch: np.ndarray = None` option that accepts Q from a previous round of pca. This can be done like:

```python
pca_result = ts.pca( ... )
pca_result_round_2 = ts.pca( ..., range_sketch=pca_result.Q, ...)
```
If the first round did q power iterations and the second round does p additional power iterations, the result of the second round has q+p iterations in total. By adding power iterations in successive rounds, one can improve the accuracy without rerunning the whole process from scratch.
It now has a time-resolved feature: you can select branches within lower and upper time limits. It is based on `decapitate`.
NICE!
I made a pass through the docs. We need to add time_windows to the tests still, and see what's going on with the CI.
It looks like the things to do here are:
- get the tests working (right now they fail with `FAILED tests/test_relatedness_vector.py::TestPCA::test_bad_windows - TypeError: pca() got an unexpected keyword argument 'n_components'`)
- either remove the `time_windows` argument for now, or write tests for it
- write tests that exercise the `individuals` argument (or remove it)
- write a test that uses `range_sketch`
- write tests that exercise `iterated_power` and `num_oversamples`: probably, just something that checks whether setting these to bigger numbers still gets us (nearly) the same answer
Bumping this one - we want to get ts.pca implemented and released as soon as we can. What's left to do here @hanbin973? Can we help with anything to get it over the line?
It's just the test code that's missing. We have to make the tests pass. I told @petrelharp that I would work on it, but well, it didn't go as planned :(
I don't think this works if we use it on a tree sequence where the samples aren't 0,...,n. Trying this out on the SARS-CoV-2 data, I got:
```
/tmp/ipykernel_620501/3297030094.py in ?()
----> 1 pca = ts.pca(4, samples=ts.samples())

~/.local/lib/python3.10/site-packages/tskit/trees.py in ?(self, num_components, windows, samples, individuals, time_window, mode, centre, iterated_power, num_oversamples, random_seed, range_sketch)
   8856         else:
   8857             low = _f_low(arr=x, indices=indices, mode=mode, centre=centre, windows=this_window)
   8858             return high - low
   8859
-> 8860     U[i], D[i], _, Q[i], E[i] = _rand_svd(
   8861         operator=_G,
   8862         operator_dim=dim,
   8863         rank=num_components,

~/.local/lib/python3.10/site-packages/tskit/trees.py in ?(operator, operator_dim, rank, depth, num_vectors, rng, range_sketch)
   8803     """
   8804     Algorithm 8 in https://arxiv.org/pdf/2002.01387
   8805     """
   8806     assert num_vectors >= rank > 0
-> 8807     Q = _rand_pow_range_finder(
   8808         operator, operator_dim, num_vectors, depth, num_vectors, rng, range_sketch
   8809     )
   8810     C = operator(Q).T

~/.local/lib/python3.10/site-packages/tskit/trees.py in ?(operator, operator_dim, rank, depth, num_vectors, rng, range_sketch)
   8786     else:
   8787         Q = range_sketch
   8788     for _ in range(depth):
   8789         Q = np.linalg.qr(Q).Q
-> 8790         Q = operator(Q)
   8791     Q = np.linalg.qr(Q).Q
   8792     return Q[:, :rank]

~/.local/lib/python3.10/site-packages/tskit/trees.py in ?(x)
   8852     def _G(x):
-> 8853         high = _f_high(arr=x, indices=indices, mode=mode, centre=centre, windows=this_window)
   8854         if time_window is None:
   8855             return high
   8856         else:

~/.local/lib/python3.10/site-packages/tskit/trees.py in ?(self, arr, indices, mode, centre, windows)
   8610         centre: bool = True,
   8611         windows = None,
   8612     ) -> np.ndarray:
   8613         x = arr - arr.mean(axis=0) if centre else arr
-> 8614         x = self._expand_indices(x, indices)
   8615         x = self.genetic_relatedness_vector(
   8616             W=x, windows=windows, mode=mode, centre=False, nodes=indices,
   8617         )[0]

~/.local/lib/python3.10/site-packages/tskit/trees.py in ?(self, x, indices)
   8596         x: np.ndarray,
   8597         indices: np.ndarray
   8598     ) -> np.ndarray:
   8599         y = np.zeros((self.num_samples, x.shape[1]))
-> 8600         y[indices] = x
   8601
   8602         return y

IndexError: index 2482157 is out of bounds for axis 0 with size 2482157
```
but if I first simplify (`ts = ts.simplify()`) so that the samples are 0 to n, it runs fine. Just needs a simple test for this case, I'd imagine.
I successfully ran this on a 2.5M sample SARS-CoV-2 ARG. Took about 30 seconds, and seemed to converge OK (at least from my cursory check using the recommended approach). A summary of the results is here: https://github.com/jeromekelleher/sc2ts-paper/issues/372
Okay - I've put in a bunch of the testing code for correct arguments, etcetera. Still TODO:
- [ ] testing for the error bounds
- [ ] figure out why `individuals` is not passing tests (see below)
One issue I've turned up along the way is that the `individuals` mode was, I think, not agreeing with our definition of `genetic_relatedness`: as described in the relatedness paper, relatedness between two sample sets is the average of the relatedness between pairwise combinations of samples. This is just a factor of 4 for diploids - but that's a discrepancy that will confuse people - and it actually makes a difference for mixed ploidy (e.g. sex chromosomes). I think I've sorted this out correctly.
Have we convinced ourselves that the default `iterated_power` is a good value? I have a bit of a hard time figuring out what "good" means - like, to what tolerance do we need these eigenvectors?
I also refactored the code to randomly generate range_sketch outside of the loop over windows, because this makes (a) the testing-for-correct-inputs code up top cleaner, and (b) the _rand_pow_range_finder and _rand_svd functions simpler. This has a memory downside of having to have range_sketch for all windows, not just one -- but, this is something we're returning anyhow, so is not a big deal.
I added an individuals argument to the testing implementation, but this isn't matching; I've stared at it a bunch and don't know why. Oddly, ts.pca( samples=x) and ts.pca( individuals=x) match for haploid individuals, but these don't match to the testing code.
Also - I wonder if a better name for iterated_power would be power_iterations?
Oh, I see that scikit-learn says iterated_power (and also n_components instead of num_components). Hm. I think that being compatible with them is good, but copying their bad API choices is not required? (Well, n_components is a fine API, but it isn't consistent with the rest of our API, which is num_X for things?)
I'd vote for compatible with scikit second, compatible with our other APIs first.
What is the exact failure about? Is it the eigenvalues not matching or the factor scores not matching? In terms of the factor scores, I think it's better to compare U.T @ U to each other so that we avoid rotation and sign problems. Shall I change the code?
The failure is that the two answers are Definitely Not the Same. Clicking on the "Tests / ..." link above, or running locally, we get
```
FAILED tests/test_relatedness_vector.py::TestPCA::test_individuals[1-0-True] - AssertionError:
Not equal to tolerance rtol=1e-07, atol=1e-08

Mismatched elements: 5 / 5 (100%)
Max absolute difference among violations: 5733.50077407
Max relative difference among violations: 0.60127089
 ACTUAL: array([3802.135878, 3157.318008, 2779.806231, 2273.337602, 1637.086583])
 DESIRED: array([9535.636652, 5611.718265, 4424.8184  , 4081.358172, 2913.365737])
```
To run this locally, do
```sh
python -m pytest tests/test_relatedness_vector.py::TestPCA::test_individuals[1-0-True]
```
I hear what you're saying about comparing U.T @ U, but I think we do want to actually test the individual eigenvectors, not just that the low-dimensional approximation is right. (Or maybe I don't understand your suggestion?) The testing code is robust to ordering of eigen-things and sign switches; however, if two eigenvalues are very close then, you're right, we could have problems with nonidentifiability that way. Please do have a look at the testing code, though, to see if you have a better suggestion.
But - I don't think that's the problem - inserting a print into the testing code shows that these are the eigenvalues:
```
[3837.56623916, 3364.17632569, 2843.38691979, 2419.60371452, 655.31546225]
```
Good catch!
```diff
     ndows,)
-    U, D = pca(ts=ts, windows=windows, samples=samples, centre=centre)
+    U, D = pca(ts=ts, windows=windows, centre=centre, samples=samples, individuals=individuals)
```
And - hah - I guess I missed the ... in the output saying that they're of different lengths.
Now all the tests are passing on my local machine. The error was due to a minor omission of the `individuals` option in the `verify_pca()` function.
Hey, that's great! I need to scrutinize codecov still, but there's one not totally trivial linting issue:
```
python/tskit/trees.py:8886:24: B023 Function definition does not bind loop variable '_f_high'.
python/tskit/trees.py:8888:29: B023 Function definition does not bind loop variable 'indices'.
python/tskit/trees.py:8891:29: B023 Function definition does not bind loop variable 'this_window'.
python/tskit/trees.py:8896:27: B023 Function definition does not bind loop variable '_f_low'.
python/tskit/trees.py:8898:33: B023 Function definition does not bind loop variable 'indices'.
python/tskit/trees.py:8901:33: B023 Function definition does not bind loop variable 'this_window'.
```
I get this by running the pre-commit hooks (see the developer docs for how to set this up); in particular:

```sh
pre-commit run --files tskit/trees.py tests/test_relatedness_vector.py
```
Would you like to have a look at how to deal with that? (I haven't figured out what it means yet.)
The linting error appears when a function defined inside a loop refers to the loop variable as a free variable. I can't immediately think of a case where this causes a problem here, but the point is that the use of the loop variable inside the function should be explicit. An easy fix is to add the loop variable as a function argument with its current value as the default. This is adapted from: https://stackoverflow.com/questions/54947434/function-definition-in-a-for-loop-using-local-variable
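A minimal demonstration of the issue and the default-argument fix (generic Python, separate from the tskit code):

```python
# Closures created in a loop capture the loop variable itself, not its
# value at definition time, so every late-bound closure sees the final i.
late = [lambda: i for i in range(3)]

# Binding the loop variable as a default argument freezes the current value.
early = [lambda i=i: i for i in range(3)]

print([f() for f in late])   # [2, 2, 2]
print([f() for f in early])  # [0, 1, 2]
```

This is exactly what B023 is warning about: the closures in the loop over windows would all silently see the last window's `indices` and `this_window` if they were ever called after the loop finished.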