glmpca-py
Factor Ordering
Are the factors ordered in some way? For classical PCA, PCs are canonically defined and ordered by variance explained.
I noticed from a small experiment that the loadings appear to be orthonormal but the factors are not.
Also, is there any necessary relationship between the matrices learned for different numbers of PCs? In another small experiment, the loading vectors from a 5-PC factorization and a 10-PC factorization were essentially unrelated.
Good question, and yes, the factors are ordered in decreasing L2 norm just like PCA (although this doesn't correspond to variance explained in the data because of the nonlinear link function). We also project off any variation in the factors that is not orthogonal to covariates (X, Z); this gets absorbed by coefX and coefZ. One result of this is that the means of the latent factors are always zero, since we always include an intercept term. And we always transform the loadings to be orthonormal, which necessarily causes the factors to not be orthonormal, as you mentioned. All of this is implemented in the ortho function in case you are interested. The detailed derivations are in section 5 of https://arxiv.org/abs/1907.02647
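As a quick numerical check of those properties (just a sketch; it assumes a Poisson glmpca fit on a counts matrix Y and uses the 'factors'/'loadings' output keys that appear later in this thread):

import numpy as np
from glmpca import glmpca

# Y: counts matrix (features x observations), assumed to be loaded already
res = glmpca.glmpca(Y, 5, fam="poi")
U, V = res["factors"], res["loadings"]

# loadings are orthonormalized, so V^T V should be (close to) the identity
print(np.allclose(V.T @ V, np.eye(V.shape[1]), atol=1e-6))

# factors are not orthonormal, but their column L2 norms should be decreasing
# and their column means ~0 because of the intercept term
print(np.linalg.norm(U, axis=0))
print(U.mean(axis=0))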
There is no necessary relationship between the matrices learned under different numbers of PCs. It's entirely possible, for example, that variation captured by dimension 1 under L=5 ends up being captured by loadings 2-3 of a dimension reduction with L=10. I'm not sure, but I think this is common to other non-eigenvector-based dimension reduction methods such as LDA and NMF. However, we would expect that if you apply clustering or compute nearest neighbors in the latent space with L=5 and compare it to a clustering with L=10, the results should not be dramatically different.
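One rough way to check that intuition is to compare each cell's nearest neighbors in the two latent spaces (a sketch only; it assumes scikit-learn is available and Y is the counts matrix as above):

import numpy as np
from sklearn.neighbors import NearestNeighbors
from glmpca import glmpca

res5 = glmpca.glmpca(Y, 5, fam="poi")
res10 = glmpca.glmpca(Y, 10, fam="poi")

def knn_sets(factors, k=15):
    # indices of each observation's k nearest neighbors in the latent space
    nn = NearestNeighbors(n_neighbors=k).fit(factors)
    return [set(row) for row in nn.kneighbors(return_distance=False)]

a, b = knn_sets(res5["factors"]), knn_sets(res10["factors"])
print(np.mean([len(x & y) / len(x) for x, y in zip(a, b)]))  # mean fraction of shared neighbors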
Great.
Followup: I was curious to see how the result differed as a function of k := nPCs.
What's the most efficient way to reuse computation to do that sweep? I tried working my way down from large k, but the deviance ended up still being quite large at the beginning of each round. In particular, I fed it factors and loadings from a larger k as init for a smaller one.
from glmpca import glmpca

# Y: counts matrix; k_range: the ranks to sweep (here, decreasing from large k)
output = None
init = {'factors': None, 'loadings': None}
for idx, k in enumerate(k_range):
    if output is not None:
        # warm-start from the previous (larger-k) fit
        init = {'factors': output['factors'],
                'loadings': output['loadings']}
    output = glmpca.glmpca(Y, k, penalty=0, init=init,
                           fam='poi', verbose=True)
Is this a reasonable thing to do? Is there a problem with not having the gene intercepts initialized as well?
I think it's reasonable, and there shouldn't be any issue with the row means and offsets since those are always re-initialized separately. I would suggest starting with small k and using that to initialize k <- k+1, since then you always preserve all the information from the previous run when initializing the next one (in this case glmpca will initialize small random numbers in the (k+1) dimension). In my limited experiments the "optimized" initialization rarely led to faster convergence than just a random initialization. I suspect it may be due to the Fisher scoring, which can jump around more than first-order methods like gradient descent. I also tried initializing with SVD, but it didn't really matter. Often I have seen the deviance actually get worse in the first few iterations and then drop dramatically, as if it is climbing over a threshold (try plotting output['dev']). BTW I'm surprised it's converging for you with penalty=0, as this often led to numerical problems for me, especially if you have any cells/genes with unusually large counts.
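In case it helps, here is a minimal sketch of that ascending sweep with the deviance trace plotted for each fit (it assumes matplotlib is installed, Y is the counts matrix, the default penalty is used to avoid the numerical issues mentioned above, and the k range is just an example):

import matplotlib.pyplot as plt
from glmpca import glmpca

output = None
init = {'factors': None, 'loadings': None}
for k in range(2, 11):  # example range: small k upward
    if output is not None:
        # warm-start from the previous (smaller-k) fit;
        # glmpca fills the extra dimension with small random values
        init = {'factors': output['factors'],
                'loadings': output['loadings']}
    output = glmpca.glmpca(Y, k, init=init, fam='poi')
    plt.plot(output['dev'], label="k=%d" % k)

plt.xlabel("iteration")
plt.ylabel("deviance")
plt.legend()
plt.show()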
Beginning with small k and increasing helped sometimes, and not others. I am curious: it seems like if the additional columns are small and random, then the initial deviance should be the same as that of the init (i.e., the final deviance from the previous round). This does not seem to be the case.
See gist
(Given that most of the time is spent fine-tuning at the end, this wouldn't necessarily give a large speedup. I'm more curious to understand what's happening.)
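For a concrete check of that expectation (a sketch under the same assumptions as above; it relies on glmpca padding a lower-dimensional init with small random values, as described earlier):

from glmpca import glmpca

# fit at k = 5, then warm-start k = 6 from it (Y is the counts matrix)
fit5 = glmpca.glmpca(Y, 5, fam='poi')
fit6 = glmpca.glmpca(Y, 6, fam='poi',
                     init={'factors': fit5['factors'],
                           'loadings': fit5['loadings']})

# if the extra column really starts near zero, these two numbers should be close;
# the gist suggests they are not
print(fit5['dev'][-1], fit6['dev'][0])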
I did run into errors with penalty=0 on another example, so I have set it to a small value. For integer data, is there a natural scale for the penalty? Does it represent a Bayesian prior on coefficient sizes or something (as in ridge regression, where it represents a Gaussian prior), and if so, is there intuition for what value to choose for, e.g., scRNA-seq data?
Thanks for all the help understanding your work!
Sorry for the slow reply; I'm traveling right now. I'm not sure why the deviance jumps when adding an extra dimension; I will have to look into it.
Regarding the "natural scale" for the penalty: it does have an interpretation as the precision (1/variance) of a Gaussian prior placed i.i.d. on each element of both the factors and the loadings (U and V). If covariates (X, Z) are included, their coefficients are not penalized. Beyond that, though, I'm not sure how to interpret the penalty. According to this generative model, we are dealing with sums of products of normal distributions that are then passed through the inverse link function (exp). It might be possible to work out a more concrete intuition, though.
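To spell out that correspondence (a rough sketch only; the exact constants and scaling used in the implementation may differ):

u_{ij}, v_{ij} \overset{\text{iid}}{\sim} \mathcal{N}(0, 1/\lambda)
\quad\Longleftrightarrow\quad
-\log p(U, V) = \frac{\lambda}{2}\left(\lVert U\rVert_F^2 + \lVert V\rVert_F^2\right) + \text{const}

So, up to constants (and the factor of two between deviance and negative log-likelihood), the penalty adds a ridge term on U and V to the objective, and a larger penalty corresponds to a tighter prior around zero.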