
Batch correction - strange convergence

Open hanbin973 opened this issue 4 years ago • 11 comments

Hi, I found something strange when using batch correction. These are the two results from the Muraro (pancreas) and MacParland (liver) datasets: this is from Muraro [image] and this is from MacParland [image]. They had 4 and 5 donors respectively. The command I used is the following:

RunGLMPCA(obj, X=model.matrix(as.formula('~factor(batch)')))

The result is less strange when I remove the batch covariate matrix X from the model, but then there is no batch correction. For example, in the Muraro et al. pancreas dataset [image] the batch effect is present (in orange) but the overall result looks less weird.

Any solutions possible at this time?

hanbin973 avatar Apr 22 '21 00:04 hanbin973

Hello and thanks for trying glmpca. That is indeed a weird result! To be honest I have not thoroughly tested the batch correction feature. However, I suspect there is a numerical problem. By default, glmpca will subtract off the column means of the covariate matrix X and tack on a column of all ones. Since your provided model.matrix is already going to have a column of ones, I believe this is creating a duplicate column of ones. This collinearity is likely to cause numerical problems. Please try the following and let me know what happens:

X <- model.matrix(as.formula('~factor(batch)'))
RunGLMPCA(obj, X=X[,2:ncol(X)])

In case you are interested, here is the relevant part of the code.
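
To make the collinearity concrete, here is a minimal sketch (using a hypothetical data frame `meta` with a `batch` column, not the package's internal code):

# model.matrix() adds an "(Intercept)" column of all ones by default
meta <- data.frame(batch = factor(c("a", "a", "b", "b")))
X <- model.matrix(~ batch, data = meta)
head(X)  # first column is all ones

# glmpca centers X and appends its own intercept column internally,
# so keeping the model.matrix intercept produces two collinear columns.
# Dropping it before the call avoids the duplication:
X_nointercept <- X[, -1, drop = FALSE]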

willtownes avatar Apr 22 '21 01:04 willtownes

Note to self, it would be good to check for this condition and drop the extra column automatically in the initialization code.
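
A hypothetical sketch of such a check (an illustration of the idea above, not the actual initialization code):

# Drop any constant columns (e.g. a duplicate all-ones intercept) from the
# user-supplied covariate matrix before the package adds its own intercept.
drop_constant_cols <- function(X) {
  keep <- apply(X, 2, function(v) length(unique(v)) > 1)
  X[, keep, drop = FALSE]
}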

willtownes avatar Apr 22 '21 01:04 willtownes

I tested after dropping the first column as you suggested, but I am getting nearly the same result:

[image]

The code:

X <- as.matrix(model.matrix(as.formula('~factor(donor)'), pbmc@meta.data))
X <- X[,2:ncol(X)]
pbmc_gpca <- RunGLMPCA(pbmc, L=30, 
                       X=X, 
                       sz=size_factor)

hanbin973 avatar Apr 22 '21 03:04 hanbin973

Yikes, still pretty weird! What happens if you set L=2 and/or set the learning rate to something small like 0.0001? To do the latter, you can pass it in via the "ctl" list: ctl=list(lr=.001). The default learning rate is 0.1. And how many features (genes) are in the pbmc matrix?
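
For concreteness, a sketch of those two checks using the same call pattern as earlier in this thread (the ctl argument is passed through as described above; untested here):

# 1) Only two latent dimensions, so the factors can be plotted directly:
pbmc_gpca_L2 <- RunGLMPCA(pbmc, L = 2, X = X, sz = size_factor)

# 2) Same model but with a much smaller learning rate via the control list:
pbmc_gpca_slow <- RunGLMPCA(pbmc, L = 30, X = X, sz = size_factor,
                            ctl = list(lr = 0.001))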

willtownes avatar Apr 22 '21 04:04 willtownes

I supplied 2000 features (feature selection was performed using Seurat's default vst). By the way, thank you for the super fast replies. I am trying to adjust the lr now.

p.s. The resulting plots are GLM-PCA output passed to t-SNE (Python openTSNE) with default settings.

hanbin973 avatar Apr 22 '21 04:04 hanbin973

Ohhh, that makes more sense that they are from tSNE. Please try modifying the perplexity parameter on the tSNE to see if that makes a difference. It might also be worth trying the Rtsne package and one of the UMAP implementations with glmpca initialization just in case there is a bug in openTSNE. I have seen tSNE sometimes do weird stuff even with standard PCA initializations. If you run GLM-PCA with L=2 you can directly visualize the latent space without needing to pass to tSNE, so that should also be helpful to compare.
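
As a sketch of that comparison (assuming `factors` holds the cells-by-dimensions GLM-PCA factor matrix extracted from the Seurat object and `donor` the batch labels; parameter choices are illustrative):

library(Rtsne)
# t-SNE on the GLM-PCA factors with an explicit perplexity; pca = FALSE
# because the input is already a low-dimensional embedding.
set.seed(1)
tsne_out <- Rtsne(factors, perplexity = 30, pca = FALSE)
plot(tsne_out$Y, col = as.integer(factor(donor)), pch = 20,
     xlab = "tSNE1", ylab = "tSNE2")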

willtownes avatar Apr 22 '21 04:04 willtownes

[image] Adjusting the learning rate didn't help (X with dropped intercept, openTSNE).

The next figure was produced after reducing the perplexity, but it is still the same [image].

This one is pure GLM-PCA with L=2 [image], which looks OK.

This one is GLM-PCA (X with dropped intercept, lr reduced to 0.001) + t-SNE (Seurat's TSNE), which is clearly strange [image].

This one is GLM-PCA + UMAP (other parameters are the same as in the previous t-SNE setting) [image].

I think neither the learning rate, nor the specific t-SNE implementation, nor the UMAP implementation is the problem. Probably L > 5-10 combined with the batch covariate matrix is not working properly in the current implementation.

hanbin973 avatar Apr 22 '21 04:04 hanbin973

Same problem in UMAP. Any updates?

chaudhry123 avatar Nov 28 '22 16:11 chaudhry123

Sorry, I haven't had a chance to work on this, and probably won't for some time. If anyone else gets a chance to figure it out I'd be curious to know what's going on!

willtownes avatar Nov 28 '22 19:11 willtownes

Not sure, but maybe an initialization scheme via (computationally much cheaper) PCA might help. Is there an option in GLM-PCA that takes initial latent coordinates?

hanbin973 avatar Nov 29 '22 00:11 hanbin973

Yes, there is an option called init where you can pass in pre-estimated factors, loadings, or both. I have tried it a few times, but usually it takes just as long (if not longer) to converge as a random initialization.
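
A minimal sketch of that workflow with the glmpca package directly, assuming `Y` is the genes-by-cells count matrix and `X` the covariate matrix as above (the init list with a factors element is as described in the reply above):

library(glmpca)
# Cheap PCA on log-transformed counts to get starting factors (cells x L):
pca_factors <- prcomp(t(log1p(as.matrix(Y))), rank. = 10)$x
fit <- glmpca(Y, L = 10, X = X,
              init = list(factors = pca_factors))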

willtownes avatar Dec 01 '22 23:12 willtownes