Speed up Milo tutorial
Description of feature
Currently takes 30+ minutes to run, which is too long. Maybe subsample or something else. Runtime took a hit when we switched to pydeseq2, but I don't want to go back.
I had the same issue when I switched between two apparently similar datasets. I did some investigating and I think the reason is this behaviour of `fit_size_factors`:
> Uses the median-of-ratios method: see :func:`pydeseq2.preprocessing.deseq2_norm`, unless each gene has at least one sample with zero read counts, in which case it switches to the ``iterative`` method.
It's the iterative method that is very slow (which makes sense looking at the code). The only solution seems to be subsampling, or potentially a larger nhood size, which increases the probability that at least one nhood has non-zero counts in all samples.
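For reference, a quick way to check which path pydeseq2 will take is to count the nhoods that are observed in every sample (a minimal sketch; `mdata["milo"]`, its samples-by-nhoods orientation, and the sparse `.X` are assumptions based on the pertpy tutorial):

```python
import numpy as np

# Sample x nhood count matrix produced by milo.count_nhoods (orientation
# assumed; transpose if your version stores nhoods as rows).
counts = np.asarray(mdata["milo"].X.todense())

# pydeseq2's fit_size_factors falls back to the slow "iterative" method when
# every gene (here: nhood) has at least one sample with zero counts, i.e.
# when no single nhood is observed in all samples.
nhoods_all_nonzero = int((counts > 0).all(axis=0).sum())
print(f"{nhoods_all_nonzero} nhoods with non-zero counts in every sample")
# 0 -> iterative method will be used; >0 -> fast median-of-ratios applies.
```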
Ahh, thank you! Would you be interested in playing around with https://github.com/scverse/pertpy-tutorials/blob/main/milo.ipynb to see whether you can make it use the non-iterative method?
Hi @Zethson , I'll give it a go and will report back.
Dear @FrenkT, have you been able to find anything?
Hey @Zethson , sorry for the delay. Here is what I found.
Increasing neighborhood size moves things in the right direction, but even with `n_neighbors=300` the milo count matrix is still quite sparse.
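For context, this is the change meant here (a sketch; `"X_scVI"` is assumed to be the embedding key used in the tutorial):

```python
import scanpy as sc

# Rebuild the KNN graph with a much larger neighborhood before make_nhoods.
sc.pp.neighbors(adata, use_rep="X_scVI", n_neighbors=300)
```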
I then noticed from the UMAP that there is still quite a significant batch effect, e.g. some clusters of CD4 T cells seem to come almost entirely from the "Cambridge" site.
So BBKNN with `sc.external.pp.bbknn(adata, batch_key="Site", use_rep="X_scVI", key_added="scvi_conn_bb", neighbors_within_batch=50)` seems to solve the problem already (see the sketch below). I am using 50 because, if I understand correctly, it should result in an effective neighborhood size of 50 * 3 sites = 150. This would require adding an extra dependency, though, which might be inconvenient.
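Roughly, the workflow would look like this (a sketch; passing the graph to Milo via `neighbors_key` follows milopy's API and is assumed to work the same way in pertpy):

```python
import scanpy as sc
import pertpy as pt

# Batch-balanced KNN across the three sites, stored under a dedicated key.
sc.external.pp.bbknn(
    adata,
    batch_key="Site",
    use_rep="X_scVI",
    key_added="scvi_conn_bb",
    neighbors_within_batch=50,
)

# Build neighbourhoods on the BBKNN graph instead of the default one
# (assumes make_nhoods accepts a neighbors_key, as in milopy).
milo = pt.tl.Milo()
mdata = milo.load(adata)
milo.make_nhoods(mdata["rna"], neighbors_key="scvi_conn_bb")
```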
That leaves us with subsampling by site (e.g. keeping "Cambridge" only). I also tried just excluding "sanger", but that doesn't seem to solve the problem.
I also tried subsampling by cell type, e.g. `adata[adata.obs["cell_type"].str.contains("B cell")]` or `adata[adata.obs["cell_type"].str.contains("T cell")]`, but the sparsity is still there (see the sketch below). So it seems like site batch effects are the root cause.
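For completeness, these are the subsets tried (a sketch; the exact casing of the `Site` labels is an assumption):

```python
# Subsampling attempts; none fully removes the sparsity on its own.
cambridge_only = adata[adata.obs["Site"] == "Cambridge"].copy()
no_sanger = adata[adata.obs["Site"] != "Sanger"].copy()  # label casing assumed
b_cells = adata[adata.obs["cell_type"].str.contains("B cell")].copy()
t_cells = adata[adata.obs["cell_type"].str.contains("T cell")].copy()
```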
Any thoughts? Happy to open a pull request in the tutorials repo based on your feedback.
No worries!
> Increasing neighborhood size moves things in the right direction, but even with `n_neighbors=300` the milo count matrix is still quite sparse.
That's also a huge number of neighbors.
I wonder whether scanpy's ComBat might help (rough sketch below)? But it would also potentially be scientifically questionable...
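For reference, that would be something like this (a minimal sketch; `key="Site"` is carried over from the BBKNN call above, and the graph would then need to be rebuilt from the corrected matrix rather than the scVI embedding):

```python
import scanpy as sc

# ComBat batch correction on the expression matrix (not the scVI embedding).
sc.pp.combat(adata, key="Site")

# Rebuild the embedding and KNN graph from the corrected expression values.
sc.pp.pca(adata)
sc.pp.neighbors(adata)
```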
@emdann any thoughts on this whole issue?