Speed up Milo tutorial
Description of feature
Currently takes 30+ minutes to run, which is too long. Maybe subsample or something else. Runtime took a hit when we switched to pydeseq2, but I don't want to go back.
I had the same issue when I switched between two apparently similar datasets. I did some investigating and I think the reason is this behaviour of `fit_size_factors`:
> Uses the median-of-ratios method: see :func:`pydeseq2.preprocessing.deseq2_norm`, unless each gene has at least one sample with zero read counts, in which case it switches to the ``iterative`` method.
It's the iterative method that is very slow (which makes sense looking at the code). The only solution seems to be subsampling, or potentially a larger nhood size, which increases the probability that at least one nhood has non-zero counts in all samples.
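For reference, a quick way to check which path pydeseq2 will take is to count the nhoods that are observed in every sample (a minimal sketch; `mdata["milo"]`, its samples-by-nhoods orientation, and the sparse `.X` are assumptions based on the pertpy tutorial):

```python
import numpy as np

# Sample x nhood count matrix produced by milo.count_nhoods (orientation
# assumed; transpose if your version stores nhoods as rows).
counts = np.asarray(mdata["milo"].X.todense())

# pydeseq2's fit_size_factors falls back to the slow "iterative" method when
# every gene (here: nhood) has at least one sample with zero counts, i.e.
# when no single nhood is observed in all samples.
nhoods_all_nonzero = int((counts > 0).all(axis=0).sum())
print(f"{nhoods_all_nonzero} nhoods with non-zero counts in every sample")
# 0 -> iterative method will be used; >0 -> fast median-of-ratios applies.
```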
Ahh, thank you! Would you be interested in playing around with https://github.com/scverse/pertpy-tutorials/blob/main/milo.ipynb to see whether you can make it use the non-iterative method?
Hi @Zethson , I'll give it a go and will report back.
Dear @FrenkT, have you been able to find anything?
Hey @Zethson , sorry for the delay. Here is what I found.
Increasing neighborhood size moves things in the right direction, but even with `n_neighbors=300` the milo count matrix is still quite sparse.
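For context, this is the change meant here (a sketch; `"X_scVI"` is assumed to be the embedding key used in the tutorial):

```python
import scanpy as sc

# Rebuild the KNN graph with a much larger neighborhood before make_nhoods.
sc.pp.neighbors(adata, use_rep="X_scVI", n_neighbors=300)
```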
I then noticed from the UMAP that there is still quite a significant batch effect, e.g. some clusters of CD4 T cells seem to come almost entirely from the "Cambridge" site.
So BBKNN with `sc.external.pp.bbknn(adata, batch_key="Site", use_rep="X_scVI", key_added="scvi_conn_bb", neighbors_within_batch=50)` seems to solve the problem already (see the sketch below). I am using 50 because, if I understand correctly, it should result in an effective neighborhood size of 50 * 3 sites = 150. This would require adding an extra dependency, though, which might be inconvenient.
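Roughly, the workflow would look like this (a sketch; passing the graph to Milo via `neighbors_key` follows milopy's API and is assumed to work the same way in pertpy):

```python
import scanpy as sc
import pertpy as pt

# Batch-balanced KNN across the three sites, stored under a dedicated key.
sc.external.pp.bbknn(
    adata,
    batch_key="Site",
    use_rep="X_scVI",
    key_added="scvi_conn_bb",
    neighbors_within_batch=50,
)

# Build neighbourhoods on the BBKNN graph instead of the default one
# (assumes make_nhoods accepts a neighbors_key, as in milopy).
milo = pt.tl.Milo()
mdata = milo.load(adata)
milo.make_nhoods(mdata["rna"], neighbors_key="scvi_conn_bb")
```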
That leaves us with subsampling by site (e.g. keeping "Cambridge" only). I also tried just excluding "sanger", but that doesn't seem to solve the problem.
I also tried subsampling by cell type, e.g. `adata[adata.obs["cell_type"].str.contains("B cell")]` or `adata[adata.obs["cell_type"].str.contains("T cell")]`, but the sparsity is still there (see the sketch below). So it seems like site batch effects are the root cause.
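For completeness, these are the subsets tried (a sketch; the exact casing of the `Site` labels is an assumption):

```python
# Subsampling attempts; none fully removes the sparsity on its own.
cambridge_only = adata[adata.obs["Site"] == "Cambridge"].copy()
no_sanger = adata[adata.obs["Site"] != "Sanger"].copy()  # label casing assumed
b_cells = adata[adata.obs["cell_type"].str.contains("B cell")].copy()
t_cells = adata[adata.obs["cell_type"].str.contains("T cell")].copy()
```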
Any thoughts? Happy to open a pull request in the tutorials repo based on your feedback.
No worries!
> Increasing neighborhood size moves things in the right direction, but even with `n_neighbors=300` the milo count matrix is still quite sparse.
That's also a huge number of neighbors.
I wonder whether scanpy's ComBat might help (rough sketch below)? But it would also potentially be scientifically questionable...
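For reference, that would be something like this (a minimal sketch; `key="Site"` is carried over from the BBKNN call above, and the graph would then need to be rebuilt from the corrected matrix rather than the scVI embedding):

```python
import scanpy as sc

# ComBat batch correction on the expression matrix (not the scVI embedding).
sc.pp.combat(adata, key="Site")

# Rebuild the embedding and KNN graph from the corrected expression values.
sc.pp.pca(adata)
sc.pp.neighbors(adata)
```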
@emdann any thoughts on this whole issue?