kBET on Harmony or CCA integration

Open martina811 opened this issue 9 months ago • 1 comments

Hello! I am trying to run kBET on a very large integrated dataset (40,000 genes across 180,000 cells), and I have a few questions about it.

Since neither Harmony nor CCA generates batch-corrected counts, what should I pass as input to kBET? If I want to compare the result with the unintegrated dataset (where the counts are unchanged), is it better to give kBET the embeddings as input?
I am having trouble running kBET on the whole dataset. Is there a maximum size for the input data frame?

Thank you for your help!

Mar 07 '25 12:03 martina811

Hi @martina811, thank you for your questions!

You can run kBET on the embeddings directly. Internally, kBET computes a k-nearest-neighbor graph to assess batch effects, so it does not need batch-corrected counts.
You can also tweak kBET to run faster, which may make it feasible on your larger dataset. The fastest option is usually to pass a precomputed k-nearest-neighbor graph in the same structure that the FNN package returns, turn off any pre-processing, and set a fixed neighborhood size k (see https://github.com/theislab/kBET?tab=readme-ov-file#variations).
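
To illustrate why an embedding is enough as input, here is a minimal Python sketch of the underlying idea: find each cell's k nearest neighbors in the embedding and compare the batch composition of that neighborhood with the global batch frequencies using a Pearson chi-square statistic. This is a simplified illustration, not the kBET package's implementation. The function names `knn_indices` and `batch_mixing_stat`, the `n_test` subsampling argument, and the toy data are all made up for this example, and the sketch stops at the raw statistic instead of computing p-values or a rejection rate.

```python
import math
import random
from collections import Counter


def knn_indices(points, query_idx, k):
    """Return indices of the k nearest neighbours of points[query_idx] (brute force)."""
    q = points[query_idx]
    dists = sorted(
        (math.dist(q, p), i) for i, p in enumerate(points) if i != query_idx
    )
    return [i for _, i in dists[:k]]


def batch_mixing_stat(points, batches, k, n_test=None, seed=0):
    """Mean Pearson chi-square statistic of local vs. global batch composition.

    Hypothetical helper for illustration only. Values near zero mean the
    neighbourhoods look like the whole dataset (well mixed); large values
    mean the neighbourhoods are dominated by single batches.
    """
    n = len(points)
    global_freq = {b: c / n for b, c in Counter(batches).items()}
    rng = random.Random(seed)
    # Optionally test only a random subset of cells to save time.
    tested = range(n) if n_test is None else rng.sample(range(n), n_test)
    total = 0.0
    for i in tested:
        local = Counter(batches[j] for j in knn_indices(points, i, k))
        # Expected count of batch b among k neighbours is k * global frequency.
        total += sum(
            (local.get(b, 0) - k * f) ** 2 / (k * f) for b, f in global_freq.items()
        )
    return total / len(tested)


# Toy 2-D "embeddings": the same cells, once well mixed and once split by batch.
rng = random.Random(1)
mixed = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
labels = ["a", "b"] * 100
separated = [(x + (50 if lab == "b" else 0), y) for (x, y), lab in zip(mixed, labels)]

mixed_stat = batch_mixing_stat(mixed, labels, k=10)
separated_stat = batch_mixing_stat(separated, labels, k=10)
print(f"mixed: {mixed_stat:.2f}, separated: {separated_stat:.2f}")
```

The brute-force neighbor search above is only suitable for toy data. At 180,000 cells you would want a dedicated nearest-neighbor library and a fixed k, which is the same motivation as the precomputed-graph suggestion above.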

I hope that helps!

Mar 07 '25 16:03 mbuttner