Which kind of normalization for 10X data is preferred for kBET?

Open Smilenone opened this issue 10 months ago • 3 comments

Thanks for such a good tool for assessing batch correction in single-cell RNA-seq data. I have 180k cells across 20 patients and I would like to check whether there is a batch effect. Which kind of data is preferred for kBET: raw counts with all genes, log(CPM+1) data with selected highly variable genes, or z-score normalized log(CPM+1) data? Do I have to select highly variable genes or use principal components as input?

Smilenone, Mar 30 '24 05:03

Hi @Smilenone, thank you for your appreciation!

On the choice of data normalization as input: when assessing a batch effect, one essentially wants to understand, first, whether there is a batch effect to begin with (i.e. do we need to correct for it? For patient data, the answer is often yes) and, second, which tool is best suited for the batch correction. I would start with normalized data (log(CPM+1) or scran), because those normalizations worked well in most cases.

On the technicalities of kBET: before you start, I suggest downsampling your data, because kBET by default computes the k-nearest neighbor graph on a dense matrix, which does not scale to large data.

There are a few tricks that will speed up the kBET computation, especially if you don't want to downsample. The slowest step is the computation of the k-nearest neighbors, and there are certainly more efficient algorithms around than the one used in the kBET package. In that case, I would normalize the data (e.g. log(CPM+1)), then select highly variable genes, then compute a PCA (with or without z-score scaling is up to you; I usually do not scale), then compute a k-nearest neighbor graph and pass this object along with the data matrix to kBET, while turning off the PCA step and the k-nearest neighbor step and fixing the number of nearest neighbors to use. That number should be in the ballpark of the number of batches * 5, such that a sufficient number of cells is expected per batch.

I hope that helps! Please let me know if you have further questions.

mbuttner, Apr 01 '24 08:04
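
The first preprocessing steps from the reply above can be sketched in a few lines. This is a minimal illustration in plain Python, not kBET's own code or API: the function names `log_cpm`, `neighbourhood_size`, and `downsample_per_batch` are hypothetical, the natural logarithm is an assumption (the thread does not say which base is meant), and sampling the same fraction from every batch is just one possible way to downsample. Highly variable gene selection, PCA, and the neighbor search itself are left to whichever library you already use.

```python
import math
import random


def log_cpm(counts):
    """Per-cell log(CPM + 1) for a cells x genes list of raw count rows.

    Assumption: natural logarithm; swap in math.log2 if you prefer base 2.
    """
    normalized = []
    for row in counts:
        total = sum(row)
        scale = 1e6 / total if total > 0 else 0.0  # guard against empty cells
        normalized.append([math.log1p(c * scale) for c in row])
    return normalized


def neighbourhood_size(n_batches, cells_per_batch=5):
    """Rule of thumb from the reply: about (number of batches * 5) neighbours."""
    return n_batches * cells_per_batch


def downsample_per_batch(batch_labels, fraction, seed=0):
    """Return sorted cell indices, keeping the same fraction of every batch.

    Assumption: stratifying by batch is our choice; the reply only says
    to downsample.
    """
    rng = random.Random(seed)
    by_batch = {}
    for idx, label in enumerate(batch_labels):
        by_batch.setdefault(label, []).append(idx)
    keep = []
    for label in sorted(by_batch):
        members = by_batch[label]
        n_keep = max(1, round(len(members) * fraction))
        keep.extend(rng.sample(members, n_keep))
    return sorted(keep)


# Toy usage: three cells x two genes, and the 20 patients from the question.
toy_counts = [[1, 1], [0, 0], [3, 1]]
toy_norm = log_cpm(toy_counts)
k_for_20_patients = neighbourhood_size(20)
kept = downsample_per_batch(["a"] * 10 + ["b"] * 20, fraction=0.5)
```

For the 20 patients in the question, the rule of thumb gives a neighborhood size of about 100 cells.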

Thanks for your detailed response! I have one more question: should I use average.pval to evaluate whether there is a batch effect in my data? My understanding is that average.pval < 0.05 means there is a batch effect in my data. Am I right?

Smilenone, Apr 07 '24 15:04

In general, please be aware that kBET is probably the most sensitive tool when it comes to batch effects. We realized that the p-value may be extremely low even for very small batch effects, which may not bias your data much. The average rejection rate is therefore the most telling metric, and you can use the comparison of p-values between the null and the actual data as a sanity check.

mbuttner, Apr 10 '24 11:04
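
To make the idea of a rejection rate concrete, here is a simplified, self-contained sketch in plain Python. It is not the kBET implementation: it assumes a Pearson chi-squared test on the batch composition of each cell's k nearest neighbors, a brute-force neighbor search, and shuffled batch labels as the null. The names `chi2_sf` and `rejection_rate` and the toy data are made up for illustration, and the sketch needs at least two batches.

```python
import math
import random


def chi2_sf(x, df):
    """Survival function of the chi-squared distribution for integer df >= 1."""
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)             # Q(1, y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))  # Q(1/2, y)
    while a < df / 2.0:
        # recurrence Q(a + 1, y) = Q(a, y) + y^a * exp(-y) / Gamma(a + 1)
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def rejection_rate(points, batches, k, alpha=0.05):
    """Fraction of cells whose neighbourhood batch mix differs from the global mix."""
    labels = sorted(set(batches))
    n = len(points)
    freq = {b: batches.count(b) / n for b in labels}
    rejected = 0
    for i in range(n):
        # brute-force k nearest neighbours of cell i, excluding the cell itself
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        neighbours = others[:k]
        stat = 0.0
        for b in labels:
            observed = sum(1 for j in neighbours if batches[j] == b)
            expected = k * freq[b]
            stat += (observed - expected) ** 2 / expected
        if chi2_sf(stat, len(labels) - 1) < alpha:
            rejected += 1
    return rejected / n


# Toy data: two batches of 100 cells each in two dimensions.
rng = random.Random(1)
n_cells = 200
batches = ["A", "B"] * (n_cells // 2)
mixed = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n_cells)]
# Simulated batch effect: move every batch-B cell far along the first axis.
shifted = [(x + 8.0, y) if b == "B" else (x, y)
           for (x, y), b in zip(mixed, batches)]
permuted = batches[:]
rng.shuffle(permuted)  # null: batch labels carry no information

rate_mixed = rejection_rate(mixed, batches, k=10)
rate_shifted = rejection_rate(shifted, batches, k=10)
rate_null = rejection_rate(shifted, permuted, k=10)
print(rate_mixed, rate_shifted, rate_null)
```

In this toy setting, the rejection rate on the shifted data should be far above the rate obtained with shuffled labels, while the well-mixed data should stay close to the null. Comparing the observed rate against the null, rather than thresholding a single p-value at 0.05, is the kind of reading the reply above recommends.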