Format of batch input: vector or factor?

Open martina811 opened this issue 9 months ago • 1 comments

Hello, I noticed an issue while using kBET on a subsample of my data: the result of kBET differs depending on whether I declare batch as a vector or as a factor. An example is attached below.

Here I declare batch as a factor:

batch <- setNames(as.factor(metadatad$batch), metadatad$Cell.ID)
batch_tmp <- batch[clusters == "80"]  # example with cluster 80
class(batch_tmp)

[1] "factor"

kBET_tmp.factor <- kBET(df=data_tmp, batch=batch_tmp, plot=FALSE, verbose=TRUE)

Initial neighbourhood size is set to 100.
reducing dimensions with svd first...
finding knns...done. Time:
   user  system elapsed
  0.045   0.002   0.048
KNN input is a list, extracting nearest neighbour index.
Number of kBET tests is set to 54.
There are 62 cells (11.567%) that do not appear in any neighbourhood.
The expected frequencies for each category have been adapted.
Cell indexes are saved to result list.
Determining optimal neighbourhood size ...done.
New size of neighbourhood is set to 32.

kBET_tmp.factor$summary$kBET.observed

[1] NaN NA NA NA

If I declare batch as a character vector:

batch_vector <- setNames(as.character(batch_tmp), names(batch_tmp))
class(batch_vector)

[1] "character"

kBET_tmp.vector <- kBET(df=data_tmp, batch=batch_vector, plot=FALSE, verbose=TRUE)

Initial neighbourhood size is set to 100.
reducing dimensions with svd first...
finding knns...done. Time:
   user  system elapsed
  0.045   0.003   0.047
KNN input is a list, extracting nearest neighbour index.
Number of kBET tests is set to 54.
There are 62 cells (11.567%) that do not appear in any neighbourhood.
The expected frequencies for each category have been adapted.
Cell indexes are saved to result list.
Determining optimal neighbourhood size ...done.
New size of neighbourhood is set to 21.
There were 40 warnings (use warnings() to see them)

warnings()
In full.classes[class.freq$class %in% names(freq.env)] <- freq.env :
  number of items to replace is not a multiple of replacement length

kBET_tmp.vector$summary$kBET.observed

[1] 0.01666667 0.00000000 0.01851852 0.05555556

And this is the number of cells per batch in the cluster:

table(batch_vector)
  A   B   C   D
  2 526   7   1

So I was wondering whether there is a recommended format for batch so that kBET runs properly, and why I got different results.

Mar 14 '25 12:03 martina811
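One base R behaviour worth ruling out when comparing the two runs: subsetting a factor keeps every level of the original factor, including levels with zero cells in the subset, whereas a character vector carries no such levels. The sketch below uses made-up labels (`batch_all`, `keep` and the batch "E" are hypothetical, not taken from this issue) and only illustrates that difference. Whether this explains the NaN values reported above is not confirmed.

```r
# Hypothetical labels: batch "E" exists in the full data but not in the subset
batch_all <- factor(c("A", "B", "B", "C", "E", "E"))
keep      <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE)

# Subsetting a factor keeps the unused level "E"
batch_sub <- batch_all[keep]
levels(batch_sub)
table(batch_sub)    # "E" is listed with a count of 0

# A character vector only knows the labels that are actually present
batch_chr <- as.character(batch_sub)
table(batch_chr)    # no "E" column

# droplevels() removes unused levels while keeping the factor class
batch_clean <- droplevels(batch_sub)
levels(batch_clean)
```

If `table(batch_tmp)` shows zero counts for some batches, calling `droplevels(batch_tmp)` before running kBET would make the factor and character inputs describe the same set of batches.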

Hi @martina811, thank you for pointing out this issue. I tested kBET with batch labels encoded as integers and as factors. My recommendation is to use the input for which the observed kBET values are not NaN.

Also, thank you for sharing the observed cell counts per batch. I am worried that the distribution is very skewed, with 98% of cells belonging to the same batch, so that the lower-bound p-value of the underlying chi-square test is comparatively large. For example, in a neighbourhood of 100 cells, I would expect to see only about two cells from another batch (e.g. batch C); I think you can then compute the theoretical lower-bound p-value yourself. The point is that you might already achieve pretty good kBET values even if the 10 cells originating from batches A, C and D are isolated (not having mutual nearest neighbours). So, in addition to computing kBET, I would check the connectivity of these cells as well. I hope that helps.
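A minimal sketch of that calculation, under one possible reading of "lower bound": the p-value of a Pearson chi-square test for a neighbourhood that contains only cells of the majority batch B. The function name `pvalue_all_majority` is hypothetical, and the choice of degrees of freedom (number of batches minus one) is an assumption about the test, not something confirmed in this thread. The counts are the ones reported above; 32 and 21 are the neighbourhood sizes from the two logs.

```r
# Cell counts per batch as reported in this issue (cluster 80)
counts <- c(A = 2, B = 526, C = 7, D = 1)
freq   <- counts / sum(counts)   # global batch frequencies

# Pearson chi-square p-value for a neighbourhood of size k that contains
# only cells of the majority batch.
# Assumption: df = number of batches - 1.
pvalue_all_majority <- function(k, freq, majority = "B") {
  expected <- k * freq
  observed <- setNames(numeric(length(freq)), names(freq))
  observed[majority] <- k
  stat <- sum((observed - expected)^2 / expected)
  pchisq(stat, df = length(freq) - 1, lower.tail = FALSE)
}

# Expected number of cells per batch in a neighbourhood of 100 cells
round(100 * freq, 2)

# p-values for the initial size (100) and the two adapted sizes (32, 21)
sapply(c(100, 32, 21), pvalue_all_majority, freq = freq)
```

Under these assumptions the test statistic simplifies to k * 10 / 526, which is roughly 1.9 for k = 100. That is well below the 0.05 critical value of about 7.81 for 3 degrees of freedom, so a neighbourhood made up purely of batch B cells would not be rejected.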

Mar 20 '25 22:03 mbuttner
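The connectivity check suggested above could look roughly like the following. This is a sketch on simulated toy data, not on the data from this issue: `toy_data`, `toy_batch` and the neighbourhood size `k` are all hypothetical, and the k-nearest-neighbour search is a brute-force one built from base R's `dist()`, not the search kBET uses internally. For each minority-batch cell it reports how often the cell appears in other cells' neighbourhoods and what fraction of its own neighbours come from a different batch.

```r
# Toy data only: 60 simulated cells in 2 dimensions, NOT the data from this issue
set.seed(1)
n_major <- 54
n_minor <- 6
toy_data  <- rbind(matrix(rnorm(n_major * 2), ncol = 2),
                   matrix(rnorm(n_minor * 2, mean = 6), ncol = 2))
toy_batch <- factor(c(rep("B", n_major), rep("C", n_minor)))

k <- 10  # hypothetical neighbourhood size

# Brute-force k nearest neighbours from the full Euclidean distance matrix
d <- as.matrix(dist(toy_data))
diag(d) <- Inf  # a cell is not its own neighbour
knn_idx <- t(apply(d, 1, function(row) order(row)[seq_len(k)]))

# In-degree: in how many neighbourhoods does each cell appear?
in_degree <- tabulate(as.vector(knn_idx), nbins = nrow(toy_data))

# Fraction of each cell's neighbours that belong to a different batch
neighbour_batch <- matrix(as.character(toy_batch)[knn_idx], nrow = nrow(knn_idx))
frac_other <- rowMeans(neighbour_batch != as.character(toy_batch))

# Summary for the minority-batch cells only
minor <- toy_batch != "B"
data.frame(batch      = toy_batch[minor],
           in_degree  = in_degree[minor],
           frac_other = frac_other[minor])
```

Minority cells with an in-degree of zero, or whose neighbours are almost all from their own batch, would be the isolated cells to look at more closely.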