
Extremely long run time

Open csennis opened this issue 6 years ago • 2 comments

Hello! Thank you for developing this tool.

I am currently trying to run the initial kBET command, but am having difficulty due to an extremely long run time (>24 hours).

My inputs are as follows:

  • df: a matrix where rows are cells (1157) and columns are genes (2000)
  • batch: a vector of factors (length 1157)

When I set verbose = TRUE on this command, I see that it is selecting an optimal number of nearest neighbors of 433, which confuses me a bit seeing as there are only 1157 cells. In addition, I can tell that the code is stuck at identifying knns.

What can I do to optimize the format of my data in order to achieve a run time similar to what is stated on your code (~2 min for ~1k cells)?

Thank you so much!

csennis avatar Aug 01 '19 14:08 csennis

Hi csennis, I'm sorry to hear about the poor performance of the knn search. I cannot comment on the optimal neighbourhood size because it depends on the number of batches. With 2 equally sized batches, 433 lies within the range of reasonable values. Further, selecting an optimal neighbourhood size requires computing a knn-graph, so once the algorithm has selected an optimal neighbourhood size, it has already computed the knn-graph. That is why a stall at the knn step confuses me.

However, kBET offers the possibility to compute the knn-graph separately for example as follows:

library('FNN')
# data: a matrix (rows: samples, columns: features (genes))
# batch: a factor with one entry per row of data
k0 <- 433  # your previously determined optimal neighbourhood size
# compute the knn-graph once, outside of kBET
knn <- get.knn(data, k = k0, algorithm = 'cover_tree')
# now run kBET with the pre-defined nearest neighbours
batch.estimate <- kBET(data, batch, k = k0, knn = knn, heuristic = FALSE)

I suggest that you run the knn-graph computation separately and check the runtime of kBET without running knn.

Best regards, Maren

mbuttner avatar Jan 17 '20 09:01 mbuttner

Hi, I am trying kBET on a 24279*14965 matrix with two batches. However, it is stuck at

Initial neighbourhood size is set to 5611. reducing dimensions with svd first...

I also tried running knn separately, which didn't get through either.

May I ask if there is any update on this issue? Please let me know if you have any suggestions for large datasets. Thank you very much!

linzhangTuesday avatar Oct 18 '20 23:10 linzhangTuesday

Hi, if this is still of interest: I recommend reducing the number of features (genes) by computing a PCA/SVD first and running the knn search on the low-dimensional representation instead.

Best, Maren

mbuttner avatar Nov 11 '22 13:11 mbuttner
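
The suggestion in the last reply can be sketched roughly as follows. This is only an illustration on synthetic data, not the maintainer's code: the matrix sizes, the number of principal components (50), and the neighbourhood size (25) are arbitrary assumptions chosen so the example runs quickly. It uses base R's prcomp for the reduction and FNN's get.knn, the same function used earlier in the thread, for the neighbour search.

```r
library(FNN)  # provides get.knn(), as used earlier in this thread

set.seed(1)

# Synthetic stand-in for an expression matrix (rows: cells, columns: genes).
# These sizes are assumptions for the demo, not the data from the thread.
n_cells <- 300
n_genes <- 200
data <- matrix(rnorm(n_cells * n_genes), nrow = n_cells, ncol = n_genes)
batch <- factor(rep(c("a", "b"), length.out = n_cells))

# Reduce the gene dimension to a few principal components.
# 50 components is an arbitrary choice for illustration.
n_pcs <- 50
pca <- prcomp(data, center = TRUE, scale. = FALSE, rank. = n_pcs)
pcs <- pca$x  # n_cells x n_pcs matrix of cell embeddings

# Run the knn search on the low-dimensional embedding and time it.
k0 <- 25  # hypothetical neighbourhood size, chosen only for this demo
knn_time <- system.time(
  knn <- get.knn(pcs, k = k0, algorithm = "cover_tree")
)
print(knn_time)
print(dim(knn$nn.index))  # one row per cell, one column per neighbour

# The result could then be passed on in the same way as in the earlier reply
# (left commented out here; it mirrors that call and is not verified):
# batch.estimate <- kBET(pcs, batch, k = k0, knn = knn, heuristic = FALSE)
```

For a matrix as large as 24279*14965, a full prcomp call may itself take a while, so a truncated SVD implementation might be the better choice there; that is a guess that would need checking on the actual data.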