scdiff
SCDIFF continues to run for >24 hours
Hello,
I'm running a dataset of about 35,000 cells and 10,000 genes, with a pre-defined number of clusters. I've turned on the "largetype" feature, but have left the "cell synchronization" feature on.
It's been about 24 hours, and the algorithm is still running. The output appears to be stuck on "connecting nodes ....". Is the algorithm stuck, or do I need to wait longer?
Thanks, Paraish
@paraish
Thank you for reporting this issue.
For large datasets, yes, please use the -l 1 -s 1 options to speed things up.
With ~35k cells and 10k genes, I would expect a long run time (2-3 days).
(Make sure that you used the -l 1 -s 1 options for a less stringent convergence criterion.)
Currently, scdiff does not support multi-threading and thus needs a relatively long running time.
Consider waiting another day or so. If you still haven't got the results by next Monday, please let me know. You can tell whether the program is running by checking that the process is active (e.g. using the htop command).
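If htop is not convenient (for example on a remote job), the same "is the process still active" check can be scripted. Below is a minimal sketch using only the Python standard library on a POSIX system; the function name `is_process_running` is made up for illustration and is not part of scdiff. It only confirms that the process exists, not that it is making progress, so CPU usage in htop remains the better signal.

```python
import os


def is_process_running(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only).

    Hypothetical helper for illustration; not part of scdiff.
    """
    if pid <= 0:
        raise ValueError("pid must be a positive integer")
    try:
        # Signal 0 delivers nothing; it only checks existence and permissions.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # process exists but is owned by another user
    return True


# Usage: pass the PID of the scdiff run (shown in htop's PID column).
# Here the current interpreter's own PID is used as a stand-in.
print(is_process_running(os.getpid()))
```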
Thank you for your prompt reply - I will keep you posted.
@paraish
Thanks. I have started a test on a 40k-cell dataset (with 10k genes) on my side as well.
Let's see how long it takes. If it takes too long, I will spend some time upgrading the software in the coming weeks.
Thanks Jun - it's still running after 96 hours. Every so often it continues to display "connecting nodes".
Cheers, Paraish
@paraish Thanks for the update! I guess the tool really needs an upgrade to handle large single-cell datasets. I will work on a large-dataset extension this week and will get back to you by this weekend. Sorry for the inconvenience.
No problem - and thank you for your reply and for maintaining support for this tool!
@paraish We have been developing an updated version of scdiff, scdiff2.0, which uses HDF5, sparse matrices, and multi-threading to dramatically improve running efficiency while reducing memory usage (16 GB of RAM should be enough for most cases). It is based on a PGM strategy similar to that of scdiff1. We have already tested the new software on some datasets: it produces similar, if not better, results and is over 100x faster than the original scdiff. For example, for a single-cell dataset (40k cells, 10k genes), you should be able to get the results within 1-2 hours, if not less (with the --ncores 10 --maxloop 0 parameters). It is not officially released yet. Since you have requested a faster version, I packaged all the code and shared it with you via GitHub. Please check the scdiff2 page (https://github.com/phoenixding/scdiff2) for details. I have provided a quick example to showcase the usage. Please let me know if you have any further questions.
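To give a rough sense of why sparse storage reduces memory for a matrix of this size, here is a back-of-the-envelope sketch. The 10% non-zero fraction, the 8-byte values, and the 4-byte indices are assumptions chosen for illustration, not measurements from scdiff2, and the CSR layout is used only as a common example of a sparse format; the thread does not say which layout scdiff2 uses.

```python
def dense_bytes(n_cells: int, n_genes: int, itemsize: int = 8) -> int:
    """Memory for a dense cells x genes matrix of fixed-size values."""
    return n_cells * n_genes * itemsize


def csr_bytes(n_cells: int, n_genes: int, density: float,
              itemsize: int = 8, index_size: int = 4) -> int:
    """Approximate memory for the same matrix in CSR form.

    CSR stores one value and one column index per non-zero entry,
    plus one row pointer per row (and one extra at the end).
    """
    nnz = round(n_cells * n_genes * density)
    return nnz * (itemsize + index_size) + (n_cells + 1) * index_size


# Dataset size from the thread: 40k cells, 10k genes.
# ASSUMPTION: 10% of entries are non-zero (illustrative only).
dense = dense_bytes(40_000, 10_000)
sparse = csr_bytes(40_000, 10_000, density=0.10)
print(f"dense:  {dense / 1e9:.2f} GB")
print(f"sparse: {sparse / 1e9:.2f} GB")
```

The saving scales with how sparse the expression matrix actually is, so the real figure for a given dataset may differ substantially from this estimate.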
Thank you very much Jun! Much appreciated.
Best, Paraish
@paraish I improved the multi-threading and memory usage today and pushed the changes earlier tonight. Please update your scdiff2 if you got it before that. With --maxloop 0 --ncores 10, you should be able to get the results in less than 10 minutes. I am working on some documentation now and will push again later tonight.
@paraish Please do let me know if this solves your issue. Feel free to contact me if there are other problems.
Thanks Jun - I will keep you posted.
Paraish