Run time estimation?

Open LYC-vio opened this issue 1 year ago • 4 comments

Hi

Sorry for submitting a bunch of new issues at once. I'm curious about the running time SVDSS needs to index and search datasets of different sizes, i.e., how much the input size affects the time cost. For example, I ran the index step on a 3G reference genome with 10 threads and it took around 40 min; how much time would I need to index ~200G of short-read data?

I've read the corresponding SVDSS paper but did not find evaluations about the time cost, sorry if I missed something.

Thank you

LYC-vio avatar Jul 23 '23 12:07 LYC-vio

Hi, unfortunately I don't have any statistics on this. I'd say that the time complexity should be linear, but you never know until you try it 😄 I could run some tests and get back to you, or maybe you can check the numbers on this table (ropebwt2 rows).

You may also find more information in our previous paper: in the supplementary material I see that it needed 20 hours to index a ~30x PacBio HiFi sample.

Just a note: the number of threads used by the indexing step is fixed at 4 and cannot be changed. Moreover, when you index a reference genome (so a limited number of entries in the .fasta), I think that the current version uses a single thread (indeed, you should see a warning like "Turn off parallelization for this batch as too few strings are left.").

ldenti avatar Jul 24 '23 07:07 ldenti

Thank you!

Do you mean that --threads does not change the actual number of threads SVDSS uses for the index step? Or are you referring to the index thread settings in PingPong?

May I also ask why the thread number is fixed at 4 for the index step? Is that due to a memory limit or something else?

> you can check the numbers on this table (ropebwt2 rows).

Thanks! I'll check it out

Really appreciate your timely responses

LYC-vio avatar Jul 24 '23 16:07 LYC-vio

It is somehow related to the ropebwt2 implementation we use. I tried to dig into that some time ago and found that it was using 4 additional threads (check here: https://github.com/lh3/ropebwt2/blob/bd8dbd3db2e9e3cff74acc2907c0742c9ebbf033/mrope.c#L287). I don't recall the details now, but it was something like one thread per nucleotide. Take this with a grain of salt, though.

ldenti avatar Jul 24 '23 16:07 ldenti

I don't know if this helps, but I just simulated some samples of 100bp-long reads, and these are the results:

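| #reads     | File Size | Time (s) | RAM   | Index Size |
|------------|-----------|----------|-------|------------|
| 131 072    | 32M       | 2        | 65M   | 13M        |
| 524 288    | 125M      | 8        | 239M  | 44M        |
| 1 048 576  | 249M      | 15       | 454M  | 81M        |
| 4 194 304  | 998M      | 55       | 1.6G  | 244M       |
| 8 388 608  | 2.0G      | 120      | 3.1G  | 409M       |
| 33 554 432 | 7.9G      | 586      | 12.1G | 1.4G       |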

Growth seems almost linear but I don't know if we can fully trust these results 😃
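One rough way to check the "almost linear" impression is to fit a power law t = a * n^b to the read counts and times reported above; an exponent b close to 1 would be consistent with linear growth. The sketch below is only an illustration: the helper names (`fit_power_law`, `predict_seconds`) are hypothetical and not part of SVDSS, and any extrapolation assumes that simulated 100bp reads behave like real data, which may well not hold.

```python
import math

# (number of reads, indexing time in seconds), copied from the results above
measurements = [
    (131_072, 2),
    (524_288, 8),
    (1_048_576, 15),
    (4_194_304, 55),
    (8_388_608, 120),
    (33_554_432, 586),
]


def fit_power_law(points):
    """Least-squares fit of log(t) = log(a) + b * log(n); returns (a, b)."""
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(t) for _, t in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Slope of the regression line in log-log space is the exponent b
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    a = math.exp(mean_y - b * mean_x)
    return a, b


def predict_seconds(n_reads, a, b):
    """Hypothetical helper: predicted time under the fitted model (an assumption)."""
    return a * n_reads ** b


a, b = fit_power_law(measurements)
print(f"fitted exponent: {b:.2f}")
```

Calling `predict_seconds` with a larger read count gives an extrapolated time, but with only six simulated data points this is a ballpark figure at best, not a benchmark.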

Please let me know how long it took once you have the index (if you have that info).

ldenti avatar Jul 25 '23 14:07 ldenti