
Idea for potential runtime improvements

Open oschwengers opened this issue 2 years ago • 7 comments

Hi Benjamin, thanks a lot for developing and maintaining diamond.

By monitoring tons of Diamond runs from Bakta (2k<=query size<=10k, db size=~90000000, >=90% identity, >=80% query/subject coverage, --fast), I saw that a significant proportion of the overall runtime is spent sequentially reading and processing DB chunks single-threaded, which is of course due to diamond's workflow/algorithm.

But would it be possible to constantly preload the next DB chunk X+1 concurrently with the multithreaded alignment of chunk X? Couldn't this reduce the interspersed single-core parts of the overall workflow and thus reduce the overall runtime for cases with many DB chunks? For example, this might be implemented either with a dedicated thread for DB loading and N-1 threads for the actual alignments, or in a competitive concurrent manner where alignment threads and DB-loading threads compete for CPU.
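To make the idea concrete, here is a minimal double-buffering sketch: a loader thread fetches chunk i+1 while chunk i is being processed. All names (`load_chunk`, `align_chunk`, `run_pipeline`) are illustrative stand-ins, not DIAMOND's actual internals.

```cpp
#include <future>
#include <numeric>
#include <vector>

using Chunk = std::vector<int>;

// Stand-in for the single-threaded read of one database chunk from disk.
Chunk load_chunk(int index) {
    return Chunk(4, index);  // pretend chunk i holds four records of value i
}

// Stand-in for the multithreaded alignment pass over one chunk.
long align_chunk(const Chunk& c) {
    return std::accumulate(c.begin(), c.end(), 0L);
}

long run_pipeline(int n_chunks) {
    long total = 0;
    // Load the first chunk up front.
    std::future<Chunk> next = std::async(std::launch::async, load_chunk, 0);
    for (int i = 0; i < n_chunks; ++i) {
        Chunk current = next.get();           // wait for chunk i to arrive
        if (i + 1 < n_chunks)                 // start loading chunk i+1 now,
            next = std::async(std::launch::async, load_chunk, i + 1);
        total += align_chunk(current);        // ...overlapping with this work
    }
    return total;
}
```

With this overlap, the single-threaded load of each chunk after the first is hidden behind the alignment of the previous chunk, which is exactly the gap visible in the monitored runs.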

I know that even though this might sound trivial in the first place, it could actually be pretty hard or annoying or maybe even impossible to implement. So please feel free to just discard this if it is not appropriate, feasible, or too demanding in terms of required code amendments. As I can't tell, I just wanted to share the idea and leave this to your judgement.

Thanks again and best regards! Oliver

oschwengers avatar Jan 24 '22 15:01 oschwengers

Hi Oliver, this could very well improve the runtime in such cases and would not be too difficult to implement. However, if you do a lot of these small query runs, the most efficient thing to do would be to combine many of these files into a larger query file. This would eliminate a lot more overhead than just the loading of the db. Is there anything standing in the way of this approach?

bbuchfink avatar Jan 25 '22 09:01 bbuchfink

Thanks for the quick response. I'm glad this might actually work and could be done!

Sorry if I wasn't clear enough on our use cases. With "tons of Diamond runs" I mean independent runs that I monitored during the development/debugging of Bakta. So these runs belong to independent annotations (Bakta executions) of bacterial genomes. Therefore, query merging is unfortunately not an option.

oschwengers avatar Jan 25 '22 09:01 oschwengers

Ok, it should be possible to include this in a future release. Note that it may also help to use a BLAST database, as it takes advantage of memory mapping. Loading in the database sequences should be considerably faster (after some "warmup" runs to cache the database) than when using a .dmnd file.
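For illustration of why memory mapping helps here: the OS page cache keeps mapped pages resident across processes, so after a few "warmup" runs, repeated reads of the same database file avoid copying data through `read()` buffers. A minimal POSIX sketch (the file path and checksum function are hypothetical, not DIAMOND code):

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Byte-sums a file through a read-only memory mapping; returns -1 on error.
long checksum_mapped(const char* path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return -1; }
    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping remains valid after closing the descriptor
    if (p == MAP_FAILED) return -1;
    long sum = 0;
    const unsigned char* bytes = static_cast<const unsigned char*>(p);
    for (off_t i = 0; i < st.st_size; ++i)
        sum += bytes[i];  // page faults hit the shared page cache, not read()
    munmap(p, st.st_size);
    return sum;
}
```

On a warm cache, the kernel serves these accesses straight from already-resident pages, which is the effect described above for BLAST databases.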

bbuchfink avatar Jan 26 '22 12:01 bbuchfink

Wonderful - I'm very much looking forward to this!

Although the highly fragmented nature of BLAST DBs is not ideal for management (distribution, file checks, etc.) in our setup, I'll definitely give it a try. Thanks a lot for the hint and for considering this!

oschwengers avatar Jan 26 '22 12:01 oschwengers

Hello @bbuchfink, just out of curiosity: How is the parallel loading of db chunks going? We're very much looking forward to this. Best regards

oschwengers avatar May 24 '22 15:05 oschwengers

Sorry I didn't find the time yet to work on this. But thanks for reminding me, I'll see what I can do.

bbuchfink avatar May 25 '22 14:05 bbuchfink

Thanks for all the effort you put into this and for taking this on. It's very much appreciated!

oschwengers avatar Jul 07 '22 09:07 oschwengers