Idea for potential runtime improvements
Hi Benjamin, thanks a lot for developing and maintaining diamond.
By monitoring tons of Diamond runs from Bakta (2k <= query size <= 10k, db size ≈ 90,000,000, >=90% identity, >=80% query/subject coverage, --fast), I saw that a significant proportion of the overall runtime is spent sequentially reading and processing DB chunks single-threaded, which is of course due to diamond's workflow/algorithm.
But would it be possible to constantly preload the following db chunk X+1 concurrently with the multithreaded alignment of chunk X? Couldn't this reduce the interspersed single-core parts of the overall workflow and thus reduce the overall runtime for cases with many DB chunks? For example, this might be implemented either with a dedicated thread spared for DB loading and N-1 threads for the actual alignments, or in a competitive concurrent manner where alignment threads and DB-loading threads compete for CPU.
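To make the idea concrete, here is a minimal double-buffering sketch: while chunk X is being aligned, chunk X+1 is already loading on a background task. All names (`load_chunk`, `align`, `run`) are hypothetical stand-ins for diamond's internals, not its actual API.

```cpp
#include <cassert>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical stand-ins: load_chunk simulates the single-threaded DB read,
// align simulates the multithreaded alignment pass over one chunk.
using Chunk = std::vector<int>;

Chunk load_chunk(int index) {
    // Placeholder: pretend each chunk holds four values derived from its index.
    return Chunk{index, index + 1, index + 2, index + 3};
}

long align(const Chunk& c) {
    // Placeholder work: just sum the chunk.
    return std::accumulate(c.begin(), c.end(), 0L);
}

long run(int n_chunks) {
    long total = 0;
    // Kick off the load of chunk 0 before entering the main loop.
    std::future<Chunk> next = std::async(std::launch::async, load_chunk, 0);
    for (int i = 0; i < n_chunks; ++i) {
        Chunk current = next.get();  // wait for chunk i to finish loading
        if (i + 1 < n_chunks)        // prefetch chunk i+1 in the background
            next = std::async(std::launch::async, load_chunk, i + 1);
        total += align(current);     // alignment overlaps the next load
    }
    return total;
}
```

With this pattern, the only non-overlapped load is the very first one; every later chunk's I/O hides behind the previous chunk's alignment, assuming alignment takes longer than loading.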
I know that even though this might sound trivial at first glance, it could actually be pretty hard or annoying or maybe even impossible to implement. So please feel free to just discard this if it is not appropriate, not feasible, or too demanding in terms of required code amendments. As I can't tell, I just wanted to share the idea and leave it to your judgement.
Thanks again and best regards! Oliver
Hi Oliver, this could very well improve the runtime in such cases and would not be too difficult to implement. However, if you do a lot of these small query runs, the most efficient thing to do would be to combine many of these files into a larger query file. This would eliminate a lot more overhead than just the loading of the db. Is there anything standing in the way of this approach?
Thanks for the quick response. I'm glad this might actually work and could be done!
Sorry if I wasn't clear enough on our use cases. With "tons of Diamond runs" I mean independent runs that I monitored during the development/debugging of Bakta. So these runs belong to independent annotations (Bakta executions) of bacterial genomes. Therefore, query merging is unfortunately not an option.
Ok, it should be possible to include this in a future release. Note that it may also help to use a BLAST database, as it takes advantage of memory mapping. Loading the database sequences should be considerably faster (after some "warmup" runs to cache the database) than when using a .dmnd file.
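The memory-mapping point above can be illustrated with a small POSIX sketch: instead of copying the file into a buffer with `read()`, `mmap` lets the kernel page data in lazily and keep it in the page cache, so repeat runs find it already resident. The function name `read_mapped` is purely illustrative and not part of diamond or BLAST.

```cpp
#include <cassert>
#include <fcntl.h>
#include <string>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical illustration of why a memory-mapped database "loads" quickly:
// the bytes stay in the kernel page cache across runs, so after a warmup run
// there is no copy into process memory, only page-table setup.
std::string read_mapped(const char* path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return {};
    off_t size = lseek(fd, 0, SEEK_END);  // file size; mmap ignores the offset
    void* p = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after close
    if (p == MAP_FAILED) return {};
    std::string out(static_cast<const char*>(p), size);
    munmap(p, size);
    return out;
}
```

This is POSIX-only; the same idea applies to the BLAST database files, which are split into many small volumes that each get mapped this way.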
Wonderful - I'm very much looking forward to this!
Although the highly fragmented nature of BLAST DBs is not ideal for management (distribution, file checks, etc.) in our setup, I'll definitely give it a try. Thanks a lot for the hint and for considering this!
Hello @bbuchfink, just out of curiosity: how is the parallel loading of db chunks going? We're very much looking forward to this. Best regards
Sorry, I haven't found the time yet to work on this. But thanks for reminding me, I'll see what I can do.
Thanks for all the effort you put into this and taking this on. It's very much appreciated!