gapseq
Batch tblastn?
Hi again!
I noticed while doing some testing of gapseq that it seems to run tblastn iteratively over the reference set. Would it be possible to change it so that your concatenated reference sequences are all blasted against the genome at once? I think you can supply a thread count to tblastn, so it would map faster (at least if the user specifies more than one thread). I think the mapping step could be sped up a lot this way.
If you are still planning to switch over to protein-protein mapping for identifying reactions present in the model though, this approach wouldn't be necessary.
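To make the batching idea above concrete, here is a minimal Python sketch that concatenates per-reaction reference FASTA files into one query and builds a single multi-threaded tblastn call. It is an illustration only, not how gapseq is implemented: the helper names and file paths are hypothetical. It also assumes the genome has already been indexed with `makeblastdb -dbtype nucl`, because `-num_threads` only takes effect for database searches (`-db`), not with `-subject`.

```python
from pathlib import Path


def concat_fasta(fasta_paths, out_path):
    """Concatenate several FASTA files into one query file (hypothetical helper)."""
    with open(out_path, "w") as out:
        for path in fasta_paths:
            text = Path(path).read_text()
            # Make sure the next record's header starts on its own line.
            if text and not text.endswith("\n"):
                text += "\n"
            out.write(text)
    return out_path


def tblastn_command(query_fasta, genome_db, threads=1, out_path="hits.tsv"):
    """Build one tblastn call that searches all reference proteins at once."""
    return [
        "tblastn",
        "-query", str(query_fasta),
        "-db", str(genome_db),           # nucleotide BLAST database of the genome
        "-num_threads", str(threads),
        "-outfmt", "6",                  # tabular output, one hit per line
        "-out", str(out_path),
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; the tabular hits would then need to be split back per reference sequence by the query ID in the first column.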
Hi @jdwinkler-lanzatech, thanks for your suggestions! You are right, and something like this (blasting everything at once and/or using diamond) will be part of an upcoming release! Until then, a possible workaround is to run multiple gapseq instances at the same time. This scales quite well and, depending on the available cores, can produce a lot of models on a cluster in a few hours.
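The workaround of running several instances side by side could look roughly like the Python sketch below. The function names are hypothetical, and the invocation `gapseq find -p all <genome>` is an assumption about your setup, so adjust the command to whatever you normally run.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def gapseq_find_command(genome, gapseq="gapseq"):
    # Assumed invocation: "gapseq find -p all <genome>"; adapt as needed.
    return [gapseq, "find", "-p", "all", str(genome)]


def run_many(genomes, max_parallel=4, runner=subprocess.run):
    """Run one gapseq process per genome, at most max_parallel at a time."""
    commands = [gapseq_find_command(g) for g in genomes]
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # The threads only wait on child processes; the work happens in gapseq.
        return list(pool.map(runner, commands))
```

For example, `run_many(["a.fna", "b.fna"], max_parallel=2)` would start two gapseq processes concurrently. On a cluster, a job array with one genome per task achieves the same thing.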
Yep! I'm set up to run instances in parallel; this was just something I noticed while watching the initial testing.
Hi,
I was wondering something along the same lines, and whether there was a way to specify an increased number of threads to speed up the process.
Thank you!
In the current development version, you can specify the number of cores using the "-K" option in the modules gapseq find and gapseq find-transport. Also, supplying a protein fasta instead of a nucleotide fasta genome can speed up the process.
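As a small illustration of the "-K" option mentioned above, the Python sketch below builds the two module calls with a core count. Only the "-K" option and the module names come from this thread; the helper name, the `-p all` argument, and the argument order are assumptions, so check `gapseq find -h` in your version.

```python
def gapseq_threaded_commands(genome, cores, gapseq="gapseq"):
    """Build 'find' and 'find-transport' calls that pass -K <cores>."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    k = ["-K", str(cores)]
    return [
        # "-p all" (search all pathways) is an assumed argument.
        [gapseq, "find", "-p", "all", *k, str(genome)],
        [gapseq, "find-transport", *k, str(genome)],
    ]
```

Each returned list can be handed to `subprocess.run(cmd, check=True)`.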