gapseq

Batch tblastn?

Open jdwinkler-lanzatech opened this issue 3 years ago • 3 comments

Hi again!

I noticed while testing gapseq that it seems to run tblastn iteratively over the reference set. Would it be possible to change it so that your concatenated reference sequences are all blasted against the genome at once? I think you can supply a thread count to tblastn, so it would map faster (at least if the user specifies a thread count >1). I think the mapping step could be sped up a lot this way.

If you are still planning to switch over to protein-protein mapping for identifying reactions present in the model though, this approach wouldn't be necessary.
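To make the batching idea concrete, here is a minimal sketch of what a single batched search could look like. It is not how gapseq calls BLAST internally; it only assumes the standard BLAST+ tools `makeblastdb` and `tblastn` are on the PATH. The function name and all file names are hypothetical. The genome is indexed as a database first because, as far as I know, tblastn only makes use of `-num_threads` when searching with `-db`.

```python
import subprocess


def build_batch_tblastn(queries_faa, genome_fna, db_prefix, out_tsv, threads=1):
    """Return the two commands for one batched tblastn run (nothing is executed here)."""
    # Index the genome once so that tblastn can search it via -db.
    make_db = [
        "makeblastdb",
        "-in", str(genome_fna),
        "-dbtype", "nucl",
        "-out", str(db_prefix),
    ]
    # A single tblastn call for the whole concatenated protein reference set,
    # writing tabular output (format 6) to out_tsv.
    search = [
        "tblastn",
        "-query", str(queries_faa),
        "-db", str(db_prefix),
        "-num_threads", str(threads),
        "-outfmt", "6",
        "-out", str(out_tsv),
    ]
    return make_db, search


def run_batch_tblastn(*args, **kwargs):
    """Execute both commands in order; raises CalledProcessError if one fails."""
    for cmd in build_batch_tblastn(*args, **kwargs):
        subprocess.run(cmd, check=True)
```

For example, `run_batch_tblastn("refs.faa", "genome.fna", "genome_db", "hits.tsv", threads=8)` would replace many small tblastn calls with one multi-threaded call.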

jdwinkler-lanzatech, Mar 05 '21 14:03

Hi @jdwinkler-lanzatech, thanks for your suggestions! You are right, and something like this (blasting everything at once and/or using diamond) will be part of an upcoming release! Until then, a potential approach could be to have multiple gapseq instances running at the same time. This scales quite well and, depending on the available cores, can produce a lot of models on a cluster in a few hours.
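As a rough illustration of the "multiple instances" workaround, the sketch below starts one process per genome and caps how many run at once. The helper names are hypothetical, and the `gapseq find -p all <genome>` invocation is an assumption; swap in whatever gapseq command you actually run. On a cluster you would more likely use the scheduler's job arrays instead.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def gapseq_find_cmd(genome):
    # Assumed invocation; adjust to the gapseq module and options you use.
    return ["gapseq", "find", "-p", "all", str(genome)]


def run_instances(genomes, max_workers=4, build_cmd=gapseq_find_cmd, runner=None):
    """Run one process per genome, at most max_workers at a time.

    Returns a dict mapping each genome to the exit code of its process.
    """
    if runner is None:
        def runner(cmd):
            return subprocess.run(cmd).returncode

    genomes = list(genomes)
    # Threads are enough here: each one only waits for its child process.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        codes = list(pool.map(lambda g: runner(build_cmd(g)), genomes))
    return dict(zip(genomes, codes))
```

Calling `run_instances(["a.fna", "b.fna", "c.fna"], max_workers=3)` would process three genomes side by side, each in its own single-threaded run.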

jotech, Mar 08 '21 15:03

Yep! I'm set up to run instances in parallel, but this was just a thought I had while watching the initial testing.

jdwinkler-lanzatech, Mar 16 '21 20:03

Hi,

I was wondering something along the same lines, and whether there was a way to specify an increased number of threads to speed up the process.

Thank you!

susheelbhanu, Apr 28 '21 08:04

In the current development version, you can specify the number of cores using the "-K" option in the modules gapseq find and gapseq find-transport. Also, supplying a protein fasta instead of a nucleotide fasta genome can speed up the process.
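For anyone scripting this, a small sketch of how the "-K" option might be passed when building the command line. Only the "-K" option and the two module names come from the comment above; the helper name, the argument order, and the `-p all` value in the usage note are assumptions, so check them against `gapseq find -h` for your version.

```python
def gapseq_cmd(module, genome, cores=1, extra=()):
    """Build a gapseq command line that sets the core count with -K."""
    # Only the two modules named above are accepted by this sketch.
    if module not in ("find", "find-transport"):
        raise ValueError(f"unexpected module for this sketch: {module}")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    # Assumed layout: options first, input FASTA last.
    return ["gapseq", module, "-K", str(cores), *extra, str(genome)]
```

For instance, `gapseq_cmd("find", "genome.faa", cores=8, extra=("-p", "all"))` gives a list that can be handed to `subprocess.run`; using a protein FASTA as input follows the tip above.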

Waschina, Oct 23 '22 07:10