spaln Question on job parallelization

Hello Osamu

First off, thanks a lot for the great program and continued support/enhancement on this.

I have a question about job parallelization. Assume I have a protein query file of 15K sequences and I try two approaches:

Approach 1 - all 15K query sequences in 1 file - goes into Job 1a followed by sortgrcd to get gff3 files
Approach 2 - split the 15K sequences into 3 files, each file containing 5K sequences - goes into Job 2a, 2b, 2c followed by sortgrcd
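For readers who want to reproduce the split in Approach 2, here is a minimal sketch, assuming the queries are in a plain FASTA file. The function name `split_fasta` and the output names `Query_1.fa`, `Query_2.fa`, `Query_3.fa` are hypothetical and not part of spaln. Records are dealt out round-robin rather than in contiguous blocks of 5K, which keeps the chunk sizes balanced and never cuts a sequence in two.

```shell
#!/bin/sh
# Hypothetical helper, not part of spaln: split a FASTA file into N chunks.
# Records are assigned round-robin, so chunk sizes differ by at most one record.
# Usage: split_fasta INPUT N PREFIX   -> writes PREFIX_1.fa ... PREFIX_N.fa
split_fasta() {
    awk -v n="$2" -v prefix="$3" '
        # A header line starts a new record; pick the chunk file for it.
        /^>/ { rec++; out = prefix "_" ((rec - 1) % n + 1) ".fa" }
        # Write header and sequence lines to the current chunk file.
        out != "" { print > out }
    ' "$1"
}
```

Calling `split_fasta Query.fa 3 Query` would then give three query files, one for each of Jobs 2a, 2b and 2c.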

In both cases, spaln was called appropriately after formatting the database:

spaln -t20 -Q7 -O12 -M1 [other options] -dDatabase Query for Job 1a
spaln -t20 -Q7 -O12 -M1 [other options] -dDatabase Query_1[2,3] for Jobs 2a, 2b and 2c, each with its own query file

The question: Will there be any major differences between the two outputs?

Output of Approach 1 - sortgrcd -P40 -C50 -O0 Query.grd > spaln_single_job.gff3
Output of Approach 2 - sortgrcd -P40 -C50 -O0 Query_1.grd Query_2.grd Query_3.grd > spaln_multi_job.gff3
(This is run after ensuring that all the relevant *.{erd, qrd} files are in the same directory, and that the database's *.{ent, idx, grp, seq} files are also present in the directory where the sortgrcd job runs.)

I looked through both outputs in many different ways and could not find any differences. I am going to put a pipeline into production, so I wanted to ask whether there are any specific caveats I should be aware of if I use Approach 2.
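One way to make such a comparison repeatable is sketched below. It treats two GFF3 files as equal when their feature lines match after dropping comment lines (those starting with `#`) and sorting. The function names `gff3_features` and `same_gff3` are hypothetical. This is a literal line comparison, so if sortgrcd numbered feature IDs differently between the two runs it would report a mismatch even for identical alignments.

```shell
#!/bin/sh
# Hypothetical helpers, not part of spaln or sortgrcd.

# Print the feature lines of a GFF3 file in a stable, locale-independent order.
gff3_features() {
    grep -v '^#' "$1" | LC_ALL=C sort
}

# Usage: same_gff3 A.gff3 B.gff3
# Exit status 0 if the sorted feature lines are identical, 1 if they differ.
same_gff3() {
    tmp_a=$(mktemp) && tmp_b=$(mktemp) || return 2
    gff3_features "$1" > "$tmp_a"
    gff3_features "$2" > "$tmp_b"
    if cmp -s "$tmp_a" "$tmp_b"; then status=0; else status=1; fi
    rm -f "$tmp_a" "$tmp_b"
    return "$status"
}
```

For example, `same_gff3 spaln_single_job.gff3 spaln_multi_job.gff3 && echo identical` prints `identical` only when the two files contain the same set of feature lines.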

Thanks in advance,

Aug 12 '21 02:08 tamuanand

Thank you for your interest in Spaln. Frankly speaking, I have not used spaln and sortgrcd in the way you describe since spaln began supporting multi-threaded operation; in my environment, I cannot easily use cluster machines. So you probably know better than I do how spaln and sortgrcd perform together in a multi-machine environment.

However, please wait a few days before you start your large-scale calculation. I have found a few bugs that can cause segmentation faults (see issue #41) in rare situations. I have fixed them and am now testing the modified version on real data. I will announce it here when I release the fixed version.

Osamu,

Aug 16 '21 09:08 ogotoh

Thanks a lot Osamu.

I would like to wait for your new/modified version of spaln.

Aug 17 '21 01:08 tamuanand

Although it took an unexpectedly long time, I have finished modifying spaln. Tested on more than 100 pairs of genomic and assembled transcript DNA sequences from the DDBJ database, covering various levels of sequence similarity, the new version (Ver.2.4.6) runs without segmentation faults. Protein queries have not been tested in the same detail, but the program works fine on a few examples. I therefore did not want to delay the release of this version any further.

I thank you for your patience. If you encounter any problems with this or previous versions of spaln, please let me know at your convenience.

Osamu,

Sep 14 '21 02:09 ogotoh
