Question on job parallelization

Open tamuanand opened this issue 3 years ago • 3 comments

Hello Osamu

First off, thanks a lot for the great program and continued support/enhancement on this.

I have a question on job parallelization. Assume I have a protein query file of 15K sequences and I try 2 approaches:

  • Approach 1 - all 15K query sequences in 1 file - goes into Job 1a followed by sortgrcd to get gff3 files
  • Approach 2 - split the 15K sequences into 3 files, each file containing 5K sequences - goes into Job 2a, 2b, 2c followed by sortgrcd
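The split in Approach 2 can be sketched as below. This is only a minimal illustration, assuming the query file is plain FASTA with one `>` header per record; the helper name `split_fasta` is hypothetical and is not part of spaln.

```python
from typing import List


def split_fasta(text: str, n_parts: int) -> List[str]:
    """Split FASTA text into up to n_parts chunks of whole records, keeping order."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")

    # Collect records; each record is a header line plus its sequence lines.
    records: List[str] = []
    for line in text.splitlines():
        if line.startswith(">"):
            records.append(line + "\n")
        elif records and line.strip():
            records[-1] += line + "\n"

    # Balanced split: the first `extra` chunks get one more record than the rest.
    base, extra = divmod(len(records), n_parts)
    chunks: List[str] = []
    start = 0
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)
        if size:  # skip empty chunks when there are fewer records than parts
            chunks.append("".join(records[start:start + size]))
        start += size
    return chunks
```

Each returned chunk could then be written to its own file (for example Query_1, Query_2, Query_3) before the three jobs are started. No record is ever divided between two chunks, so every query sequence is aligned exactly once.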

In both cases, spaln was called appropriately after formatting the database:

  • spaln -t20 -Q7 -O12 -M1 [other options] -dDatabase Query for Job 1a
  • spaln -t20 -Q7 -O12 -M1 [other options] -dDatabase Query_1[2,3] for Job 2a, Job 2b, Job 2c, each with its own query file

The question: Will there be any major differences between the 2 outputs?

  • Output of Approach 1 - sortgrcd -P40 -C50 -O0 Query.grd > spaln_single_job.gff3
  • Output of Approach 2 - sortgrcd -P40 -C50 -O0 Query_1.grd Query_2.grd Query_3.grd > spaln_multi_job.gff3 -- this is done after ensuring that all the relevant *.{erd, qrd} files, as well as the *.{ent, idx, grp, seq} files of the database, are present in the directory where the sortgrcd job is running

I did look through both outputs in many different ways and could not find any differences. I am going to productionize a pipeline, and I felt I should ask you whether there are any specific caveats I should be aware of if I use Approach 2.
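One way to make such a comparison repeatable is an order-independent check of the two gff3 files. The sketch below is only an illustration under stated assumptions: it treats every non-comment line as an opaque string, so it reports a difference if any field, including an ID attribute, changes between runs. The function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def feature_lines(lines: Iterable[str]) -> Counter:
    """Count non-empty, non-comment lines, ignoring the order they appear in."""
    return Counter(
        line.rstrip("\n")
        for line in lines
        if line.strip() and not line.startswith("#")
    )


def gff3_diff(a: Iterable[str], b: Iterable[str]) -> Tuple[Counter, Counter]:
    """Return (lines only in a, lines only in b), compared as multisets."""
    count_a, count_b = feature_lines(a), feature_lines(b)
    return count_a - count_b, count_b - count_a
```

For the two outputs above, this could be used by passing open file handles for spaln_single_job.gff3 and spaln_multi_job.gff3; two empty counters mean the files contain the same feature lines, possibly in a different order.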

Thanks in advance,

tamuanand, Aug 12 '21 02:08

Thank you for your interest in Spaln. Frankly speaking, I have not used spaln and sortgrcd in the way that you suggest since spaln began supporting multi-threaded operation; in my environment, I cannot easily use cluster machines. So you probably know better than I do about the performance of the combined use of spaln and sortgrcd in multi-machine environments.

However, please wait a few days before you start your large-scale calculation. I have found a few bugs that can cause segmentation faults (see issue #41) in rare situations. I have fixed them and am now testing the modified version on real data. I will announce it here when I release the fixed version.

Osamu,

ogotoh, Aug 16 '21 09:08

Thanks a lot Osamu.

I would like to wait for your new/modified version of spaln.

tamuanand, Aug 17 '21 01:08

Although it took an unexpectedly long time, I have finished the modification of spaln. Tested on more than 100 pairs of genomic and assembled transcript DNA sequences from the DDBJ database, covering various levels of sequence similarity, the new version (Ver.2.4.6) runs without segmentation faults. For protein queries, tests have not been done in this much detail, but it works fine for a few examples. I therefore did not want to delay the release of this version any further.

I thank you for your patience. If you encounter any problems with this or previous versions of spaln, please let me know at your convenience.

Osamu,

ogotoh, Sep 14 '21 02:09