How many processing query blocks are there in total?
Hello,
I am running diamond with these parameters:
for seq in `cat list_t.txt`
do
echo $seq
diamond blastx -d $DIAMOND_NR -q 'fastq-join/'$seq'_merged.fastq' -o $seq'.daa' -f 100
done
and in the terminal I see this:
#CPU threads: 20
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
Temporary directory:
#Target sequences to report alignments for: 25
Opening the database... [0.41s]
Database: /sw/data/diamond_databases/Blast/latest/nr.dmnd (type: Diamond database, sequences: 488073487, letters: 188883749278)
Block size = 2000000000
Opening the input file... [0.311s]
Opening the output file... [0.001s]
Loading query sequences... [25.424s]
Masking reference... [16.125s]
Initializing dictionary... [0.009s]
Initializing temporary storage... [0.009s]
Building reference histograms... [17.566s]
Allocating buffers... [0s]
Processing query block 1, reference block 1/95, shape 1/2, index chunk 1/4.
Building reference seed array... [5.396s]
Building query seed array... [4.49s]
Computing hash join... [4.521s]
Masking low complexity seeds... [1.689s]
Searching alignments... [27.369s]
Processing query block 1, reference block 1/95, shape 1/2, index chunk 2/4.
Building reference seed array... [6.79s]
... Processing query block 1 took about one day, and I see that it has just started on query block 2. My question is: how many query blocks will there be in total? How can I know this?
I would appreciate your help with this question.
Processing query block 2, reference block 13/95, shape 1/2, index chunk 1/4.
Building reference seed array... [5.431s]
Building query seed array... [4.65s]
Computing hash join... [5.072s]
Masking low complexity seeds... [1.868s]
Searching alignments... [28.261s]
Processing query block 2, reference block 13/95, shape 1/2, index chunk 2/4.
Building reference seed array... [6.71s]
Building query seed array... [5.689s]
Computing hash join... [4.538s]
Masking low complexity seeds... [1.597s]
Thanks in advance.
To compute the number of query blocks, multiply the number of DNA letters in the input file by 2 and divide by the block size (2000000000 in your case).
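To illustrate, here is a quick sketch of that calculation on the command line using awk (in a FASTQ file, every fourth line starting at line 2 is a sequence line). The toy file and the name `reads.fastq` are just for demonstration:

```shell
# Toy FASTQ with 2 reads of 10 bp each, only to demonstrate the counting.
printf '@r1\nACGTACGTAC\n+\nIIIIIIIIII\n@r2\nTTTTAAAACC\n+\nIIIIIIIIII\n' > reads.fastq

# Sum the lengths of the sequence lines (every 4th line, offset 2).
letters=$(awk 'NR % 4 == 2 { n += length($0) } END { print n }' reads.fastq)
echo "letters=$letters"    # letters=20 for this toy file

# Estimated query blocks = letters * 2 / block size (2000000000 in the log above).
awk -v n="$letters" 'BEGIN { printf "query blocks = %.2f\n", n * 2 / 2000000000 }'
```

With a real library, point the first awk at your merged FASTQ instead of the toy file.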
OMG, that means my analysis has 7.69 query blocks.
My library has 51,307,345 sequences × 150 bp = 7,696,101,750 letters (bp), so 7,696,101,750 × 2 / 2,000,000,000 = 7.69 query blocks.
If my calculations are correct, my analysis will take about 7.7 days to process. Right?
I will have to adjust some parameters to speed up the analysis process.
Yes, that seems correct. The easiest way to reduce runtime would be to use a smaller database if that works for you, e.g. UniRef50 or AnnoTree; see here: https://journals.asm.org/doi/full/10.1128/msystems.01408-21
To make DIAMOND faster, you can use -c1 and increase the block size, e.g. -b4 or -b6, but this will also increase memory use.
Using global ranking can also help, e.g. -g300 combined with -f 6 qseqid sseqid evalue, to save time during traceback.
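For one library from the loop above, the suggestions could be combined like this (a sketch, not a tuned command: -b6/-c1 increase memory use, -g300 is an example ranking cutoff, and the .tsv extension is just a suggestion to match the -f 6 tabular output):

```
diamond blastx -d "$DIAMOND_NR" \
  -q fastq-join/"$seq"_merged.fastq \
  -o "$seq".tsv \
  -b6 -c1 -g300 \
  -f 6 qseqid sseqid evalue
```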
Thanks for your help.
I am running my analysis with -c4 and -b8, and it seems to have reduced the time to about 1.8 days per library :).
I will look into your other suggestions too.
(I tried -c1, but I do not have enough memory, and I cannot request more from my institution this month :( .)