How to estimate the memory requirements for Diamond given the database and query protein sizes?
On our new servers we have to request the amount of memory and time needed for a job. We are charged per thread per memory requirement for the time taken to complete the job (not the time requested). Anyways, I'm trying to minimize costs for a larger job.
I have a database that is 68G and 48170345 protein sequences (11GB gzipped, ~19GB uncompressed).
I can either do the following:
- Run Diamond against all of the proteins at once (I feel like this would be the most expensive)
- Split into 100 files and run them separately (each one is about 189 MB)
Which method would use less resources?
How can I estimate how many resources would be required per job?
The memory needed depends on the options -b and -c; it is roughly 20*b/c. I would not recommend splitting into files this size since diamond is more efficient when running on larger files. ~2 GB is more reasonable unless you want to search at very high sensitivity.
> I would not recommend splitting into files this size since diamond is more efficient when running on larger files. ~2 GB is more reasonable unless you want to search at very high sensitivity.
Oh ok good to know thank you.
> The memory needed depends on the options -b and -c, it is roughly 20*b/c.
Are there any rough formulas you use when estimating memory consumption based on the -b -c parameters below, the database size, and the query size?
--block-size (-b) sequence block size in billions of letters (default=2.0)
--index-chunks (-c) number of chunks for index processing (default=4)
> Are there any rough formulas you use when estimating memory consumption based on the -b -c parameters below
Yes, see above.
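For anyone who wants to plug numbers into the rule of thumb above, here is a minimal sketch. It only encodes the 20*b/c estimate quoted in this thread; the function name is made up for illustration, and treating the result as gigabytes is an assumption, since the thread does not state the unit.

```python
def estimate_memory_gb(block_size: float = 2.0, index_chunks: int = 4) -> float:
    """Rough memory estimate from the thread's rule of thumb: 20 * b / c.

    block_size   -- value passed to --block-size / -b (billions of letters)
    index_chunks -- value passed to --index-chunks / -c
    Assumption: the result is read as gigabytes (not stated in the thread).
    """
    if block_size <= 0 or index_chunks < 1:
        raise ValueError("block_size must be > 0 and index_chunks >= 1")
    return 20.0 * block_size / index_chunks


# With the defaults shown in the help text (-b 2.0, -c 4):
default_estimate = estimate_memory_gb()
print(f"approx. {default_estimate:.1f} GB")  # approx. 10.0 GB
```

Lowering -b or raising -c reduces the estimate, so these are the two knobs to adjust when a job has to fit a fixed memory request. Actual usage should still be checked on a small test run.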
Hello
I have a large protein sequence file (stats below) with a total length (sum_len) of 10,885,629,915 residues.
> file                          format  type     num_seqs    sum_len         min_len  avg_len  max_len
> non_redundancy_protein.fasta  FASTA   Protein  56,324,313  10,885,629,915  34       193.3    14,951
I ran diamond blastp against the NCBI NR database as follows:
nohup diamond blastp -d nr_20230728.dmnd -q ../07rm_redundancy/07partial_cdhit2/non_redundancy_protein.fasta --outfmt 6 --max-target-seqs 5 -e 1e-10 --query-cover 80 --id 50 --threads 140 -c 1 -b 16 -o diamond_annotation_nr.tsv > diamond_log.txt 2>&1 &
It seems diamond needs too much time to finish. I'd like to know how many query blocks this command will run.
I would appreciate your help with this question.
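As a rough check on the block-count question, the sketch below assumes that diamond cuts each input into blocks of about b billion letters, following the -b help text quoted earlier in the thread. The function name is hypothetical, and real block boundaries fall between sequences, so treat the result as an approximation. The query letter count comes from the sum_len above and the database letter count from the "Database:" line of the log.

```python
import math


def count_blocks(total_letters: int, block_size: float) -> int:
    """Approximate number of blocks for a given -b value.

    Assumption: one block holds about block_size * 1e9 letters.
    """
    letters_per_block = block_size * 1e9
    return math.ceil(total_letters / letters_per_block)


# Values from this post: -b 16, query sum_len, and database letters from the log.
query_blocks = count_blocks(10_885_629_915, 16)
reference_blocks = count_blocks(234_169_316_349, 16)

print(query_blocks)      # 1
print(reference_blocks)  # 15
```

This agrees with the "Processing query block 1, reference block 1/15" lines in the log: the whole query fits in a single block, and that block is compared against 15 reference blocks in turn.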
nohup: ignoring input
diamond v2.1.8.162 (C) Max Planck Society for the Advancement of Science, Benjamin Buchfink, University of Tuebingen
Documentation, support and updates available at http://www.diamondsearch.org
Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)
#CPU threads: 140
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
Temporary directory:
#Target sequences to report alignments for: 5
Opening the database... [0.074s]
Database: /home/adm/database/NCBI/NCBI_NR/nr_20230728.dmnd (type: Diamond database, sequences: 595907626, letters: 234169316349)
Block size = 16000000000
Opening the input file... [0.034s]
Opening the output file... [0s]
Loading query sequences... [56.861s]
Masking queries... [10.58s]
Algorithm: Double-indexed
Building query histograms... [7.472s]
Seeking in database... [0s]
Loading reference sequences... [30.694s]
Masking reference... [17.357s]
Initializing dictionary... [0.075s]
Initializing temporary storage... [0s]
Building reference histograms... [10.244s]
Allocating buffers... [0.001s]
Processing query block 1, reference block 1/15, shape 1/2.
Building reference seed array... [6.012s]
Building query seed array... [5.681s]
Computing hash join... [20.388s]
Masking low complexity seeds... [3.321s]
Searching alignments... [1395.4s]
Deallocating memory... [0s]
Processing query block 1, reference block 1/15, shape 2/2.
Building reference seed array... [4.54s]
Building query seed array... [6.068s]
Computing hash join... [31.181s]
Masking low complexity seeds... [2.292s]
Searching alignments... [1199.63s]
Deallocating memory... [0s]
Deallocating buffers... [9.142s]
Clearing query masking... [3.581s]
Opening temporary output file... [0s]
Computing alignments... Loading trace points... [353.293s]
Sorting trace points... [98.201s]
Computing alignments... [1444.16s]
Deallocating buffers... [20.527s]
Loading trace points... [0.014s]
Sorting trace points... [108.536s]
Computing alignments... [1457.22s]
Deallocating buffers... [31.078s]
Loading trace points... [0.036s]
Sorting trace points... [83.7s]
Computing alignments... [1138.63s]
Deallocating buffers... [11.271s]
Loading trace points... [0.036s]
Sorting trace points... [101.461s]
Computing alignments... [1432.87s]
Deallocating buffers... [21.436s]
Loading trace points... [0.047s]
Sorting trace points... [103.237s]
Computing alignments... [1348.66s]
Deallocating buffers... [21.272s]
Loading trace points... [0.007s]
Sorting trace points... [127.472s]
Computing alignments... [1707.93s]
Deallocating buffers... [24.559s]
Loading trace points... [0.034s]
Sorting trace points... [117.072s]
Computing alignments... [1555.41s]
Deallocating buffers... [19.418s]
Loading trace points... [0.049s]
Sorting trace points... [122.935s]
Computing alignments... [1554.81s]
Deallocating buffers... [23.619s]
Loading trace points... [0.023s]
Sorting trace points... [109.928s]
Computing alignments... [1468.24s]
Deallocating buffers... [19.654s]
Loading trace points... [0.032s]
Sorting trace points... [106.685s]
Computing alignments... [1403.99s]
Deallocating buffers... [22.997s]
Loading trace points... [0.049s]
Sorting trace points... [105.344s]
Computing alignments... [1378.31s]
Deallocating buffers... [17.975s]
Loading trace points... [0.041s]
Sorting trace points... [99.973s]
Computing alignments... [1339.73s]
Deallocating buffers... [15.575s]
Loading trace points... [0.006s]
Sorting trace points... [110.233s]
Computing alignments... [1421.66s]
Deallocating buffers... [25.309s]
Loading trace points... [0.03s]
Sorting trace points... [99.521s]
Computing alignments... [1433.63s]
Deallocating buffers... [17.191s]
Loading trace points... [0.01s]
Sorting trace points... [87.972s]
Computing alignments... [1277.04s]
Deallocating buffers... [12.884s]
Loading trace points... [0s]
Sorting trace points... [120.664s]
Computing alignments... [1293.89s]
Deallocating buffers... [22.838s]
Loading trace points... [0s]
[25040.5s]
Deallocating reference... [0.069s]
Loading reference sequences... [33.603s]
Masking reference... [16.284s]
Initializing dictionary... [0.077s]
Initializing temporary storage... [0.01s]
Building reference histograms... [10.543s]
Allocating buffers... [0.001s]
Processing query block 1, reference block 2/15, shape 1/2.
Building reference seed array... [6.667s]
Building query seed array... [6.038s]
Computing hash join... [22.306s]
Masking low complexity seeds... [2.497s]
Searching alignments... [1417.19s]
Deallocating memory... [0s]
Processing query block 1, reference block 2/15, shape 2/2.
Building reference seed array... [4.602s]
Building query seed array... [3.34s]
Computing hash join... [66.474s]
Masking low complexity seeds... [2.595s]
Searching alignments... [1208.88s]
Deallocating memory... [0s]
Deallocating buffers... [1.733s]
Clearing query masking... [3.206s]
Opening temporary output file... [0s]
Computing alignments... Loading trace points... [360.445s]
Sorting trace points... [122.79s]
Computing alignments... [1581.5s]
Deallocating buffers... [22.937s]
Loading trace points... [0.038s]
Sorting trace points... [131.016s]
Computing alignments... [1712.84s]