How to estimate the memory requirements for Diamond given the database and query protein sizes?

Open jolespin opened this issue 2 years ago • 4 comments

On our new servers we have to request the amount of memory and time needed for a job. We are charged per thread and per unit of memory requested, for the time the job actually takes (not the time requested). Anyway, I'm trying to minimize costs for a larger job.

I have a database that is 68 GB and a query set of 48,170,345 protein sequences (11 GB gzipped, ~19 GB uncompressed).

I can either do the following:

  1. Run Diamond against all of the proteins at once (I feel like this would be the most expensive)
  2. Split the proteins into 100 files and run each separately (each file is about 189 MB)

Which method would use fewer resources?

How can I estimate how many resources would be required per job?

jolespin, Dec 13 '22 19:12

The memory needed depends on the options -b and -c, it is roughly 20*b/c. I would not recommend splitting into files this size since diamond is more efficient when running on larger files. ~2 GB is more reasonable unless you want to search at very high sensitivity.

bbuchfink, Dec 19 '22 13:12

> I would not recommend splitting into files this size since diamond is more efficient when running on larger files. ~2 GB is more reasonable unless you want to search at very high sensitivity.

Oh, OK, good to know, thank you.

> The memory needed depends on the options -b and -c, it is roughly 20*b/c.

Are there any rough formulas you use when estimating memory consumption based on the -b -c parameters below, the database size, and the query size?

--block-size (-b)        sequence block size in billions of letters (default=2.0)
--index-chunks (-c)      number of chunks for index processing (default=4)
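
For anyone else sizing a job request, here is a minimal Python sketch of the 20*b/c rule of thumb quoted above. The helper name `estimate_memory_gb` is made up for illustration, and reading the result as gigabytes is an assumption (based on -b being given in billions of letters). It ignores the database and query sizes, so treat it as a starting point and add headroom.

```python
def estimate_memory_gb(block_size: float = 2.0, index_chunks: int = 4) -> float:
    """Rough memory estimate from the thread's rule of thumb: 20 * b / c.

    block_size   -- value passed to --block-size / -b (billions of letters)
    index_chunks -- value passed to --index-chunks / -c
    The unit (GB) is an assumption, not something the formula states.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    if index_chunks < 1:
        raise ValueError("index_chunks must be at least 1")
    return 20.0 * block_size / index_chunks


# Defaults shown in the help text: -b 2.0 -c 4
print(estimate_memory_gb())        # 10.0
# A larger block with a single index chunk: -b 8 -c 1
print(estimate_memory_gb(8, 1))    # 160.0
```

With the defaults this gives about 10; raising -b or lowering -c scales the estimate linearly.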

jolespin, Dec 19 '22 18:12

> Are there any rough formulas you use when estimating memory consumption based on the -b -c parameters below

Yes, see above.

bbuchfink, Dec 25 '22 09:12

Hello

I have a large protein sequence file, summarized below, with a total length (sum_len) of 10,885,629,915 residues.

>  file                          format  type       num_seqs         sum_len  min_len  avg_len  max_len
> non_redundancy_protein.fasta  FASTA   Protein  56,324,313  10,885,629,915       34    193.3   14,951

I ran diamond blastp against the NCBI NR database as follows:

nohup diamond blastp -d nr_20230728.dmnd -q ../07rm_redundancy/07partial_cdhit2/non_redundancy_protein.fasta --outfmt 6 --max-target-seqs 5 -e 1e-10 --query-cover 80 --id 50 --threads 140 -c 1 -b 16 -o diamond_annotation_nr.tsv > diamond_log.txt 2>&1 &

It seems diamond needs too much time to finish. I'd like to know how many query blocks this command will process.

I would appreciate your help with this question.

nohup: ignoring input
diamond v2.1.8.162 (C) Max Planck Society for the Advancement of Science, Benjamin Buchfink, University of Tuebingen
Documentation, support and updates available at http://www.diamondsearch.org
Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)


#CPU threads: 140
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
Temporary directory: 
#Target sequences to report alignments for: 5
Opening the database...  [0.074s]
Database: /home/adm/database/NCBI/NCBI_NR/nr_20230728.dmnd (type: Diamond database, sequences: 595907626, letters: 234169316349)
Block size = 16000000000
Opening the input file...  [0.034s]
Opening the output file...  [0s]
Loading query sequences...  [56.861s]
Masking queries...  [10.58s]
Algorithm: Double-indexed
Building query histograms...  [7.472s]
Seeking in database...  [0s]
Loading reference sequences...  [30.694s]
Masking reference...  [17.357s]
Initializing dictionary...  [0.075s]
Initializing temporary storage...  [0s]
Building reference histograms...  [10.244s]
Allocating buffers...  [0.001s]
Processing query block 1, reference block 1/15, shape 1/2.
Building reference seed array...  [6.012s]
Building query seed array...  [5.681s]
Computing hash join...  [20.388s]
Masking low complexity seeds...  [3.321s]
Searching alignments...  [1395.4s]
Deallocating memory...  [0s]
Processing query block 1, reference block 1/15, shape 2/2.
Building reference seed array...  [4.54s]
Building query seed array...  [6.068s]
Computing hash join...  [31.181s]
Masking low complexity seeds...  [2.292s]
Searching alignments...  [1199.63s]
Deallocating memory...  [0s]
Deallocating buffers...  [9.142s]
Clearing query masking...  [3.581s]
Opening temporary output file...  [0s]
Computing alignments... Loading trace points...  [353.293s]
Sorting trace points...  [98.201s]
Computing alignments...  [1444.16s]
Deallocating buffers...  [20.527s]
Loading trace points...  [0.014s]
Sorting trace points...  [108.536s]
Computing alignments...  [1457.22s]
Deallocating buffers...  [31.078s]
Loading trace points...  [0.036s]
Sorting trace points...  [83.7s]
Computing alignments...  [1138.63s]
Deallocating buffers...  [11.271s]
Loading trace points...  [0.036s]
Sorting trace points...  [101.461s]
Computing alignments...  [1432.87s]
Deallocating buffers...  [21.436s]
Loading trace points...  [0.047s]
Sorting trace points...  [103.237s]
Computing alignments...  [1348.66s]
Deallocating buffers...  [21.272s]
Loading trace points...  [0.007s]
Sorting trace points...  [127.472s]
Computing alignments...  [1707.93s]
Deallocating buffers...  [24.559s]
Loading trace points...  [0.034s]
Sorting trace points...  [117.072s]
Computing alignments...  [1555.41s]
Deallocating buffers...  [19.418s]
Loading trace points...  [0.049s]
Sorting trace points...  [122.935s]
Computing alignments...  [1554.81s]
Deallocating buffers...  [23.619s]
Loading trace points...  [0.023s]
Sorting trace points...  [109.928s]
Computing alignments...  [1468.24s]
Deallocating buffers...  [19.654s]
Loading trace points...  [0.032s]
Sorting trace points...  [106.685s]
Computing alignments...  [1403.99s]
Deallocating buffers...  [22.997s]
Loading trace points...  [0.049s]
Sorting trace points...  [105.344s]
Computing alignments...  [1378.31s]
Deallocating buffers...  [17.975s]
Loading trace points...  [0.041s]
Sorting trace points...  [99.973s]
Computing alignments...  [1339.73s]
Deallocating buffers...  [15.575s]
Loading trace points...  [0.006s]
Sorting trace points...  [110.233s]
Computing alignments...  [1421.66s]
Deallocating buffers...  [25.309s]
Loading trace points...  [0.03s]
Sorting trace points...  [99.521s]
Computing alignments...  [1433.63s]
Deallocating buffers...  [17.191s]
Loading trace points...  [0.01s]
Sorting trace points...  [87.972s]
Computing alignments...  [1277.04s]
Deallocating buffers...  [12.884s]
Loading trace points...  [0s]
Sorting trace points...  [120.664s]
Computing alignments...  [1293.89s]
Deallocating buffers...  [22.838s]
Loading trace points...  [0s]
 [25040.5s]
Deallocating reference...  [0.069s]
Loading reference sequences...  [33.603s]
Masking reference...  [16.284s]
Initializing dictionary...  [0.077s]
Initializing temporary storage...  [0.01s]
Building reference histograms...  [10.543s]
Allocating buffers...  [0.001s]
Processing query block 1, reference block 2/15, shape 1/2.
Building reference seed array...  [6.667s]
Building query seed array...  [6.038s]
Computing hash join...  [22.306s]
Masking low complexity seeds...  [2.497s]
Searching alignments...  [1417.19s]
Deallocating memory...  [0s]
Processing query block 1, reference block 2/15, shape 2/2.
Building reference seed array...  [4.602s]
Building query seed array...  [3.34s]
Computing hash join...  [66.474s]
Masking low complexity seeds...  [2.595s]
Searching alignments...  [1208.88s]
Deallocating memory...  [0s]
Deallocating buffers...  [1.733s]
Clearing query masking...  [3.206s]
Opening temporary output file...  [0s]
Computing alignments... Loading trace points...  [360.445s]
Sorting trace points...  [122.79s]
Computing alignments...  [1581.5s]
Deallocating buffers...  [22.937s]
Loading trace points...  [0.038s]
Sorting trace points...  [131.016s]
Computing alignments...  [1712.84s]
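
A rough way to sanity-check the block counts from the numbers above, as a Python sketch. It assumes the number of blocks is simply the total letter count divided by the block size, rounded up; the helper name `count_blocks` is made up for illustration, and diamond's real partitioning may differ slightly because blocks have to end on sequence boundaries. The letter counts are taken from the sequence summary and from the "Database:" and "Block size" log lines.

```python
import math


def count_blocks(total_letters: int, block_size_billions: float) -> int:
    """Estimate the block count as ceil(total letters / letters per block).

    Assumption: blocks are filled up to the -b limit; this is a simplification.
    """
    letters_per_block = block_size_billions * 1e9
    return math.ceil(total_letters / letters_per_block)


query_letters = 10_885_629_915    # sum_len of non_redundancy_protein.fasta
db_letters = 234_169_316_349      # "letters:" in the Database log line
block_size = 16.0                 # -b 16, i.e. "Block size = 16000000000"

query_blocks = count_blocks(query_letters, block_size)
reference_blocks = count_blocks(db_letters, block_size)

# Each query block is processed against every reference block.
print(query_blocks, reference_blocks, query_blocks * reference_blocks)  # 1 15 15
```

Under that assumption the whole query fits in a single query block, and the 15 reference blocks agree with the "reference block 1/15" lines in the log.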

KJ-Ma, Nov 28 '23 02:11