
Diamond blastx out-of-memory

Open Tom-Jenkins opened this issue 3 years ago • 21 comments

Hi, I want to run diamond blastx on a nr protein database created using the following commands:

wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz
diamond makedb --in nr.gz -d nr

My query is a 1.7G FASTA file and the nr.dmnd database file is 153G. According to the logfile of prior runs, "The host system is detected to have 134 GB of RAM".

However, I keep getting errors (not always the same one), and they all seem to be memory-related. I have adjusted the -b and -c parameters, but the errors persist. I have attached the logfile of my latest run and was hoping you could help me solve this issue. Thank you in advance.

diamond blastx -d ~/aqualeap/databases/nr -q ../curated.fasta --outfmt 100 -o diamond_nr_contigs.daa -t /tmp/ --salltitles -F 15 --range-culling --top 10 -p 16 -b 0.4

Error:

Computing alignments... /var/spool/slurmd/job87627/slurm_script: line 14:   431 Killed                  diamond blastx -d ~/aqualeap/databases/nr -q ../curated.fasta -a diamond_nr_contigs.daa -t /tmp/ --salltitles -F 15 --range-culling --top 10 -p 16 -b 0.4
slurmstepd: error: Detected 1 oom-kill event(s) in step 87627.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

slurm-87627.out.txt

Tom-Jenkins avatar Oct 14 '20 10:10 Tom-Jenkins

How much memory have you allocated to the job in your slurm submit script? It could be that frameshift alignments or range culling lead to increased memory usage. Could you try without these options? How long is your longest query?

bbuchfink avatar Oct 14 '20 11:10 bbuchfink

Here is my slurm script:

#!/bin/bash

#SBATCH --export=ALL # export all environment variables to the batch job
#SBATCH -D . # set working directory to .
#SBATCH -p pq # submit to the parallel queue
#SBATCH --time=12:00:00 # maximum walltime for the job
#SBATCH -A Research_Project-T109743 # research project to submit under
#SBATCH --nodes=1 # specify number of nodes
#SBATCH --ntasks-per-node=16 # specify number of processors per node
#SBATCH -p highmem

I have used both the high memory node (32 cores, 3 TB) and the standard node (16 cores, 128 GB) and got the same errors. Do I need to ask for more memory, even on the high memory node?

My longest query is 15.5 Mbp. I have just submitted the script without the -F and --range-culling parameters and it does seem to be running OK so far:

diamond blastx -d ~/aqualeap/databases/nr -q ../curated.fasta --outfmt 100 -o diamond_nr_contigs.daa -t /tmp/ --salltitles --top 10 -p 16

Tom-Jenkins avatar Oct 14 '20 11:10 Tom-Jenkins

Yes, I think you probably need to request more memory in your submit script.
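For example, adding an explicit memory request to the submit script would do it; just a sketch, with a placeholder value that you should adjust to whatever your queue allows:

#SBATCH --mem=500G # example value; set it to what the node/partition actually permits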

bbuchfink avatar Oct 14 '20 11:10 bbuchfink

Unfortunately, even with the high memory node and 1000G memory (the maximum I can request) it runs out of memory after 4 1/2 hours of run time. My slurm script is below and I've attached the logfile. Is there any way I can execute diamond blastx with these files using -F and --range-culling without consuming so much memory? I have also tried adjusting the -b parameter to 1 but that doesn't seem to help.

#!/bin/bash

#SBATCH --export=ALL # export all environment variables to the batch job
#SBATCH -D . # set working directory to .
#SBATCH -p pq # submit to the parallel queue
#SBATCH --time=12:00:00 # maximum walltime for the job
#SBATCH -A Research_Project-T109743 # research project to submit under
#SBATCH --nodes=1 # specify number of nodes
#SBATCH --ntasks-per-node=16 # specify number of processors per node
#SBATCH -p highmem
#SBATCH --mem=1000G
#SBATCH --mail-type=END # send email at job completion
#SBATCH [email protected] # email address

# Commands
diamond blastx -d ~/aqualeap/databases/nr -q ../curated.fasta --outfmt 100 -o diamond_nr_contigs.daa -t /tmp/ --salltitles -F 15 --range-culling --top 10 -p 16 -c 1 -b 10

slurm-88567.out.txt

Tom-Jenkins avatar Oct 15 '20 13:10 Tom-Jenkins

I'm not sure what causes this high memory use and will have to look into it. If you want, you can send me your query file so I can try to reproduce your run.

bbuchfink avatar Oct 15 '20 13:10 bbuchfink

Thank you for looking into this. The file is too big to upload, can I send it to your email via WeTransfer?

Tom-Jenkins avatar Oct 16 '20 09:10 Tom-Jenkins

Sure, my email is [email protected]

bbuchfink avatar Oct 16 '20 09:10 bbuchfink

It was the DP matrices in traceback mode that were using up too much memory. This should fix the issue: 199cd79732060bda7356417e8ae8c27e75157570

Using this I was able to run your dataset with about 40 GB of memory use (with the default block size of 2, which is fine).
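To pick up the fix before the next release, you can build from source at or after that commit. Roughly (a sketch; see the repository README for the exact, current build steps):

git clone https://github.com/bbuchfink/diamond.git
cd diamond
git checkout 199cd79732060bda7356417e8ae8c27e75157570 # or simply build the latest master
mkdir build && cd build
cmake ..
make -j8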

bbuchfink avatar Oct 18 '20 11:10 bbuchfink

Sorry to be a nuisance, but I still seem to have an error after re-installing diamond and re-running diamond blastx.

Computing alignments... /var/spool/slurmd/job95338/slurm_script: line 15: 26030 Bus error (core dumped) diamond blastx -d ~/aqualeap/databases/nr -q ../curated.fasta --outfmt 100 -o diamond_nr_contigs2.daa -t /tmp/ --salltitles -F 15 --range-culling --top 10 -p 8
Isca HPC: Slurm Job_id=95338 Name=isca-diamond2.sh Ended, Run time 00:22:45, FAILED, ExitCode 135

Slurm script:

#!/bin/bash

#SBATCH --export=ALL # export all environment variables to the batch job
#SBATCH -D . # set working directory to .
#SBATCH -p pq # submit to the parallel queue
#SBATCH --time=24:00:00 # maximum walltime for the job
#SBATCH -A Research_Project-T109743 # research project to submit under
#SBATCH --nodes=1 # specify number of nodes
#SBATCH --ntasks-per-node=8 # specify number of processors per node
#SBATCH -p highmem
#SBATCH --mail-type=END # send email at job completion
#SBATCH [email protected] # email address

# Commands
diamond blastx -d ~/aqualeap/databases/nr -q ../curated.fasta --outfmt 100 -o diamond_nr_contigs2.daa -t /tmp/ --salltitles -F 15 --range-culling --top 10 -p 8

I have attached the logfile. slurm-95338.out.txt

Tom-Jenkins avatar Oct 21 '20 14:10 Tom-Jenkins

Bus error does not seem like a memory problem any more. How much free space does your /tmp/ folder have?
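Something like this, run on the compute node, would show it:

df -h /tmp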

bbuchfink avatar Oct 21 '20 15:10 bbuchfink

I just re-ran the same command, but without -t /tmp/, and got this error:

Computing alignments... /var/spool/slurmd/job95541/slurm_script: line 15: 831 Killed diamond blastx -d ~/aqualeap/databases/nr -q ../curated.fasta --outfmt 100 -o diamond_nr_contigs2.daa --salltitles -F 15 --range-culling --top 10 -p 8
slurmstepd: error: Detected 1 oom-kill event(s) in step 95541.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

In terms of free space, I have quite a lot:

Filesystem      Size  Used Avail Use% Mounted on
ts0              10T  6.3T  3.8T  63% /gpfs/ts0

Tom-Jenkins avatar Oct 21 '20 15:10 Tom-Jenkins

I'm not sure, since I tested it with the same file and it worked fine. Please double-check that you have cloned the latest version of the repo, compiled it from source, and are running that version of Diamond.

bbuchfink avatar Oct 21 '20 15:10 bbuchfink

Hi,

I have a similar problem. I have been using Diamond (first v2.0.9, now v2.0.12) to search a collection of assemblies against a set of sequences I collected myself (about 150 sequences of roughly 1500 bp each). Things work fine for the most part. However, on some of the bigger assemblies I run out of memory, even with 1000G. Initially I used block-size 6 and index-chunks 1, but after reading the comments above I changed this to block-size 2 and index-chunks 4. The documentation mentions that these parameters are pivotal for performance and memory usage. Should I understand from this that if I tune them down, trading performance, the memory usage will go down as well?

It is worth mentioning that I am also using --frameshift 15, as we discussed in #458.

This is my current setup:

gzip --decompress --stdout ${inDir}/${assemblyT}.fasta.gz | \
  diamond blastx \
    --db ${libraryDir}/${libraryT}.dmnd \
    --query - \
    --frameshift 15 \
    --block-size 2 \
    --index-chunks 4 \
    --out ${outDir}/${species}.tsv
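If tuning them down is the right direction, this is the lower-memory variant I would try next (just a sketch; my assumption is that a smaller --block-size loads less sequence data per pass and more --index-chunks keeps a smaller seed index in memory):

gzip --decompress --stdout ${inDir}/${assemblyT}.fasta.gz | \
  diamond blastx \
    --db ${libraryDir}/${libraryT}.dmnd \
    --query - \
    --frameshift 15 \
    --block-size 1 \
    --index-chunks 4 \
    --out ${outDir}/${species}.tsv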

Any suggestion would be highly appreciated.

DanielRivasMD avatar Oct 25 '21 06:10 DanielRivasMD

How big are your assemblies and how many threads do you run?

bbuchfink avatar Oct 25 '21 07:10 bbuchfink

Thanks for your reply. I run 16 CPU threads with 1000G of memory, though with less memory (128 GB) I could run 28 CPU threads. One of the assemblies is 3.6 GB uncompressed.

DanielRivasMD avatar Oct 25 '21 07:10 DanielRivasMD

How long is the longest contig?

bbuchfink avatar Oct 25 '21 07:10 bbuchfink

For this particular assembly, these are the specs:

karyotype: 				2n=18

contigN50: 				107,955
totalContigLength:		3,499,615,818
longestContig:			1,055,336
numberOfContigs:		72,993

scaffoldN50:			524,289,849
totalScaffoldLength:	3,573,327,505
longestScaffold:		747,302,727
numberOfScaffolds:		5,136

DanielRivasMD avatar Oct 25 '21 07:10 DanielRivasMD

The longest queries I tested were bacterial chromosomes, but queries of >700 Mb can easily break the current code. I do plan to rework the blastx mode, probably in the next few weeks, but I can't offer you an easy solution right now. These may be options that work:

- Extract ORFs and run the blastp mode on them.
- Chop the sequences into overlapping ~100 kb windows and run blastx on them (see the sketch below).
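For the windowing approach, a minimal sketch with no extra dependencies (plain awk; the file names are placeholders, W is the window length and S the step, so W - S is the overlap, which you may want to increase if you expect very long alignments):

awk -v W=100000 -v S=90000 '
  # emit_windows() prints the buffered sequence as overlapping windows,
  # tagging each window ID with its 1-based coordinates on the original contig
  function emit_windows(   start, end, len) {
    len = length(seq)
    for (start = 1; start <= len; start += S) {
      end = start + W - 1
      if (end > len) end = len
      printf(">%s:%d-%d\n%s\n", id, start, end, substr(seq, start, end - start + 1))
      if (end == len) break
    }
  }
  /^>/ { if (seq != "") emit_windows(); id = substr($1, 2); seq = ""; next }
  { seq = seq $0 }
  END { if (seq != "") emit_windows() }
' assembly.fasta > assembly.windows.fasta

Running blastx on assembly.windows.fasta as before then keeps every query short, and the coordinates embedded in the window IDs let you map hits back to the original scaffolds afterwards.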

bbuchfink avatar Oct 25 '21 08:10 bbuchfink

I see. I had thought about the second option. Another alternative I considered was to run each scaffold independently, but I guess that would not work, since the problem seems to be the length, correct?

I will try as you suggest. Thanks a lot for your input, and please let me know when you update blastx.

DanielRivasMD avatar Oct 25 '21 08:10 DanielRivasMD

You could try that too but I assume that the length is the problem.

May I also ask why extracting ORFs is not an option for you? Are you looking for alignments that span over stop codons?

bbuchfink avatar Oct 25 '21 08:10 bbuchfink

I thought so.

I will definitely try extracting ORFs as well. I just had not thought about it.

DanielRivasMD avatar Oct 25 '21 08:10 DanielRivasMD