large files

Open alaraints opened this issue 4 years ago • 8 comments

Hello, when analyzing large input files (20-60 GB), Diamond invariably crashes at the end, I suppose when it is time to write the results. I am running with 60 GB of memory and block size 6, so there is plenty of memory most of the time. Is there a way to make Diamond write output sequentially? Or is there some other problem? Max input file size limit? Best regards,

Alar Aints

alaraints avatar Nov 29 '21 07:11 alaraints

How many sequences do your files contain?

bbuchfink avatar Nov 29 '21 08:11 bbuchfink

100 Million - 200 Million. Trying to run BlastX against Refseq

Greetings,

Alar

alaraints avatar Nov 29 '21 09:11 alaraints

You may want to try a smaller block size. Otherwise, I'm not sure why this might crash and need to run some tests, but this may take time.

bbuchfink avatar Nov 29 '21 13:11 bbuchfink

Hello Dr Buchfink,

the error persists. I have upgraded Conda and Diamond, reduced the block size to 5 and input file size to 10 million reads, but still the program crashes. The last two lines of the report file are always:

Opening temporary output file... [0.101s]
Computing alignments...
/var/spool/slurm/slurmd/job24517243/slurm_script: line 516: 174051 Bus error

Any suggestions?

Best regards,

Alar Aints

alaraints avatar Dec 07 '21 16:12 alaraints

Hello again - I ran the script again with the same Diamond parameters, but with a small file, 60 k reads. This time it worked. It took 21 GB of memory. It appears to me that the program is trying to allocate memory based on the file size, not the block size, for computing alignments. (Should it even compute alignments when --outfmt 6 is specified?) Best regards,

Alar Aints.

alaraints avatar Dec 10 '21 10:12 alaraints

The memory use should only depend on block size, not file size. You can try to further reduce the block size. Another option to reduce memory use is --bin, for example you can try --bin 64.

bbuchfink avatar Dec 13 '21 13:12 bbuchfink

Hello, thank you for the reply. However, it is quite evident that the memory use depends on the file size. I have now successfully processed some files and monitored the SLURM performance using Kibana. The first set of peaks corresponds to the first file (654 MB, 3 393 711 contigs); the last three peaks correspond to the second file (557 MB, 2 854 680 contigs). Memory use is shown as % of 60 GB. The peaks correspond to computing alignments. For the first set, peak use is 74% (44.4 GB) and the plateau is 36% (21.6 GB). For the second set, the peaks are 66% (40 GB) and the plateau is 35% (21 GB).

Script:
#SBATCH --cpus-per-task=6
#SBATCH --mem=60000

Command line:
diamond blastx --query contigs.fasta --db refseq --out DX_Match.txt --unal 0 --min-orf 1 --un LO.txt -b 5.0 --sensitive -k 1 --threads 6 --evalue 0.01 --max-hsps 2 --outfmt 6

Best regards,

Alar Aints.

alaraints avatar Dec 13 '21 15:12 alaraints
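
The memory figures quoted above follow from converting the reported percentages against the 60 GB allocation. A minimal sketch of that arithmetic, assuming the monitoring percentages are relative to exactly 60 GB (the helper name `percent_to_gb` is hypothetical and not part of Diamond or SLURM):

```python
def percent_to_gb(percent, total_gb=60.0):
    """Convert a memory-use percentage into GB of the assumed total allocation."""
    return round(percent / 100.0 * total_gb, 1)


# Percentages as reported in the thread: peak and plateau for each file.
readings = {
    "file 1 peak": 74,
    "file 1 plateau": 36,
    "file 2 peak": 66,
    "file 2 plateau": 35,
}

for label, pct in readings.items():
    print(f"{label}: {pct}% of 60 GB = {percent_to_gb(pct)} GB")
```

Note that 66% of 60 GB works out to 39.6 GB, which the post rounds to 40 GB.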

Running blastx on contigs is a different story; unfortunately, the current implementation can't handle very long queries well. Using the frameshift mode (-F 15) should work better in these cases.

bbuchfink avatar Dec 13 '21 15:12 bbuchfink
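
The suggestions made in this thread (a smaller block size, `--bin 64`, and frameshift mode `-F 15`) could be combined with the original command line roughly as follows. This is only a sketch: the function `build_diamond_cmd` and its parameter names are hypothetical, the block size of 2.0 is an arbitrary illustration of "smaller", and the thread does not confirm that every original option combines cleanly with frameshift mode.

```python
import shlex


def build_diamond_cmd(query, db, out, block_size=5.0, frameshift=15, bins=None):
    """Assemble a diamond blastx argument list from the options used in the thread."""
    cmd = [
        "diamond", "blastx",
        "--query", query, "--db", db, "--out", out,
        "--unal", "0", "--min-orf", "1", "--un", "LO.txt",
        "-b", str(block_size),
        "--sensitive", "-k", "1", "--threads", "6",
        "--evalue", "0.01", "--max-hsps", "2", "--outfmt", "6",
    ]
    if frameshift is not None:
        # Frameshift mode, suggested for long queries such as contigs.
        cmd += ["-F", str(frameshift)]
    if bins is not None:
        # Suggested as a further way to reduce memory use.
        cmd += ["--bin", str(bins)]
    return cmd


if __name__ == "__main__":
    # Print the command for inspection instead of running it.
    args = build_diamond_cmd("contigs.fasta", "refseq", "DX_Match.txt",
                             block_size=2.0, bins=64)
    print(shlex.join(args))
```

Printing the command first makes it easy to paste into a SLURM script and adjust before committing cluster time to it.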