limit / reduce disk space usage

Open chrishah opened this issue 3 years ago • 3 comments

Expected Behavior

successfully run easy-predict on large chromosome-level genome assembly (within BUSCO)

Current Behavior

metaeuk runs, but eventually fills the disk (5TB), even when I impose a --disk-space-limit of 3TB

Steps to Reproduce (for bugs)

Don't think there's a bug - just looking for a way to limit disk space usage. I have access to a server with 2x Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz (14 Cores / 28 threads per CPU), with 1.5T RAM and atm 5TB of disk space.

Command (within BUSCO): metaeuk easy-predict --threads 14 Neoceratodus_forsteri.fna run_vertebrata_odb10/metaeuk_output/refseq_db_rerun.faa run_vertebrata_odb10/metaeuk_output/rerun_results/Neoceratodus_forsteri.fna run_vertebrata_odb10/metaeuk_output/tmp --max-intron 130000 --max-seq-len 160000 --min-exon-aa 5 --max-overlap 5 --min-intron 1 --overlap 1 -s 6 --slice-search 1 --remove-tmp-files 1 --disk-space-limit 3000G --split-mode 0 --split-memory-limit 1500G

The last few parameters, from --slice-search onwards, were my attempts to reduce disk space usage and limit RAM usage. The rest I can't control, as they are set by BUSCO.
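Since --disk-space-limit did not hold here, one possible safeguard is an external watchdog that polls the size of the tmp directory and stops the run before the disk fills. The following is only a minimal sketch: `dir_size_bytes` and `run_with_disk_cap` are hypothetical helpers written for illustration, not part of MetaEuk, MMseqs2 or BUSCO, and the polling interval is an arbitrary choice.

```python
import os
import subprocess
import time


def dir_size_bytes(path):
    """Sum the apparent sizes of all files below path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # Temporary files may vanish between listing and stat.
                pass
    return total


def run_with_disk_cap(cmd, tmp_dir, cap_bytes, poll_seconds=60):
    """Run cmd and terminate it if tmp_dir grows beyond cap_bytes.

    Returns (return_code, peak_bytes_observed).
    """
    proc = subprocess.Popen(cmd)
    peak = 0
    while proc.poll() is None:
        used = dir_size_bytes(tmp_dir)
        peak = max(peak, used)
        if used > cap_bytes:
            # Hypothetical policy: stop the run instead of filling the disk.
            proc.terminate()
            proc.wait()
            break
        time.sleep(poll_seconds)
    return proc.returncode, peak
```

The command would be passed as an argument list (the metaeuk call above, split on spaces), with the tmp directory from that call and a cap in bytes. The returned peak also gives a record of the maximum tmp usage, which helps when comparing runs with different settings.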

Context

Running metaeuk as part of the BUSCO pipeline (v5.2.1) on a publicly available large Eukaryote genome (Australian lungfish)

Your Environment

Include as many relevant details about the environment you experienced the bug in.

  • Git commit used (The string after "MetaEuk Version:" when you execute MetaEuk without any parameters): metaeuk Version: 4.a0f584d

chrishah, Jan 24 '22 09:01

Hi, I saw that there is the --compress 1 option, and that it has been fixed in the latest release (issue #20). I am assuming this will reduce disk usage and will try it as soon as my current run is done - I have one running with 6TB disk space available now. If you have any other suggestions on how to reduce disk space, please let me know - thanks!

Could I also ask what --slice-search 1 is actually doing? I found it in some post somewhere as a suggestion for when RAM is limited, so I am using it, but I don't really know how it affects the run or whether it is really helpful in my situation. Thanks!

cheers, Christoph

chrishah, Jan 26 '22 08:01

Hi, so the last run with Version: 4.a0f584d actually finished successfully. The only thing I changed was to reduce the number of threads from 14 to 10. I had noticed before that metaeuk writes large files to tmp directories (*.pred, *.aln) during the process and that the files are numbered 0 to nthreads-1, so I thought that reducing the number of threads might reduce the amount of data written to disk. With 14 threads I ran out of disk space at 5T disk usage. With 10 threads the maximum disk usage was 2.6T. I don't really understand why, but these were my observations, and I am happy that it ran through in the end. It ran for about 170 hours. With respect to RAM, the limit I imposed with --split-memory-limit 1500G seems to have worked nicely: metaeuk maxed out the RAM at times (rss 1.5T) but didn't run out. Thanks!
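The two disk usage figures can be compared against a simple model in which temporary output grows in proportion to the thread count. This is only a rough check: the linear model is an assumption, not documented MetaEuk behaviour, and `linear_estimate_tb` is a made-up helper. Note that the 14-thread figure is a lower bound, since that run stopped when the disk was full.

```python
def linear_estimate_tb(observed_tb, observed_threads, target_threads):
    """Scale an observed peak disk usage linearly with the thread count."""
    return observed_tb / observed_threads * target_threads


# Figures reported in this thread.
peak_10_threads_tb = 2.6  # completed run with 10 threads
disk_full_tb = 5.0        # 14-thread run aborted here, so its true peak is >= 5.0

estimate_14 = linear_estimate_tb(peak_10_threads_tb, 10, 14)
print(f"linear estimate for 14 threads: {estimate_14:.2f} TB")
print(f"observed lower bound at 14 threads: {disk_full_tb:.2f} TB")
print("linear model covers the observation:", estimate_14 >= disk_full_tb)
```

Under this assumption, 14 threads would need about 3.64 TB, yet that run exceeded 5 TB. So the thread count alone may not fully account for the difference between the two runs.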

cheers, Christoph

chrishah, Feb 02 '22 19:02

Thank you very much for the feedback and I apologize for the late reply. I am glad you got it to run. We implemented the logic to limit disk space usage in MMseqs2 (the library MetaEuk uses) and it was quite demanding in terms of the possible scenarios it had to cover. The behavior you describe strongly indicates something is not fully working there. I will open an issue for MMseqs2 and refer to this issue. I hope we can get to this in future versions.

elileka, Feb 07 '22 14:02