[BUG] Gigantic files created by diamond search

Open • genomesandMGEs opened this issue 3 years ago • 1 comment

Short description of the problem

I created a genomes storage consisting of ~2k genomes, and when I try to run anvi-pan-genome, the job exceeds my disk quota on the cluster and fails. It creates a gigantic file, 'diamond-search-results.txt.ununiqued' (~2TB), along with 'diamond-search-results.txt' (~26GB). Is there a way to limit the size of the intermediate files? Here's the command I ran:

anvi-pan-genome -g PSAE2009-GENOMES.db -n PSAE2009 -T 32 --exclude-partial-gene-calls --mcl-inflation 10 -o PSAE2009_pangenome

anvi'o version

Anvi'o .......................................: hope (v7)
Profile database .............................: 35
Contigs database .............................: 20
Pan database .................................: 14
Genome data storage ..........................: 7
Auxiliary data storage .......................: 2
Structure database ...........................: 2
Metabolic modules database ...................: 2
tRNA-seq database ............................: 1

System info

Using WSL, and installed with conda.

Detailed description of the issue

After I discussed this with Meren on the merenlab web page, he proposed either using mmseqs for large pangenome analyses, or adjusting the default parameters so that Diamond does not report every single hit.

genomesandMGEs, Apr 13 '21 09:04

Thank you for reporting this! Notes for anyone who is interested in working on this issue:

  • [ ] I think we should include an --e-value flag in the anvi-pan-genome program to eliminate the vast majority of extremely weak hits prior to the MCL step.
  • [ ] I also think we should investigate mmseqs as an alternative way to identify gene clusters.

meren, Apr 14 '21 21:04
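
For anyone picking this up, here is a rough sketch of what e-value filtering prior to MCL could look like. It is not anvi'o code. It assumes the search results are in the standard BLAST tabular layout (tab-separated, with the e-value in the 11th column), which may not match the columns anvi'o actually requests from Diamond. The function name and the 1e-5 cutoff are made up for illustration.

```python
import io

# Assumption: BLAST tabular format 6, where the 11th column (0-based index 10)
# holds the e-value. Adjust if the real file uses a different column layout.
EVALUE_COLUMN = 10


def filter_hits_by_evalue(lines, max_evalue=1e-5):
    """Yield only the hit lines whose e-value is <= max_evalue.

    Works on any iterable of lines, so a multi-terabyte file can be
    streamed line by line without loading it into memory.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= EVALUE_COLUMN:
            continue  # skip empty or truncated lines
        try:
            evalue = float(fields[EVALUE_COLUMN])
        except ValueError:
            continue  # skip lines where the e-value is not a number
        if evalue <= max_evalue:
            yield line


if __name__ == "__main__":
    # Two made-up hits: one strong, one extremely weak.
    sample = io.StringIO(
        "g1\tg2\t98.5\t300\t4\t0\t1\t300\t1\t300\t1e-150\t550\n"
        "g1\tg3\t31.0\t80\t55\t2\t10\t90\t5\t85\t0.8\t28.1\n"
    )
    for hit in filter_hits_by_evalue(sample, max_evalue=1e-5):
        print(hit, end="")
```

Filtering after the fact only shrinks what goes into MCL. To keep the intermediate file itself small, the cutoff would have to be passed to the search step.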