anvio
[BUG] Gigantic files created by diamond search
Short description of the problem
I created a genomes storage consisting of ~2k genomes, and when I tried to run anvi-pan-genome, the job exceeded my disk quota on the cluster and failed. It created two gigantic files: 'diamond-search-results.txt.ununiqued' (~2TB) and 'diamond-search-results.txt' (~26GB). Is there a way to limit the size of the intermediate files? Here's the command I ran:
anvi-pan-genome -g PSAE2009-GENOMES.db -n PSAE2009 -T 32 --exclude-partial-gene-calls --mcl-inflation 10 -o PSAE2009_pangenome
anvi'o version
Anvi'o .......................................: hope (v7)
Profile database .............................: 35
Contigs database .............................: 20
Pan database .................................: 14
Genome data storage ..........................: 7
Auxiliary data storage .......................: 2
Structure database ...........................: 2
Metabolic modules database ...................: 2
tRNA-seq database ............................: 1
System info
Using WSL, and installed with conda.
Detailed description of the issue
After I discussed this with Meren on the merenlab webpage, two options were proposed: using mmseqs for large pangenome analyses, or adjusting the default parameters so that Diamond does not report every single hit.
Thank you for reporting this! Notes for anyone who is interested in working on this issue:
- [ ] I think we should include an `--e-value` flag in the `anvi-pan-genome` program to eliminate the vast majority of extremely weak hits prior to the MCL step.
- [ ] I also think we should investigate mmseqs as an alternative way to identify gene clusters.
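
For anyone picking this up, here is a minimal sketch of what an e-value cutoff on the search results could look like. It is not anvi'o's implementation: the function name `filter_hits_by_evalue`, the default cutoff of `1e-5`, and the assumption that the e-value sits in the 11th column (as in the default 12-column BLAST tabular format) are all hypothetical and would need to be checked against the columns anvi'o actually requests from Diamond.

```python
# Hypothetical sketch: stream tab-separated search hits and keep only those
# at or below an e-value cutoff. Column layout is an assumption (default
# BLAST tabular format: qseqid sseqid pident length mismatch gapopen
# qstart qend sstart send evalue bitscore).

EVALUE_COLUMN = 10  # 0-based index of the e-value field (assumed)


def filter_hits_by_evalue(lines, max_evalue=1e-5, evalue_column=EVALUE_COLUMN):
    """Yield only the hit lines whose e-value is <= max_evalue.

    `lines` can be any iterable of strings (for example an open file), so
    the whole results file never has to be held in memory.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= evalue_column:
            continue  # skip malformed or truncated lines
        try:
            evalue = float(fields[evalue_column])
        except ValueError:
            continue  # skip lines where the field is not a number
        if evalue <= max_evalue:
            yield line
```

Because it is a generator, it can be applied line by line, e.g. `dst.writelines(filter_hits_by_evalue(src))` with `src` and `dst` being open file handles. Note that filtering after the fact would not stop the ~2TB file from being written in the first place; for that, the cutoff would have to be passed to the search itself.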