eggnog-mapper

Speeding up small-scale analysis?

Open jdwinkler-lanzatech opened this issue 2 years ago • 8 comments

Hi,

Thank you for your hard work on eggNOG; it is an excellent functional annotation tool! I have a question that may be particular to how I am performing my analysis. I am using eggNOG to annotate metagenome-assembled genomes (MAGs) containing a few thousand proteins each. I have enough of these MAGs that the diamond alignment step is a significant bottleneck in our annotation process. I read over the documentation about tweakable parameters, but I was wondering if you could give me any pointers on how to speed up alignment?

We have quite a bit of memory available on our server (512 GB) if that helps. I can try decreasing diamond's sensitivity but I wanted to check before I embarked on a possibly well-trodden optimization process.

I did check out the tips for large scale analysis but I don't think the (good!) suggestions apply in this particular case.

jdwinkler-lanzatech avatar Mar 16 '22 22:03 jdwinkler-lanzatech

Hi @jdwinkler-lanzatech ,

I guess you are running each MAG separately. Unfortunately, we have not yet implemented an option to annotate several input files at once. Depending on your scripting skills, you could try merging all the input fasta files into a single one and running the diamond search just once. You would need to disentangle the results afterwards, though.
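The merge-and-disentangle idea above could be sketched roughly as follows. This is only a minimal illustration, not something taken from eggNOG-mapper itself: the function names, the `|` separator, and the `.annotations.tsv` output suffix are all hypothetical. It assumes each MAG has its own protein FASTA file, that the file stem can serve as the MAG name, and that the results table is tab-separated with the query ID in the first column, `##` lines for run metadata, and a single `#`-prefixed column header.

```python
from collections import defaultdict
from pathlib import Path

SEP = "|"  # assumed separator; it must not occur in any MAG file name


def merge_fastas(fasta_paths, merged_path, sep=SEP):
    """Write all sequences to one FASTA, prefixing each ID with its MAG name."""
    with open(merged_path, "w") as out:
        for path in map(Path, fasta_paths):
            mag = path.stem  # e.g. "mag1" for "mag1.faa"
            with open(path) as handle:
                for line in handle:
                    if not line.endswith("\n"):
                        line += "\n"  # guard against a missing final newline
                    if line.startswith(">"):
                        line = f">{mag}{sep}{line[1:]}"
                    out.write(line)


def split_annotations(annotations_path, out_dir, sep=SEP):
    """Split a merged tab-separated results table into one file per MAG."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    column_header = ""
    rows = defaultdict(list)
    with open(annotations_path) as handle:
        for line in handle:
            if line.startswith("##") or not line.strip():
                continue  # skip run metadata and blank lines
            if line.startswith("#"):
                column_header = line  # assumed: one "#query..." header row
                continue
            fields = line.split("\t", 1)
            mag, found, original_id = fields[0].partition(sep)
            if not found:
                raise ValueError(f"no MAG prefix in query ID: {fields[0]!r}")
            fields[0] = original_id  # restore the original query ID
            rows[mag].append("\t".join(fields))
    for mag, lines in rows.items():
        with open(out_dir / f"{mag}.annotations.tsv", "w") as out:
            out.write(column_header)
            out.writelines(lines)
    return sorted(rows)
```

One would call `merge_fastas` on the per-MAG files, run emapper once on the merged file, and then call `split_annotations` on the resulting annotations table. Prefixing the IDs also avoids collisions when different MAGs reuse the same protein IDs.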

Regarding specific options, it depends on the emapper version. Which version are you using? The options which are more likely to help could be the ones listed here: https://github.com/eggnogdb/eggnog-mapper/wiki/eggNOG-mapper-v2.1.5-to-v2.1.7#diamond-search-options

Most of these options mirror the diamond ones. You may need to check diamond documentation to fully understand them: https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options

You may take a look at --dmnd_algo, --dmnd_iterate, --block_size or --index_chunks.

From your post I understand that you have no problems with the annotation step. Just in case: use the --dbmem option during the annotation step, which loads the annotation database into memory and makes that step faster.
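As a rough illustration of how the options mentioned above might be combined, here is a small sketch that assembles an emapper command line. The helper name `build_emapper_command` is hypothetical, and the example values (`block_size=8.0`, `index_chunks=1`) are placeholders chosen on the assumption that plenty of memory is available; they are not tested recommendations. Please check `emapper.py --help` and the diamond documentation for the values your version accepts.

```python
import shlex


def build_emapper_command(protein_fasta, output_name, output_dir, cpus,
                          dmnd_algo=None, dmnd_iterate=None,
                          block_size=None, index_chunks=None, dbmem=True):
    """Assemble an emapper.py argument list; options left as None are omitted."""
    cmd = [
        "emapper.py",
        "-i", str(protein_fasta),
        "-m", "diamond",
        "--cpu", str(cpus),
        "--output", output_name,
        "--output_dir", str(output_dir),
    ]
    optional = {
        "--dmnd_algo": dmnd_algo,
        "--dmnd_iterate": dmnd_iterate,
        "--block_size": block_size,
        "--index_chunks": index_chunks,
    }
    for flag, value in optional.items():
        if value is not None:
            cmd += [flag, str(value)]
    if dbmem:
        cmd.append("--dbmem")  # load the annotation database into memory
    return cmd


# Placeholder values for illustration only.
cmd = build_emapper_command("merged.faa", "all_mags", "emapper_out", cpus=32,
                            block_size=8.0, index_chunks=1)
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with unusual file names.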

Finally, depending on your infrastructure, you may prefer to run diamond separately (for example https://github.com/bbuchfink/diamond/wiki/6.-Distributed-computing#diamond-distributed-memory-parallel-processing) and then use the diamond results as input to an emapper --resume command.

I hope this is of help.

Best, Carlos

Cantalapiedra avatar Mar 17 '22 10:03 Cantalapiedra

Thanks for the suggestions! I may just throw some EC2 instances at the problem until it is solved essentially, but I'll do some experimentation to figure out how much faster I can get diamond to align.

jdwinkler-lanzatech avatar Mar 17 '22 13:03 jdwinkler-lanzatech

Glad to try to help.

Also, I am assuming that you are using your MAG proteins as input to run diamond in "blastp" mode. If you were instead using large MAG contigs with diamond in blastx mode, it might be much slower (unless, perhaps, you use the diamond frameshift mode...) than using Prodigal+Diamond or MMseqs2 in blastx mode.

Cantalapiedra avatar Mar 17 '22 18:03 Cantalapiedra

Yeah, I'm using prodigal + the emapper wrapper around diamond in protein mode I believe. Looks like the blastp mode is running.

jdwinkler-lanzatech avatar Mar 17 '22 18:03 jdwinkler-lanzatech

Hi @jdwinkler-lanzatech, I am trying to annotate MAGs using EggNOG. Can you tell me the command that you used for annotation? It was quite confusing on the wiki page.

Thanks in Advance!!

saras224 avatar Mar 16 '23 13:03 saras224

Sure:

emapper.py -i {protein_fasta filepath} --cpu {number of cores/threads to use} -m diamond --output {output name} --output_dir {output folder}

Anything in {} you'll need to replace with your desired filepaths on your system, or the number of threads to use.
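If you have many MAGs, the placeholders in the command above can be filled in by a small script. The following is only a sketch under a few assumptions: the protein FASTA files sit in one directory and end in `.faa`, the function names are hypothetical, and a finished run is recognised by an `{output name}.emapper.annotations` file in the output folder (please confirm that naming against your emapper version).

```python
import subprocess
from pathlib import Path


def emapper_commands(faa_dir, out_dir, cpus, pattern="*.faa"):
    """Build one emapper.py command per protein FASTA, skipping finished MAGs."""
    out_dir = Path(out_dir)
    commands = []
    for fasta in sorted(Path(faa_dir).glob(pattern)):
        name = fasta.stem
        # Assumed output naming; used only to skip MAGs that are already done.
        if (out_dir / f"{name}.emapper.annotations").exists():
            continue
        commands.append([
            "emapper.py",
            "-i", str(fasta),
            "--cpu", str(cpus),
            "-m", "diamond",
            "--output", name,
            "--output_dir", str(out_dir),
        ])
    return commands


def run_all(commands):
    """Run the commands one after another, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Calling `emapper_commands(...)` first and inspecting the result is a cheap dry run before handing the list to `run_all`.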

jdwinkler-lanzatech avatar Mar 16 '23 13:03 jdwinkler-lanzatech

Thanks for the prompt response!!! @jdwinkler-lanzatech :) :)

saras224 avatar Mar 16 '23 13:03 saras224

Hi @saras224 ,

Please note that the command shared by @jdwinkler-lanzatech is for proteins as input. If your inputs are MAGs, you should use a different command.
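The thread does not spell out what that different command looks like. As a guess only, a genome-input run might be assembled like this, assuming your emapper version supports `--itype genome` together with `--genepred prodigal` (both are my assumption of what is meant here, and the helper name is hypothetical; please verify with `emapper.py --help`).

```python
def build_genome_command(genome_fasta, output_name, output_dir, cpus):
    """Assemble an emapper.py argument list for nucleotide MAG contigs as input."""
    return [
        "emapper.py",
        "-i", str(genome_fasta),
        "--itype", "genome",       # assumed: input is assembled contigs
        "--genepred", "prodigal",  # assumed: predict genes with Prodigal first
        "-m", "diamond",
        "--cpu", str(cpus),
        "--output", output_name,
        "--output_dir", str(output_dir),
    ]
```

The alternative discussed earlier in the thread is to run Prodigal yourself and pass the predicted proteins to the protein-input command.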

Just my 2 cents.

Best, Carlos

Cantalapiedra avatar Mar 17 '23 10:03 Cantalapiedra