Use Pyrodigal instead of Prodigal for ORF prediction

Open althonos opened this issue 2 years ago • 3 comments

Hi @raphenya !

This PR proposes to replace Prodigal with Pyrodigal for running the ORF prediction stage. Pyrodigal is a Python library binding to Prodigal with additional performance enhancements. I'm the author of Pyrodigal, so ofc this is not a completely neutral list, but there are several advantages over Prodigal that I'll try to list below:

Single-threaded speed

Pyrodigal comes with a SIMD pre-filter to skip score computation for invalid gene pairs. This typically saves around half of the runtime for processing a genome in single mode (and more than that in metagenomic mode) on platforms with supported CPU features (SSE or NEON). I did a small writeup about this in the paper.

I ran some benchmarks on a single closed genome (NC_004129) to compare the runtime (still using BLAST for the downstream analysis):

| Mode | RGI w/ Prodigal | RGI w/ Pyrodigal |
| --- | --- | --- |
| Default | 245s | 205s |
| Low quality | 340s | 272s |
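
To put these timings in relative terms, here is a quick calculation using only the numbers listed above (the helper name `percent_saved` is just for illustration). Keep in mind these are end-to-end RGI runtimes with BLAST included, so the relative saving is smaller than what the ORF prediction step alone would show.

```python
# End-to-end RGI wall-clock times in seconds for NC_004129, copied from the
# benchmark above as (with Prodigal, with Pyrodigal).
timings = {
    "Default": (245, 205),
    "Low quality": (340, 272),
}


def percent_saved(before: float, after: float) -> float:
    """Relative runtime reduction, as a percentage of the original runtime."""
    return 100.0 * (before - after) / before


savings = {mode: percent_saved(*pair) for mode, pair in timings.items()}

for mode, saved in savings.items():
    print(f"{mode}: {saved:.1f}% less wall-clock time")
```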

Multi-threading

Pyrodigal supports re-entrant multithreading, so you can use multi-threaded ORF prediction even when running in single mode, contrary to what the code is currently doing with Prodigal where you only run multi-threaded prediction in --low_quality mode. This improves the runtime even more on fragmented genomes (e.g. 548.SAMN21245456):

| Mode | RGI w/ Prodigal | RGI w/ Pyrodigal |
| --- | --- | --- |
| Default | 231s | 153s |
| Low quality | 241s | 165s |
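
The per-contig fan-out that re-entrancy makes possible can be sketched roughly as follows. This is not the code from the PR: `toy_find_orfs` is a deliberately naive stand-in (forward strand only, ATG to first in-frame stop) so that the example runs with the standard library alone, and `predict_all` is a hypothetical helper name. In real use, the callable passed as `find_genes` would be the `find_genes` method of a Pyrodigal gene finder object, which can be shared between threads.

```python
from concurrent.futures import ThreadPoolExecutor

STOPS = {"TAA", "TAG", "TGA"}


def toy_find_orfs(contig: str, min_len: int = 30):
    """Stand-in predictor (NOT Pyrodigal): forward-strand ATG...stop scan.

    Returns sorted (start, end) pairs, 0-based and end-exclusive.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(contig) - 2, 3):
            codon = contig[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if i + 3 - start >= min_len:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


def predict_all(contigs, find_genes=toy_find_orfs, threads=4):
    """Run `find_genes` on every contig in a thread pool.

    `pool.map` preserves input order, so results[i] belongs to contigs[i].
    This is only safe if `find_genes` is re-entrant.
    """
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(find_genes, contigs))


contigs = [
    "ATG" + "GCA" * 10 + "TAA",
    "CCATGAAATTTGGGCCCAAATTTGGGCCCAAATAGCC",
]
results = predict_all(contigs)
print(results)
```

The pure-Python scanner above will not actually get faster with more threads; it only shows the shape of the call. The benefit on fragmented genomes comes from there being many contigs to distribute, with each one handled by native code.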

Simpler installation

Contrary to Prodigal, Pyrodigal can be pip installed, so it's one less dependency to worry about for people who don't use conda. Otherwise it's also in Bioconda.

Same results

Despite the faster speed, Pyrodigal and Prodigal produce exactly[^1] the same output.

[^1]: Well, almost. During the refactor I found a bug in Prodigal that caused all genes on the reverse strand to be penalized. It was fixed here but Prodigal never got a new release, so unless you recompile the code yourself you're still getting a buggy version. Pyrodigal, by contrast, contains the fix. So the "recompiled/fixed" Prodigal and Pyrodigal predict exactly the same thing (this is tested for), but the buggy Prodigal and Pyrodigal may occasionally diverge.

althonos, Oct 18 '22 15:10

@althonos Thank you, Martin! This looks awesome. I will review the code, but I think the best way is to have the ORF tools (i.e. Prodigal and Pyrodigal) as an option. That way, it will be easy to compare them, and it also keeps things flexible in light of the anticipated Prodigal 3 release.

raphenya, Oct 20 '22 13:10

Fine by me! I updated the code so the ORF finder is selected from the CLI, like for the aligner tool.
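
As a rough illustration of that kind of CLI switch (a sketch only, not the code in this PR): the flag names `--input` and `--orf_finder`, the default value, and the two `run_*` placeholder functions are all assumptions made for the example and may not match RGI's actual options.

```python
import argparse


def run_prodigal(path: str) -> str:
    # Placeholder: a real implementation would invoke the Prodigal binary.
    return f"prodigal:{path}"


def run_pyrodigal(path: str) -> str:
    # Placeholder: a real implementation would call the Pyrodigal library.
    return f"pyrodigal:{path}"


# Map each accepted CLI value to the function implementing it.
ORF_FINDERS = {"PRODIGAL": run_prodigal, "PYRODIGAL": run_pyrodigal}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="orf-finder-sketch")
    parser.add_argument("-i", "--input", required=True, help="input FASTA file")
    parser.add_argument(
        "--orf_finder",
        type=str.upper,  # accept lower-case spellings as well
        choices=sorted(ORF_FINDERS),
        default="PRODIGAL",
        help="ORF prediction tool to use (default: %(default)s)",
    )
    return parser


def main(argv=None) -> str:
    args = build_parser().parse_args(argv)
    return ORF_FINDERS[args.orf_finder](args.input)


print(main(["-i", "genome.fasta"]))
print(main(["-i", "genome.fasta", "--orf_finder", "pyrodigal"]))
```

Keeping both tools behind one option makes side-by-side comparison straightforward, and a further finder could be added by extending the `ORF_FINDERS` mapping.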

althonos, Oct 21 '22 08:10

Please don't merge yet, I'm making some breaking API changes regarding output formatting in Pyrodigal, so I'll update the PR later to use Pyrodigal v2 after it's properly released.

althonos, Oct 22 '22 13:10

Just updated to v2.0, which has been verified to produce exactly the same results as Prodigal.

althonos, Nov 03 '22 11:11

Excited for this!

nickp60, Nov 16 '22 17:11

@althonos Thank you, I will merge away!

raphenya, Dec 07 '22 14:12

Yay, thank you!

althonos, Dec 07 '22 15:12