rgi
Deprecate -split_prodigal_jobs
Analysis of simulated validation data by the CARD team (unpublished) revealed that Prodigal undercalls ORFs when the -split_prodigal_jobs option is used. This was particularly noticeable in an Acinetobacter baumannii genomic context. As this leads to false negatives, please deprecate support for -split_prodigal_jobs in RGI.
Looking at the implementation, this is likely because a single training file is not generated for the whole genome and reused before the subjob ORF calling runs. Each split therefore tunes the ORF-finding model on only the sequence subset it receives, which lowers accuracy.
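To make the suggestion concrete, here is a minimal sketch of a "train once, then split" layout. It only builds the Prodigal command lines and does not run them. It assumes Prodigal's documented `-t` behaviour (write a training file if none exists, otherwise read and use it); the function names, file names, and output layout are hypothetical and are not taken from RGI's code.

```python
from pathlib import Path
from typing import List, Sequence


def training_command(genome_fasta: str, training_file: str) -> List[str]:
    """Build the Prodigal command that trains on the full, unsplit genome.

    Assumption: training_file does not exist yet, so -t makes Prodigal
    write it rather than read it.
    """
    return ["prodigal", "-i", genome_fasta, "-t", training_file]


def split_commands(
    chunk_fastas: Sequence[str], training_file: str, out_dir: str
) -> List[List[str]]:
    """Build one Prodigal command per chunk, all sharing one training file."""
    commands = []
    for chunk in chunk_fastas:
        stem = Path(chunk).stem  # hypothetical naming: outputs follow the chunk name
        commands.append([
            "prodigal",
            "-i", chunk,
            "-t", training_file,            # existing file, so it is read, not retrained
            "-a", f"{out_dir}/{stem}.faa",  # protein translations
            "-o", f"{out_dir}/{stem}.gbk",  # gene coordinates
            "-q",                           # quiet mode
        ])
    return commands
```

Each command list could then be passed to `subprocess.run(cmd, check=True)`, with the training command run first and the per-chunk commands run in parallel afterwards. Whether this fully removes the undercalling would still need to be checked against the validation data.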
A similar issue can occur when running on a large set of related genomes. Low-quality genomes will have even worse ORF calling because the model trained on them will be poorer. Training on all genomes and then using that one training file would maximise accuracy and consistency (or moving to a ggCaller approach!).
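For the multi-genome case, one possible way to get a shared training input is to pool the genomes into a single FASTA before training. The sketch below is an illustration only: `pool_fastas` is a hypothetical helper, the header prefixing scheme is an assumption made to avoid ID collisions, and whether a pooled model is appropriate depends on how closely related the genomes really are.

```python
from pathlib import Path
from typing import Iterable


def pool_fastas(genome_fastas: Iterable[str], pooled_path: str) -> int:
    """Concatenate FASTA files into one pooled file for training.

    Each header is prefixed with its source file stem so that contig IDs
    stay unique across genomes. Returns the number of records written.
    """
    n_records = 0
    with open(pooled_path, "w") as out:
        for fasta in genome_fastas:
            stem = Path(fasta).stem
            with open(fasta) as handle:
                for line in handle:
                    if line.startswith(">"):
                        out.write(f">{stem}|{line[1:].rstrip()}\n")
                        n_records += 1
                    elif line.strip():  # skip blank lines
                        out.write(line.rstrip() + "\n")
    return n_records
```

The pooled file would then be the `-i` input for a single training run, and the resulting training file would be reused for every genome in the set.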
Issue is stale and will be closed in 7 days unless there is new activity
Re-opening to assess whether we should handle training better or deprecate the option.
Issue is stale and will be closed in 7 days unless there is new activity