deprecate -split_prodigal_jobs

Open agmcarthur opened this issue 10 months ago • 4 comments

Analysis of simulated validation data by the CARD team (unpublished) revealed that Prodigal undercalls ORFs when the -split_prodigal_jobs option is used. This was particularly noticeable in an Acinetobacter baumannii genomic context. As this leads to false negatives, please deprecate support for -split_prodigal_jobs in RGI.

agmcarthur avatar Apr 17 '24 20:04 agmcarthur

Looking at the implementation, this is likely a result of not generating a single training file for the whole genome and using it in each subjob's ORF calling. As a result, each split tunes the ORF-finding model on only the sequence subset it receives, which lowers accuracy.
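A rough sketch of what "train once, then split" could look like, assuming Prodigal 2.6.x semantics for `-t` (write a training file if it does not exist, otherwise read and use it). The helper names, the output layout, and the `genome.trn` file name are hypothetical and not taken from RGI's code. Note that Prodigal does not accept a training file together with `-p meta`, so this would only apply to the default single-genome mode.

```python
import subprocess
from pathlib import Path


def train_cmd(genome_fasta, training_file):
    # When -t points at a file that does not exist yet, Prodigal 2.6.x
    # trains on the whole input and writes the training file.
    return ["prodigal", "-q", "-i", str(genome_fasta), "-t", str(training_file)]


def predict_cmd(chunk_fasta, training_file, out_dir):
    # When -t points at an existing file, Prodigal reads the model from it
    # instead of retraining on the (much smaller) chunk.
    stem = Path(chunk_fasta).stem
    out_dir = Path(out_dir)
    return [
        "prodigal", "-q",
        "-i", str(chunk_fasta),
        "-t", str(training_file),
        "-f", "gff",
        "-o", str(out_dir / f"{stem}.gff"),
        "-a", str(out_dir / f"{stem}.faa"),
        "-d", str(out_dir / f"{stem}.fna"),
    ]


def call_orfs_shared_training(genome_fasta, chunk_fastas, out_dir, run=subprocess.run):
    """Train once on the full genome, then call ORFs on every chunk with that model."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    training_file = out_dir / "genome.trn"  # hypothetical file name
    # Only train if the file is missing; with an existing file, the same
    # command would run gene prediction instead of training.
    if not training_file.exists():
        run(train_cmd(genome_fasta, training_file), check=True)
    for chunk in chunk_fastas:
        run(predict_cmd(chunk, training_file, out_dir), check=True)
    return training_file
```

The subjobs in the loop are independent once the training file exists, so they could still be run in parallel. Whether this recovers the missing ORFs would need to be checked against the validation data.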

A similar issue can occur when running on a large set of related genomes. Low-quality genomes will have even worse ORF calling because the model trained on them will be poorer. Training on all genomes and then using that training file would maximise accuracy and consistency (as would moving to a ggCaller approach!)
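For the multi-genome case, one way to get a shared model might be to pool the related genomes into a single FASTA and train on that. A minimal sketch, assuming the genomes are close enough relatives that one model suits all of them; the function name and the `stem|header` naming scheme are made up for illustration:

```python
from pathlib import Path


def pool_genomes(genome_fastas, pooled_fasta):
    """Concatenate genome FASTA files into one file for training.

    Each header is prefixed with the source file stem so that contig
    names stay unique across genomes. Returns the number of records written.
    """
    n_records = 0
    with open(pooled_fasta, "w") as out:
        for path in genome_fastas:
            stem = Path(path).stem
            with open(path) as handle:
                for line in handle:
                    if line.startswith(">"):
                        out.write(f">{stem}|{line[1:].rstrip()}\n")
                        n_records += 1
                    elif line.strip():
                        # Copy sequence lines, skipping blank lines.
                        out.write(line.rstrip() + "\n")
    return n_records
```

The pooled file could then be given to Prodigal once to write a training file (for example `prodigal -i pooled.fna -t pooled.trn`), and each genome run with `-t pooled.trn`. Whether pooling actually helps the low-quality assemblies is untested here.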

fmaguire avatar Apr 17 '24 21:04 fmaguire

Issue is stale and will be closed in 7 days unless there is new activity

github-actions[bot] avatar Jun 17 '24 11:06 github-actions[bot]

Re-opening to assess whether we should handle training better or deprecate.

agmcarthur avatar Jul 04 '24 15:07 agmcarthur

Issue is stale and will be closed in 7 days unless there is new activity

github-actions[bot] avatar Sep 03 '24 11:09 github-actions[bot]