rgi depreciate -split_prodigal

deprecate -split_prodigal_jobs

Open agmcarthur opened this issue 10 months ago • 4 comments

Analysis of simulated validation data by the CARD team (unpublished) revealed that Prodigal undercalls ORFs when the -split_prodigal_jobs option is used. This was particularly noticeable in an Acinetobacter baumannii genomic context. As this leads to false negatives, please deprecate support of -split_prodigal_jobs in RGI.

Apr 17 '24 20:04 agmcarthur

Looking at the implementation, this is likely because a single training file is not generated for the whole genome and reused before the subjob ORF calling runs. Each split therefore tunes the ORF-finding model on only the sequence subset it receives, which lowers accuracy.

A similar issue can occur when running on a large set of related genomes. Low-quality genomes will have even worse ORF calling because the model trained on them will be poorer. Training on all genomes and then using that training file would maximise accuracy and consistency (or moving to a ggcaller approach!).
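
For illustration, a rough sketch of the train-once idea: build one Prodigal training file from the complete genome, then pass that same file to every split job. It assumes Prodigal's documented `-t` behaviour (write a training file if the path does not exist, otherwise read and use it). The function names and file paths are hypothetical, and this is not how RGI currently invokes Prodigal.

```python
from pathlib import Path
import subprocess


def training_command(genome_fasta: Path, training_file: Path) -> list[str]:
    """Build the Prodigal call that trains on the complete genome.

    Assumption: training_file does not exist yet, so -t makes Prodigal
    write its training parameters there instead of reading them.
    """
    return ["prodigal", "-i", str(genome_fasta), "-t", str(training_file)]


def calling_commands(split_fastas, training_file: Path, out_dir: Path) -> list[list[str]]:
    """Build one ORF-calling command per split, all sharing one training file."""
    return [
        [
            "prodigal",
            "-i", str(split),
            "-t", str(training_file),                 # reuse whole-genome training
            "-a", str(out_dir / f"{split.stem}.faa"),  # protein translations
            "-q",                                      # suppress stderr chatter
        ]
        for split in split_fastas
    ]


def run_split_calling(genome_fasta: Path, split_fastas, out_dir: Path) -> None:
    """Train once on the whole genome, then call ORFs on every split."""
    out_dir.mkdir(parents=True, exist_ok=True)
    training_file = out_dir / "genome.trn"  # hypothetical file name
    # Remove any stale file so that -t writes a fresh one rather than reading it.
    training_file.unlink(missing_ok=True)
    subprocess.run(training_command(genome_fasta, training_file), check=True)
    for cmd in calling_commands(split_fastas, training_file, out_dir):
        subprocess.run(cmd, check=True)
```

The split jobs could still be run in parallel; the only change from per-split training is that each job reads the shared training file instead of fitting its own model on a fragment.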

Apr 17 '24 21:04 fmaguire

Issue is stale and will be closed in 7 days unless there is new activity

Jun 17 '24 11:06 github-actions[bot]

Re-opening to assess whether we should handle training better or deprecate the option.

Jul 04 '24 15:07 agmcarthur

Issue is stale and will be closed in 7 days unless there is new activity

Sep 03 '24 11:09 github-actions[bot]
