prokka Missing short genes

Prodigal penalizes predicting genes that are shorter than 250 bp (83 aa). As a result, Prokka is missing a number of short proteins that do exist in a closely related species. Any thoughts on how to deal with this? By checking with prodigal -s I've learned that Prodigal is in fact predicting the genes, but they have a score < 0 and so are discarded.
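
For anyone who wants to list those discarded candidates, here is a rough Python sketch that filters the `prodigal -s` output for short genes with a negative total score. It assumes the score file is tab-separated with begin, end, strand, and total score as the first four columns, and that comment lines start with `#`; check this against your own output, since the layout may differ between Prodigal versions. The function name and the 250 bp default are just for illustration.

```python
def rejected_short_genes(lines, max_len_bp=250):
    """Yield (begin, end, strand, total_score) for short candidates scoring below zero.

    Assumes tab-separated columns: begin, end, strand, total score, ...
    (an assumption about the score file layout, not a documented guarantee).
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip comment lines, the column header, and anything malformed.
        if line.startswith("#") or len(fields) < 4 or not fields[0].isdigit():
            continue
        begin, end, strand = int(fields[0]), int(fields[1]), fields[2]
        total = float(fields[3])
        length_bp = abs(end - begin) + 1
        if length_bp < max_len_bp and total < 0:
            yield begin, end, strand, total
```

Usage would be something like `with open("scores.txt") as fh: hits = list(rejected_short_genes(fh))`, where `scores.txt` is whatever file name was passed to `-s`.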

May 22 '14 00:05 sjackman

Changing the Prodigal short CDS penalty from 250 bp to 100 bp rescues 2 of the missing 9 short genes.

May 22 '14 17:05 sjackman

Using --meta mode (-m anon in Prodigal 2.7 from GitHub) saves 2 genes, and combining -m anon with a short CDS penalty of 100 bp saves 6 of the 9 short genes. Progress.

May 22 '14 18:05 sjackman

See

https://github.com/sjackman/Prodigal/tree/short-cds-penalty
https://github.com/sjackman/Prodigal/commit/8612e36b456f71e5baa8a2b873b2b9dcb59add7e

May 22 '14 18:05 sjackman

@sjackman I am thinking of adding a special database of well-known small proteins. In Staph, for example, there is a 6 aa "toxin" gene (!) which never gets found. Using a stricter glocal alignment (e.g. glsearch36_t) might make this work.

I've heard that there may exist databases of these things. This might be a start: http://compbio.cs.toronto.edu/psmdb/desc.html

If not, maybe we could infer one from records in Genbank?
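
To give an idea of where candidates for such a search could come from, here is a minimal Python sketch that enumerates short ORFs on both strands so they could be aligned against a small-protein database. This is only an illustration, not what Prokka or Prodigal does: it assumes ATG as the only start codon (bacteria also use GTG and TTG), reports only the first ATG before each stop, and the function names and length limits are made up for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def short_orfs(seq, min_codons=6, max_codons=83):
    """Yield (strand, start, end) for ATG-to-stop ORFs within the length limits.

    Coordinates are 0-based, end-exclusive, and relative to the scanned strand
    (i.e. to the reverse complement for strand "-"). Length is counted in
    codons excluding the stop codon.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start since the last stop
                elif start is not None and codon in STOP_CODONS:
                    n_codons = (i - start) // 3
                    if min_codons <= n_codons <= max_codons:
                        yield strand, start, i + 3
                    start = None
```

The default `min_codons=6` is chosen so that a 6 aa peptide like the one mentioned above would still be reported; the translated candidates would then go to the alignment step.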

Nov 13 '14 04:11 tseemann

@sjackman I just went and looked at the bacterial entries in Swiss-Prot, restricted to confirmed, non-fragment proteins, and there are about 4500 of them under 200 aa long, of which about 1000 are under 100 aa. I'm guessing Prodigal misses a lot of these. I may have to do something about this within Prokka.
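
Counts like these can be reproduced from any protein FASTA export with a few lines of Python. The sketch below is a plain FASTA length tally; the function names are made up for the example, and filtering for bacterial, confirmed, non-fragment entries is assumed to have been done when exporting the file.

```python
def read_fasta_lengths(lines):
    """Return {header: sequence length} for FASTA records in an iterable of lines."""
    lengths = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            lengths[header] = 0
        elif header is not None:
            # Sequence may be wrapped over several lines; add them up.
            lengths[header] += len(line)
    return lengths


def count_shorter_than(lengths, cutoff_aa):
    """Count records whose sequence is strictly shorter than cutoff_aa residues."""
    return sum(1 for n in lengths.values() if n < cutoff_aa)
```

With a file handle `fh` open on the FASTA export, `lengths = read_fasta_lengths(fh)` followed by `count_shorter_than(lengths, 200)` gives the number of proteins under 200 aa.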

Feb 14 '15 02:02 tseemann

Dear @sjackman and @tseemann,

Has there been any update on handling issues related to missing short genes? I’m particularly interested in any recent changes or plans to address this.

Aug 13 '24 01:08 ryu1013