Missing short genes
Prodigal penalizes predicting genes that are shorter than 250 bp (83 aa). As a result, Prokka is missing a number of short proteins that do exist in a closely related species. Any thoughts on how to deal with this? By checking with prodigal -s I've learned that Prodigal is in fact predicting the genes, but they have a score < 0 and so are discarded.
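For anyone wanting to repeat the check, here is a minimal sketch of how the score file written by prodigal -s could be filtered for short candidates with a negative total score. It assumes the file is tab-separated, that comment lines start with "#", and that a header row names the columns "Beg", "End", "Std" and "Total". That layout is an assumption about the output format and should be checked against your Prodigal version. The function name and the 250 bp default are illustrative only.

```python
def negative_short_candidates(lines, max_len_bp=250):
    """Yield (begin, end, strand, total) for short candidates scored below zero.

    Assumes a tab-separated score table with '#' comment lines and a header
    row containing the columns Beg, End, Std and Total (an assumption about
    the `prodigal -s` output layout, not a documented guarantee).
    """
    columns = None
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and per-sequence comment lines
        fields = line.split("\t")
        if columns is None:
            columns = fields  # first non-comment row is taken as the header
            continue
        row = dict(zip(columns, fields))
        begin, end = int(row["Beg"]), int(row["End"])
        length_bp = abs(end - begin) + 1
        total = float(row["Total"])
        if length_bp < max_len_bp and total < 0:
            yield begin, end, row["Std"], total
```

The function accepts any iterable of lines, so an open file handle for the score file can be passed in directly.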
Changing the Prodigal short CDS penalty from 250 bp to 100 bp rescues 2 of the missing 9 short genes.
Using --meta mode (-m anon in Prokka 2.7 from GitHub) saves 2 genes, and the combo of -m anon and reducing the short CDS penalty to 100 bp saves 6 of 9 short genes. Progress.
See
- https://github.com/sjackman/Prodigal/tree/short-cds-penalty
- https://github.com/sjackman/Prodigal/commit/8612e36b456f71e5baa8a2b873b2b9dcb59add7e
@sjackman I am thinking of adding a special database of well-known small proteins. In Staph, for example, there is a 6 aa "toxin" gene (!) which never gets found. Using a stricter glocal alignment (e.g. glsearch36_t) might make this workable.
I've heard that there may exist databases of these things. This might be a start: http://compbio.cs.toronto.edu/psmdb/desc.html
If not, maybe we could infer one from records in Genbank?
@sjackman I just went and looked at the non-fragment, confirmed bacterial proteins in Swiss-Prot: there are about 4500 of them under 200 aa long, of which about 1000 are under 100 bp. I'm guessing Prodigal misses a lot of these. I may have to do something about this within Prokka.
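A count like this can be reproduced roughly with a few lines of Python over a protein FASTA export. This is only a sketch: it counts records below each length cutoff and does not apply the "non-fragment" or "confirmed" filters, which would have to be done when exporting the sequences. The function names are hypothetical, and the cutoffs are in amino acids, so a cutoff given in bp would need dividing by three first.

```python
def read_fasta_lengths(lines):
    """Yield (header, length) for each record in an iterable of FASTA lines."""
    header, length = None, 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, length
            header, length = line[1:], 0
        elif header is not None:
            length += len(line)  # sequence may be wrapped over several lines
    if header is not None:
        yield header, length


def count_short(lines, thresholds_aa=(200, 100)):
    """Count records strictly shorter than each cutoff (in amino acids)."""
    counts = {t: 0 for t in thresholds_aa}
    for _, length in read_fasta_lengths(lines):
        for t in thresholds_aa:
            if length < t:
                counts[t] += 1
    return counts
```

Passing an open file handle for the FASTA file works, since the functions only iterate over lines.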
Dear @sjackman and @tseemann,
Has there been any update on handling issues related to missing short genes? I’m particularly interested in any recent changes or plans to address this.