Missing short genes

Open sjackman opened this issue 11 years ago • 6 comments

Prodigal penalizes predicting genes that are shorter than 250 bp (83 aa). As a result, Prokka is missing a number of short proteins that do exist in a closely related species. Any thoughts on how to deal with this? By checking with prodigal -s I've learned that Prodigal is in fact predicting the genes, but they have a score < 0 and so are discarded.
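As a rough illustration of that check, here is a minimal Python sketch that scans the file written by `prodigal -s` for short candidates with a negative total score. It assumes the file is whitespace-separated with a header row naming at least the columns `Beg`, `End`, `Std` and `Total`; that layout is an assumption worth checking against your Prodigal version. The function name `short_negative_candidates` is made up for this example.

```python
from typing import Iterable, List, NamedTuple


class Candidate(NamedTuple):
    begin: int
    end: int
    strand: str
    total_score: float


def short_negative_candidates(
    lines: Iterable[str], max_len_bp: int = 250
) -> List[Candidate]:
    """Return candidates shorter than max_len_bp whose total score is < 0.

    Assumed layout (not verified for every Prodigal release): '#' comment
    lines, then a header row starting with 'Beg', then one row per
    candidate start site.
    """
    columns = None
    hits: List[Candidate] = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and per-sequence comment lines
        fields = line.split()
        if fields[0] == "Beg":
            # Header row: remember where each named column sits.
            columns = {name: i for i, name in enumerate(fields)}
            continue
        if columns is None:
            continue  # data before any header cannot be interpreted
        begin = int(fields[columns["Beg"]])
        end = int(fields[columns["End"]])
        score = float(fields[columns["Total"]])
        length_bp = abs(end - begin) + 1
        if length_bp < max_len_bp and score < 0:
            hits.append(Candidate(begin, end, fields[columns["Std"]], score))
    return hits
```

Usage would look something like `short_negative_candidates(open("starts.txt"))`, where `starts.txt` is a hypothetical name for the `-s` output. The file lists potential start sites, so several rows may share the same stop codon, and coordinates from different contigs are not distinguished here.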

sjackman May 22 '14 00:05

Changing the Prodigal short CDS penalty from 250 bp to 100 bp rescues 2 of the missing 9 short genes.

sjackman May 22 '14 17:05

Using --meta mode (-m anon in Prokka 2.7 from GitHub) saves 2 genes, and the combo of -m anon and reducing the short CDS penalty to 100 bp saves 6 of 9 short genes. Progress.

sjackman May 22 '14 18:05

See

  • https://github.com/sjackman/Prodigal/tree/short-cds-penalty
  • https://github.com/sjackman/Prodigal/commit/8612e36b456f71e5baa8a2b873b2b9dcb59add7e

sjackman May 22 '14 18:05

@sjackman I am thinking of adding a special database of well-known small proteins. In Staph, for example, there is a 6 aa "toxin" gene (!) which never gets found. Searching against such a database with a stricter glocal alignment (e.g. glsearch36_t) might make sense.

I've heard that there may exist databases of these things. This might be a start: http://compbio.cs.toronto.edu/psmdb/desc.html

If not, maybe we could infer one from records in Genbank?

tseemann Nov 13 '14 04:11

@sjackman I just looked at the non-fragment, confirmed bacterial proteins in Swiss-Prot: there are about 4500 of them under 200 aa long, of which about 1000 are under 100bp. I'm guessing Prodigal misses a lot of these. I may have to do something about this within Prokka.
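For anyone wanting to repeat a count like this on their own protein set, a minimal Python sketch follows. It assumes a plain FASTA file of protein sequences that has already been filtered (for example to non-fragment bacterial entries); the filtering itself is not shown. Thresholds are taken as amino-acid residues, and the helper names `read_fasta` and `count_short_proteins` are made up for this example.

```python
from typing import Dict, Iterable, Iterator, Sequence, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def count_short_proteins(
    lines: Iterable[str], thresholds: Sequence[int] = (200, 100)
) -> Dict[int, int]:
    """Count sequences strictly shorter than each threshold (in residues)."""
    counts = {t: 0 for t in thresholds}
    for _header, seq in read_fasta(lines):
        for t in thresholds:
            if len(seq) < t:
                counts[t] += 1
    return counts
```

Calling `count_short_proteins(open("proteins.faa"))`, with `proteins.faa` as a hypothetical file name, returns a dict mapping each threshold to the number of sequences below it.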

tseemann Feb 14 '15 02:02

Dear @sjackman and @tseemann,

Has there been any update on handling issues related to missing short genes? I’m particularly interested in any recent changes or plans to address this.

ryu1013 Aug 13 '24 01:08