
Add information about plasmid annotation

Open tseemann opened this issue 6 years ago • 9 comments

For plasmids you will not get good results if you just use the default settings.

I would recommend getting GenBank files (.gbk or .gb) of all the plasmids that are similar to yours.

Say you get three of them: p1.gbk, p2.gbk and p3.gbk. Then make a single GenBank file:

cat p1.gbk p2.gbk p3.gbk > plasmids.gbk

Then run Prokka with:

--proteins plasmids.gbk

That will give much better names for the proteins in your plasmid.
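The steps above can be sketched as two small shell helpers. This is only an illustration: the function names, the file names in the usage comment, and the --outdir/--prefix values are placeholders, and the record count is just a sanity check that relies on every GenBank record starting with a LOCUS line.

```shell
# Merge several GenBank files into one file for Prokka's --proteins option.
# Usage: merge_gbk OUT.gbk IN1.gbk [IN2.gbk ...]
merge_gbk() {
    out=$1
    shift
    cat "$@" > "$out"
}

# Count the records in a GenBank file (each record begins with a LOCUS line).
# Usage: count_gbk_records FILE.gbk
count_gbk_records() {
    grep -c '^LOCUS' "$1"
}

# Example (all file and directory names are placeholders):
#   merge_gbk plasmids.gbk p1.gbk p2.gbk p3.gbk
#   count_gbk_records plasmids.gbk
#   prokka --proteins plasmids.gbk --outdir annotation --prefix my_plasmid my_plasmid.fasta
```

If the record count is lower than the number of plasmids you downloaded, one of the input files is probably empty or truncated.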

The next version of Prokka will have a proper plasmid database included.

tseemann avatar Jul 06 '18 22:07 tseemann

Great suggestion! I was able to get improved results after using the genbank files.

I used PLSDB to get the plasmid sequences of interest. PLSDB information might be useful for others.

sagarutturkar avatar Jun 18 '19 02:06 sagarutturkar

@sagarutturkar thanks for the tip about PLSDB!

tseemann avatar Jun 18 '19 09:06 tseemann

Turns out there are 1.1 million unique proteins in all RefSeq plasmids. Clustered down to about 250,000. That's way bigger than the 22,000 core chromosomal DB I am using!

tseemann avatar Oct 11 '19 07:10 tseemann

How big a fraction of these still has no known function?

Kirk3gaard avatar Oct 11 '19 09:10 Kirk3gaard

That's after I excluded hypotheticals. BUT It turns out that those stats are all wrong, and include lots of chromosomes. WHY? https://ftp.ncbi.nlm.nih.gov/refseq/release/plasmid/ has all the CDS of chromosomes in it too WTF!
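A rough way to check a nucleotide FASTA download for this kind of contamination is to tally the header lines. The sketch below assumes that genuine plasmid records mention "plasmid" somewhere in their header, which is a heuristic and not a guarantee; the function name is made up for illustration.

```shell
# Tally FASTA headers that do and do not mention "plasmid" (case-insensitive).
# Assumption: real plasmid records say "plasmid" in the header line.
# Usage: tally_plasmid_headers FILE.fna
tally_plasmid_headers() {
    awk '
        /^>/ {
            if (tolower($0) ~ /plasmid/) yes++
            else no++
        }
        END { printf "plasmid=%d other=%d\n", yes, no }
    ' "$1"
}
```

A large "other" count suggests the file contains more than plasmid sequences and needs filtering before it is used as a reference.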

tseemann avatar Oct 12 '19 06:10 tseemann

Hi @tseemann, Perhaps this would help as well? A Curated, Comprehensive Database of Plasmid Sequences

ABSTRACT Plasmid sequences are central to a myriad of microbial functions and processes. Here, we have compiled a database of complete plasmid sequences and associated metadata curated from both NCBI’s recent genome database update, which includes plasmids as organisms, and all available annotated bacterial genomes. The resultant database contains 10,892 complete plasmid sequences and associated metadata.

edfadeev avatar Nov 19 '19 14:11 edfadeev

I need a database of non-redundant plasmid-specific proteins and corresponding /gene, /EC_number (and /COG if possible)
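For anyone assembling such a set by hand: as far as I know, Prokka's --proteins option also accepts a protein FASTA file whose description line encodes these fields as EC_number~~~gene~~~product~~~COG. Below is a minimal sketch that converts a tab-separated table into that layout. The column order and the function name are assumptions made for this example, so check the header format against the Prokka README before relying on it.

```shell
# Convert a tab-separated table into a Prokka-style annotated protein FASTA.
# Assumed columns: id, EC_number, gene, product, COG, protein sequence.
# Usage: tsv_to_prokka_fasta TABLE.tsv > proteins.faa
tsv_to_prokka_fasta() {
    awk -F '\t' '{
        printf ">%s %s~~~%s~~~%s~~~%s\n%s\n", $1, $2, $3, $4, $5, $6
    }' "$1"
}
```

Empty columns simply produce empty fields between the ~~~ separators.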

tseemann avatar Nov 24 '19 01:11 tseemann

Hi @tseemann,

I am trying to minimize the number of hypothetical proteins in my genomes. Some genomes are complete (all replicons are closed) while others are not.

  1. For closed genomes, I use --proteins with .gbk files of either chromosomes or plasmids depending on what I am annotating (so each separately)
  2. For draft genomes, I sometimes have a few closed replicons that are clearly plasmids (so I use --proteins with plasmid .gbk) but at other times I do not. How would you advise I proceed? I also used --prodigaltf [trained using prodigal -t on a closed genome]
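For readers following along, the training step mentioned in point 2 could look roughly like the sketch below. It is a dry run by default: the commands are only printed unless RUN=1 is set. The helper names, the training file name prodigal.trn, and the output directory and prefix are placeholders chosen for this example.

```shell
# Print the command by default; execute it only when RUN=1 is set.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Usage: annotate_with_training CLOSED_GENOME.fna DRAFT.fna TRUSTED.gbk
annotate_with_training() {
    closed=$1
    draft=$2
    trusted=$3
    # Step 1: train Prodigal on the closed genome; -t writes the training
    # file if it does not exist yet.
    run prodigal -i "$closed" -t prodigal.trn
    # Step 2: annotate the draft using that training file and the trusted proteins.
    run prokka --prodigaltf prodigal.trn --proteins "$trusted" \
        --outdir draft_annotation --prefix draft "$draft"
}
```

Calling annotate_with_training closed.fna draft.fna plasmids.gbk prints both commands so they can be reviewed before running them with RUN=1.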

Could you please weigh in on the approach I am taking? I also appreciate any advice that may help! Thanks and cheers, Kat

katdotfasta avatar Dec 16 '19 16:12 katdotfasta

> That's after I excluded hypotheticals. BUT It turns out that those stats are all wrong, and include lots of chromosomes. WHY? https://ftp.ncbi.nlm.nih.gov/refseq/release/plasmid/ has all the CDS of chromosomes in it too WTF!

Hi @tseemann, if I remember my classes correctly, plasmids are partly made up of bacterial genes. Could it be that? I just annotated my two natural plasmids using bacterial settings and it returned a number of ORFs, among which were known bacterial genes. What may be missing are replication regions and other regulatory elements, but at least the ORFs are there, right?

splaisan avatar Dec 08 '21 11:12 splaisan