GenBank file: Contig name and length collision

Open VGalata opened this issue 8 years ago • 11 comments

Hey!

I have a problem with the GenBank files created by Prokka (v. 1.11) - the contig name and its length are not separated by a white space. I assume the reason is that I increased $MAXCONTIGIDLEN to 40 because I wanted to use the sample IDs (e.g. C3830-198) as prefix and locus tags. And the longer contig names seem to be problematic: If the string (contig ID + its length) has 29 characters or more then no white space is added between the ID and the length string. Example:

LOCUS C3830-198_contig000001443776 bp DNA linear 08-FEB-2016
LOCUS C3830-199_contig000001 65910 bp DNA linear 11-FEB-2016
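A minimal sketch for spotting such lines in existing output, assuming only that a well-formed LOCUS line carries the length as its own whitespace-separated token before "bp". The function name `locus_is_collided` is hypothetical and not part of Prokka:

```python
import re

# A well-formed LOCUS line has the sequence length as its own
# whitespace-separated token, followed by the unit "bp".
LOCUS_OK = re.compile(r"^LOCUS\s+(\S+)\s+(\d+)\s+bp\b")


def locus_is_collided(line: str) -> bool:
    """Return True if a LOCUS line has no separate length token before 'bp'."""
    if not line.startswith("LOCUS"):
        raise ValueError("not a LOCUS line")
    return LOCUS_OK.match(line) is None


# The two example lines from this report.
lines = [
    "LOCUS C3830-198_contig000001443776 bp DNA linear 08-FEB-2016",
    "LOCUS C3830-199_contig000001 65910 bp DNA linear 11-FEB-2016",
]
flags = [locus_is_collided(line) for line in lines]
print(flags)
```

The first line is flagged because the ID and the length have run together; the second is not.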

Used version: v. 1.11
Additional settings: set $MAXCONTIGIDLEN to 40 in the Prokka executable.
Command:

prokka --force --outdir <odir> --prefix <ID> --locustag <ID> --centre <centre> --gram neg --mincontiglen 200 --cpus 10 <fasta file>

VGalata • Feb 17 '16 12:02

Duplicate of, or closely related to, #135?

See also past issues like #32, #76, #113

peterjc • Feb 26 '16 15:02

Yes, it is related to issue #32. Sorry for the duplicate post; somehow I did not find it when I searched for questions about the same issue.

So there is no workaround if I want to use my IDs as prefix and locus tags, right? Setting $MAXCONTIGIDLEN to an appropriate number solves the issue described in #135 by making Prokka use the supplied IDs, but the GenBank files may then contain wrongly formatted LOCUS lines.

VGalata • Feb 29 '16 14:02

If you don't want Prokka to rename your contig SeqIDs, don't set --compliant and --centre; see issue #141. However, if your SeqIDs are too long you cannot get "correct" GenBank files, as you described. Prokka relies on tbl2asn to create the GenBank flatfiles, and this tool is very strict. See especially #76, mentioned by @peterjc.

aleimba • Feb 29 '16 14:02

I'm getting "Contig ID must <= 20 chars lon" for contig names generated by SPAdes. The advice in various blog posts was to use --centre. This changed the names of the contigs, but I'm still getting the same error. As it stands, Prokka is not directly usable with SPAdes output. Any advice, or shall I just move to RAST?

nikolay12 • Jun 07 '16 16:06

@nikolay12 You could try renaming the contigs in your SPAdes assembly to something short like c00001, c00002, ... with the original name in the FASTA description, and give that to Prokka?
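A rough sketch of that renaming step, assuming a plain FASTA file read line by line. The function name `rename_fasta`, the `c` prefix, and the five-digit width are illustrative choices, and the second header in the example is made up:

```python
from typing import Iterable, Iterator


def rename_fasta(lines: Iterable[str], prefix: str = "c", width: int = 5) -> Iterator[str]:
    """Yield FASTA lines with short sequential IDs.

    The original header is kept as the description, so the old
    contig name can still be recovered from the renamed file.
    """
    count = 0
    for line in lines:
        if line.startswith(">"):
            count += 1
            old_header = line[1:].strip()
            yield f">{prefix}{count:0{width}d} {old_header}"
        else:
            yield line.rstrip("\n")


# Illustrative input with SPAdes-style headers.
example = [
    ">NODE_1_length_283141_cov_27.6228",
    "ACGT",
    ">NODE_2_length_1000_cov_3.1",
    "GGCC",
]
renamed = list(rename_fasta(example))
print("\n".join(renamed))
```

In practice you would pass an open file handle instead of the list and write the yielded lines to a new FASTA file before running Prokka on it.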

peterjc • Jun 07 '16 16:06

If you use the latest GitHub HEAD, I have changed some code to make the contig names shorter. It might help when you use --compliant mode.

tseemann • Jun 13 '16 09:06

Yes, https://github.com/tseemann/prokka/commit/92940bcd299dea710a17f2954045ea0eada9121c ought to help - thanks!

peterjc • Jun 13 '16 09:06

This is a serious issue in my opinion. Users of SnapGene and SnapGene Viewer are accustomed to opening GenBank files, but when the LOCUS line looks like this:

LOCUS NODE_1_length_283141_cov_27.6228283141 bp DNA linear

the importer doesn't work. The GenBank standard stipulates that “users parse the LOCUS line based on whitespace-separated tokens”. Prokka is not compliant.

Is there a way to force a whitespace before the sequence length?
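One possible post-processing workaround, sketched under the assumption that the true sequence length is known from elsewhere (for example from the input FASTA). This is not a Prokka or tbl2asn option, and the helper name `split_locus` is hypothetical:

```python
def split_locus(line: str, seq_length: int) -> str:
    """Re-insert the missing space between contig ID and length.

    Tokens are re-joined with single spaces, so any fixed-column
    alignment of the original LOCUS line is not preserved.
    """
    tokens = line.split()
    if len(tokens) < 3 or tokens[0] != "LOCUS":
        raise ValueError("not a LOCUS line")
    fused = tokens[1]
    length = str(seq_length)
    # Only split when "bp" directly follows the fused token and the
    # token really ends with the known length.
    if tokens[2] == "bp" and fused.endswith(length) and len(fused) > len(length):
        tokens[1:2] = [fused[: -len(length)], length]
    return " ".join(tokens)


broken = "LOCUS NODE_1_length_283141_cov_27.6228283141 bp DNA linear"
fixed = split_locus(broken, 283141)
print(fixed)
```

The length has to be supplied separately because the fused token alone does not say where the ID ends and the length begins.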

bsglicker • Jan 24 '19 14:01

The GenBank changes to move away from the strict column-based LOCUS line to whitespace separation are quite recent.

I wonder if the NCBI have updated tbl2asn to handle this now, in which case Prokka just needs to ensure that that tool is up to date?

peterjc • Jan 24 '19 14:01

@bsglicker thank you very much, your message helped me.

valery-shap • Jun 01 '20 21:06

@tseemann Hi! We are running prokka on some prokaryote assemblies from INSDC. The contig names (submitted by the users) are fairly long and we have had to use --compliant --centre X to overcome the failures due to name lengths. I have a couple of questions:

  1. Does prokka maintain a mapping of the contig names that it renamed? I can't seem to find such a file but it's possible I've missed it. We really do need to revert to the original contig names in the GFF files. We do not need the Genbank files at all.
  2. I've read above that switching to the latest version of tbl2asn may get rid of this problem. Are there plans for prokka to do that?
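
For the first question, a sketch of how the original names could be put back into a GFF3 file, assuming you have built your own mapping from renamed to original contig IDs (this thread does not confirm that Prokka writes one). The function name `restore_gff_ids`, the ID `contig_1`, and the feature line are all hypothetical:

```python
from typing import Dict, Iterable, Iterator


def restore_gff_ids(gff_lines: Iterable[str], new_to_old: Dict[str, str]) -> Iterator[str]:
    """Yield GFF3 lines with sequence IDs mapped back to the original names.

    Handles ##sequence-region pragmas, feature lines (column 1), and
    FASTA headers after a ##FASTA pragma. Unknown IDs are left unchanged.
    """
    in_fasta = False
    for raw in gff_lines:
        line = raw.rstrip("\n")
        if line.startswith("##FASTA"):
            in_fasta = True
            yield line
        elif in_fasta:
            if line.startswith(">"):
                parts = line[1:].split(None, 1)
                if parts:
                    parts[0] = new_to_old.get(parts[0], parts[0])
                yield ">" + " ".join(parts)
            else:
                yield line
        elif line.startswith("##sequence-region"):
            parts = line.split()
            if len(parts) >= 2:
                parts[1] = new_to_old.get(parts[1], parts[1])
            yield " ".join(parts)
        elif line.startswith("#") or not line:
            yield line
        else:
            cols = line.split("\t")
            cols[0] = new_to_old.get(cols[0], cols[0])
            yield "\t".join(cols)


# Hypothetical mapping and a tiny illustrative GFF3 file.
mapping = {"contig_1": "NODE_1_length_283141_cov_27.6228"}
gff = [
    "##gff-version 3",
    "##sequence-region contig_1 1 283141",
    "contig_1\texample\tCDS\t1\t300\t.\t+\t0\tID=demo_00001",
    "##FASTA",
    ">contig_1",
    "ACGT",
]
restored = list(restore_gff_ids(gff, mapping))
print("\n".join(restored))
```

Only the sequence ID fields are touched; attributes such as locus tags in column 9 are passed through as they are.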

nds • Jun 20 '22 11:06