prokka
GenBank file: Contig name and length collision
Hey!
I have a problem with the GenBank files created by Prokka (v. 1.11) - the contig name and its length are not separated by a white space. I assume the reason is that I increased $MAXCONTIGIDLEN to 40 because I wanted to use the sample IDs (e.g. C3830-198) as prefix and locus tags. And the longer contig names seem to be problematic: If the string (contig ID + its length) has 29 characters or more then no white space is added between the ID and the length string. Example:
LOCUS C3830-198_contig000001443776 bp DNA linear 08-FEB-2016
LOCUS C3830-199_contig000001 65910 bp DNA linear 11-FEB-2016
Used version: v. 1.11
Additional settings: Set $MAXCONTIGIDLEN to 40 in the Prokka executable.
CMD:
prokka --force --outdir <odir> --prefix <ID> --locustag <ID> --centre <centre> --gram neg --mincontiglen 200 --cpus 10 <fasta file>
Duplicate of, or closely related to, #135?
See also past issues like #32, #76, #113
Yes, it is related to issue #32. Sorry for the duplicate post, somehow I did not find it when I searched for questions about the same issue.
So there is no workaround if I want to use my IDs as prefix and locus tags, right? Setting $MAXCONTIGIDLEN to an appropriate number solves the issue described in #135 by making Prokka use the supplied IDs, but the GenBank files may then contain wrongly formatted LOCUS lines.
If you don't want Prokka to rename your contig SeqIDs, don't set --compliant and --centre, see issue #141.
However, if your SeqIDs are too long you cannot get "correct" Genbank files as you described. Prokka relies on tbl2asn to create the Genbank flatfiles and this tool is very strict. See esp. #76 mentioned by @peterjc.
I'm getting "Contig ID must <= 20 chars lon" for contig names generated by spades. The advice on various blog posts was to use --centre. This has changed the name of the contigs but I'm still getting the same error. As it happens, Prokka is not directly usable with spades. Any advice or shall I just move to RAST?
@nikolay12 You could try renaming the contigs in your SPAdes assembly to something short like c00001, c00002, ... with the original name in the FASTA description, and give that to Prokka?
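The renaming suggested above can be sketched in a few lines of Python. This is only an illustration, not part of Prokka: the function name `rename_contigs`, the `c` prefix, and the five-digit width are assumptions chosen to match the `c00001` example, and the second contig name in the demo is made up.

```python
import io


def rename_contigs(fasta_in, fasta_out, prefix="c", width=5):
    """Rewrite FASTA headers to short IDs and return {new_id: original_id}.

    The full original header is kept as the FASTA description, so the
    long name is still visible in the renamed file.
    """
    mapping = {}
    count = 0
    for line in fasta_in:
        if line.startswith(">"):
            count += 1
            original = line[1:].strip()
            new_id = f"{prefix}{count:0{width}d}"
            # The first whitespace-delimited token is the original SeqID.
            mapping[new_id] = original.split()[0] if original else ""
            fasta_out.write(f">{new_id} {original}".rstrip() + "\n")
        else:
            # Sequence lines are copied through unchanged.
            fasta_out.write(line)
    return mapping


# Demo on an in-memory FASTA; the second header is a hypothetical example.
src = io.StringIO(
    ">NODE_1_length_283141_cov_27.6228\nACGT\n"
    ">NODE_2_length_1000_cov_3.1\nGGCC\n"
)
dst = io.StringIO()
mapping = rename_contigs(src, dst)
print(dst.getvalue())
print(mapping)
```

Keeping the returned mapping (for example, written to a small TSV file) makes it possible to translate the short IDs back to the assembler's names later.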
If you use the latest github HEAD I have changed some code to make smaller contig names. It might help when you use --compliant mode.
Yes, https://github.com/tseemann/prokka/commit/92940bcd299dea710a17f2954045ea0eada9121c ought to help - thanks!
This is a serious issue in my opinion. Users of SnapGene and SnapGene Viewer are accustomed to opening GenBank files, but when the LOCUS line looks like this:
LOCUS NODE_1_length_283141_cov_27.6228283141 bp DNA linear
the importer doesn't work. The GenBank standard stipulates that “users parse the LOCUS line based on whitespace-separated tokens”. Prokka is not compliant.
Is there a way to force a whitespace before the sequence length?
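One possible workaround, outside of Prokka itself, is to post-process the GenBank file and re-insert the missing space. The sketch below assumes you know the contig IDs (for example, from the input FASTA). The helper `fix_locus_line` is hypothetical, not a Prokka or tbl2asn feature. It restores whitespace separation only and does not restore fixed-column alignment.

```python
import re

# "LOCUS", then the (possibly fused) name token, then the rest of the line.
LOCUS_RE = re.compile(r"^(LOCUS\s+)(\S+)(\s.*)$", re.DOTALL)


def fix_locus_line(line, contig_ids):
    """Insert a space between contig ID and length on a fused LOCUS line.

    Lines that are not LOCUS lines, or whose name token is already a
    known contig ID, are returned unchanged.
    """
    match = LOCUS_RE.match(line)
    if not match:
        return line
    head, token, rest = match.groups()
    if token in contig_ids:
        return line  # name and length are already separated
    # Try longer IDs first. Caveat: if one ID is a prefix of another and
    # both end in digits, the split can be ambiguous; checking the length
    # against the actual sequence would be more robust.
    for cid in sorted(contig_ids, key=len, reverse=True):
        tail = token[len(cid):]
        if token.startswith(cid) and tail.isdigit():
            return f"{head}{cid} {tail}{rest}"
    return line


ids = {"NODE_1_length_283141_cov_27.6228"}
broken = "LOCUS NODE_1_length_283141_cov_27.6228283141 bp DNA linear\n"
print(fix_locus_line(broken, ids), end="")
```

Applied to every line of a .gbk file, this leaves everything except fused LOCUS lines untouched.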
The GenBank changes to move away from the strict column based LOCUS line to white space separation are quite recent. I wonder if the NCBI have updated tbl2asn to handle this now, in which case Prokka just needs to ensure that that tool is up to date?
@bsglicker thank you very much, your message helped me.
@tseemann Hi! We are running prokka on some prokaryote assemblies from INSDC. The contig names (submitted by the users) are fairly long and we have had to use --compliant --centre X to overcome the failures due to name lengths. I have a couple of questions:
- Does prokka maintain a mapping of the contig names that it renamed? I can't seem to find such a file but it's possible I've missed it. We really do need to revert to the original contig names in the GFF files. We do not need the Genbank files at all.
- I've read above that switching to the latest version of tbl2asn may get rid of this problem. Are there plans for prokka to do that?
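On the first question, the thread does not say whether Prokka writes such a mapping. If you have your own table of renamed to original contig names (for instance, because you renamed the contigs yourself before annotation), reverting the names in a GFF3 file could look roughly like this. The function `restore_gff_names` and the `new_to_old` dictionary are hypothetical, and the feature line in the demo is a made-up example.

```python
def restore_gff_names(lines, new_to_old):
    """Yield GFF3 lines with short contig IDs replaced by original names.

    Handles ##sequence-region pragmas, feature lines (column 1), and
    headers in an embedded ##FASTA section. Unknown IDs are left as-is.
    """
    for line in lines:
        if line.startswith("##sequence-region"):
            fields = line.split()
            if len(fields) >= 2:
                fields[1] = new_to_old.get(fields[1], fields[1])
            yield " ".join(fields) + "\n"
        elif line.startswith(">"):
            # FASTA header inside the ##FASTA section.
            seqid, sep, desc = line[1:].rstrip("\n").partition(" ")
            yield ">" + new_to_old.get(seqid, seqid) + sep + desc + "\n"
        elif line.startswith("#") or "\t" not in line:
            # Other pragmas, comments, and sequence lines pass through.
            yield line
        else:
            seqid, rest = line.split("\t", 1)
            yield new_to_old.get(seqid, seqid) + "\t" + rest


new_to_old = {"c00001": "NODE_1_length_283141_cov_27.6228"}
gff = [
    "##gff-version 3\n",
    "##sequence-region c00001 1 283141\n",
    "c00001\tProdigal:2.6\tCDS\t1\t90\t.\t+\t0\tID=X_00001\n",
]
for out_line in restore_gff_names(gff, new_to_old):
    print(out_line, end="")
```

Because only the seqid column and headers are touched, feature coordinates and attributes stay exactly as Prokka wrote them.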