prokka
Please rename your contigs or use --centre XXX to generate clean contig names
So I just tried prokka on a file that contained one sequence as such:
>contig00001 length=455937 numreads=17237
AACTAACAACTAACAACCAACAACAAACCACTAACACATTTGTCTTTCTACAGCCGCTGG
ATCTTTCCCTATTTGATGGATATTGCGATGCGAGATTCGTTGTTCACGCGTCATCGGGTC
GGATTGCTGTCTGCTGTGAGGGGGGATGTGTTGGAAATCGGAGTCGGTACCGGATTGAAT
TTGAAGCATTATCCTGAGCAGACGACGCGGCTCAATGTAGTGGATTCCAATCCTGGAATG
AACGTGCTTCTGCGTCGCCGCATGAAGGGTATTCCATTTCCTGTGCAGCATGCCACAATC
The command:
~/programs/prokka111/bin/prokka --outdir output --cpus 4 --locustag prokka --compliant --usegenus --metagenome --addgene --quiet --force --centre XXX contig.fasta
And it complains with the error:
[02:13:14] Please rename your contigs or use --centre XXX to generate clean contig names.
How can this be bothersome for prokka? It's just header text; if it doesn't like it, it can simply not read it, or rename it by itself in memory. This is the kind of thing that makes a product user-unfriendly. I don't really feel like renaming my hundreds of thousands of contigs...
If I add --centre XXX to the command, as suggested, I still get the same error.
I believe the limit comes from how Prokka tries to use the contig names in the GenBank output (see #32).
Confusingly, the names have nothing to do with my input FASTA file (so the error message is misleading). The names seem to be auto-generated by Prokka itself, e.g. gnl|institute|locus_contig000001, where I can set the centre using --centre institute and the locus tag using --locustag locus, but I seem to have no control over the long contig000001 part.
Workaround: --centre C --locustag L gives names like gnl|C|L_contig000001 which are short enough (testing with Prokka 1.11).
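To see how the centre and locus tag feed into the length of the generated ID, one can build the pattern quoted above and count its characters. This is only a sketch: `gen_id_len` is a hypothetical helper, the pattern is copied from the names reported in this thread rather than from Prokka's source, and the actual length limit depends on the Prokka and tbl2asn versions in use.

```shell
# Hypothetical helper (not part of Prokka): prints the ID that the pattern
# seen in this thread would produce, followed by its length in characters.
# $1 = value passed to --centre, $2 = value passed to --locustag
gen_id_len() {
  id="gnl|${1}|${2}_contig000001"
  printf '%s %s\n' "$id" "${#id}"
}

gen_id_len institute locus   # prints: gnl|institute|locus_contig000001 32
gen_id_len C L               # prints: gnl|C|L_contig000001 20
```

Every character saved in the centre or locus tag shortens the final ID by one, which is why the single-letter workaround gets through.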
Shortening contig000001 to c00001 might help, but I think the real fix is to adjust what Prokka uses as the contig identifiers in the GenBank file - why not just use contig000001 in the GenBank LOCUS line?
I agree that using only contig000001 or the even shorter c000001 would be a workaround for this issue. Actually the whole problem lies with NCBI's tbl2asn, which is very strict with its GenBank IDs (as discussed at length in #76 and other issues).
@tseemann might also replace tbl2asn in the future, see issue #113. I guess it depends on whether users still like SQN files for NCBI submissions.
Yes, in a sense this (#135) and #76 are duplicates - this bug report has the error in the bug title so was easier to find.
For now I have reverted to Prokka 1.10
This may be a closed thread, but I had the same problem with contigs assembled by velvet, which produces long contig names by default (for example: >NODE_1_length_35596_cov_60.583466).
I used sed to strip everything after _length in each header: sed -re 's/(_length)[^=]*$/\1/' ${n}.fasta
where ${n} is the base name of your file.
That resolved the issue with contig names and prokka ran.
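For anyone unsure what that expression does, here is a small self-contained check on the example header from above. It is a sketch, not the exact command used in this thread: `shorten_headers` is a hypothetical wrapper, and it uses `-E` instead of `-r` on the assumption that your sed accepts it (GNU and BSD sed both do).

```shell
# Hypothetical wrapper around the sed expression quoted above.
# Reads FASTA from the files given as arguments, or from stdin if none,
# and writes the result to stdout (the input file is never modified).
shorten_headers() {
  sed -E 's/(_length)[^=]*$/\1/' "$@"
}

# The header is cut after "_length"; sequence lines pass through unchanged.
printf '>NODE_1_length_35596_cov_60.583466\nACGT\n' | shorten_headers
# prints:
# >NODE_1_length
# ACGT
```

Note that the node number stays in the header, so names remain unique as long as the node numbers are.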
@haslamdb sed -re 's/(_length)[^=]*$/\1/' $ ~/velvet/454_roche_13 this command worked. However, I can't find the output file with the converted names. Where can I find the converted file?
@nhungdoan1905 this issue is from 2.5 years ago. Prokka 1.14 should be better?
No, I still got the same problem... sigh
Hi, I ran prokka 1.14 on several contig files (all labelled in a similar manner). For some I got all the output files (.fna, .log, .ffn, ...), but for the others I got only a .fna and a .log file. The log file contains the warning: Please rename your contigs OR try '--centre X --compliant' to generate clean contig names.
Can anyone point out the mistake? Any help will be appreciated. Thank you
@nhungdoan1905 If you haven't solved this yet: the output was only shown on your screen, because sed writes to standard output by default.
You have to redirect it into a file like this:
sed -re 's/(_length)[^=]*$/\1/' $ ~/velvet/454_roche_13 > my_output.fasta
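If trimming headers is not enough, another option is to replace every header with a short sequential name and keep a table mapping the new names back to the old ones. The following is a minimal sketch under stated assumptions: `rename_contigs`, the `c%06d` naming scheme, and the file names in the usage note are all hypothetical choices, not anything Prokka requires.

```shell
# Hypothetical helper: rename FASTA headers to c000001, c000002, ...
# $1 = input FASTA, $2 = output FASTA, $3 = mapping file (new<TAB>old)
rename_contigs() {
  awk -v map="$3" '
    /^>/ {
      n++
      new = sprintf("c%06d", n)
      # Record the new name and the original header (without the ">")
      printf("%s\t%s\n", new, substr($0, 2)) > map
      print ">" new
      next
    }
    { print }   # sequence lines are copied unchanged
  ' "$1" > "$2"
}
```

It could be called as `rename_contigs contigs.fasta renamed.fasta names.tsv`, after which `renamed.fasta` is the file to give to prokka and `names.tsv` lets you recover the original names later.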