PPanGGOLiN
Error in WriteBinaries.py
While trying to run the program on a dataset of 200 annotated genomes (gbff, Bakta-annotated), I got this error:
Traceback (most recent call last):
  File "tables/tableextension.pyx", line 1596, in tables.tableextension.Row.setitem
TypeError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/bostosed/miniconda3/bin/ppanggolin", line 10, in gene/product
Closing remaining open files:/home/bostosed/data1/out_ppan/pangenome.h5...done
I re-ran it with just 10 genomes and the program ran fine, which tells me that maybe one of the genomes has something wrong, but is there a way to flag that genome without having to check them manually?
Hi, my guess here is that one of the genomes has a very weird character in its product field, and somewhere along the line that has an impact on the gene.product type in Python. Though that's a bit unexpected.
Which version did you use?
Adelme
Hi! I used version 1.2.74
Hi, Can you try to find the gff files containing non-ASCII characters?
LC_ALL=C grep -n -P [$'\x80'-$'\xFF'] *.gff
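If `grep -P` is not available, or you want the result per file rather than per line, the same check can be sketched in Python. This is only a minimal sketch and not part of PPanGGOLiN: the function names `non_ascii_line_numbers` and `flag_files`, and the default `*.gff3` pattern, are made up for illustration.

```python
from pathlib import Path


def non_ascii_line_numbers(lines):
    """Return the 1-based numbers of the lines that contain a byte >= 0x80."""
    return [
        number
        for number, raw in enumerate(lines, start=1)
        if not raw.isascii()
    ]


def flag_files(directory, pattern="*.gff3"):
    """Map each matching file name to the numbers of its non-ASCII lines."""
    flagged = {}
    for path in sorted(Path(directory).glob(pattern)):
        # Read as bytes so the check works whatever the file encoding is.
        with open(path, "rb") as handle:
            hits = non_ascii_line_numbers(handle)
        if hits:
            flagged[path.name] = hits
    return flagged
```

Calling `flag_files` on the directory that holds the annotations returns a dictionary of file name to line numbers; files that are pure ASCII are left out, so an empty dictionary means nothing was flagged. For gbff input, pass a different `pattern`.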
David
Ah, yeah. I found these. I think it's the 5´-3´, right?
Aulosira_sp_FACHB-615.gff3:4266:contig_32 Prodigal CDS 2873 5122 . + 0 ID=BFEIHL_20900;Name=ATPase/5���-3��� helicase helicase subunit RecD of the DNA repair enzyme RecBCD (exonuclease V);locus_tag=BFEIHL_20900;product=ATPase/5���-3��� helicase helicase subunit RecD of the DNA repair enzyme RecBCD (exonuclease V);Dbxref=COG:COG0507,COG:L,EC:3.1.11.5,KEGG:K03581,RefSeq:WP_190873607.1,SO:0001217,UniParc:UPI0016820F02,UniRef:UniRef100_UPI0016820F02,UniRef:UniRef50_D6U4N4,UniRef:UniRef90_A0A2Z6D3L5;gene=recD
It looks like it. Can you remove those characters from the gff (or gbff, whichever you are using) and try again?
Yes! It is running now. Thanks!
Closing as this was solved!
Hello,
I have the same problem. I read the solution you offered, but I'm wondering: the single-quote character is supposed to be an ASCII character [https://www.ascii-code.com/39], so why does this error occur?
The other thing is that the single-quote character is present at other places in my gff files, where it's not reported as a non-ASCII character by the command David suggested. Why is that? Thanks C.
Hello,
If I'm not mistaken, the character here is not a single quote but an acute accent, which is the heart of the problem. The acute accent is only present in some character sets of the "extended ASCII" codes and is usually this: https://www.ascii-code.com/180
Though, as you can see in the previous link, in some sets it's not "acute accent" but some other character.
So, as we're storing strings in 7-bit ASCII encoding, no extended ASCII character can be represented, whichever set it comes from, and reading it fails.
As for why you have extended ASCII characters in your files, it's hard to tell. The most likely reason is that the original encoding of the file is either UTF-8 or Latin-1. In theory, if only original ASCII characters are used, that's not a problem, but that's not the case here.
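To make the difference between the two characters concrete, here is a small Python sketch. It does not reproduce the PyTables code path from the traceback; it only shows which of the two characters fits in 7-bit ASCII. The helper name `fits_in_ascii` is made up for illustration.

```python
def fits_in_ascii(text):
    """Return True if every character of `text` has a code point below 128."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


apostrophe = "5'-3' helicase"       # U+0027, ASCII code 39
acute = "5\u00b4-3\u00b4 helicase"  # U+00B4, code 180 in Latin-1

print(fits_in_ascii(apostrophe))  # True
print(fits_in_ascii(acute))       # False

# The same character is stored differently depending on the file encoding,
# but in both cases at least one byte is >= 0x80, which is what the grep flags.
print("\u00b4".encode("latin-1"))  # b'\xb4'
print("\u00b4".encode("utf-8"))    # b'\xc2\xb4'
```

This is also why the plain single quotes elsewhere in the gff files are not reported: they are code 39, well inside the 7-bit range.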
Adelme
Ok, thanks a lot for this explanation! 😃 C.
I get the same problem with a Bakta annotation:
ID=XXXX_18970;Name=ATPase/5���-3��� helicase helicase subunit RecD of the DNA repair enzyme RecBCD (exonuclease V);locus_tag=XXXX_18970;product=ATPase/5���-3��� helicase helicase subunit RecD of the DNA repair enzyme RecBCD (exonuclease V);Dbxref=COG:COG0507,COG:L,RefSeq:WP_112548299.1,SO:0001217,UniParc:UPI000D91DB82,UniRef:UniRef100_UPI000D91DB82,UniRef:UniRef50_A0A064E8F1,UniRef:UniRef90_UPI000B7CE3AD;gene=recD
It's always and only this damn helicase.
I will write an issue to the Bakta repository and hope for the best.
In the meantime I guess you can always replace the acute accent with a single quote character.
Oliver will fix this in the next Bakta version! So I'm closing this again. In the meantime, if it happens to someone, the best current solution is to remove those characters from the gff3 files.
The new release of Bakta, v1.8.0, fixed the problem! That should not happen anymore. If it still does while using Bakta, please upgrade to the latest version and it should fix it.
Not yet...
I tried with Bakta 1.8.1 and got this in file.gff3:
contig_279 Prodigal CDS 408 764 . - 0 ID=PMIR_17920;Name=putative DNA-binding protein with ���double-wing��� structural motif%2C MmcQ/YjbR family;locus_tag=PMIR_17920;product=putative DNA-binding protein with ���double-wing��� structural motif%2C MmcQ/YjbR family;Dbxref=COG:COG2315,COG:K,RefSeq:WP_012367822.1,SO:0001217,UniParc:UPI00017AFEA6,UniRef:UniRef100_A0A5F0RUA5,UniRef:UniRef50_A0A1S1HTR5,UniRef:UniRef90_A0A1Z1SX38;gene=mmcQ
I want to make a new cgMLST with 934 assemblies. I got this annotation on 894 of them.... :(
Thanks for the heads-up:
https://www.ncbi.nlm.nih.gov/research/cog/cog/COG2315/
-> K - COG2315 - Predicted DNA-binding protein with ‘double-wing’ structural motif, MmcQ/YjbR family
what the heck...
When I warned the COG team about the previous family, they said they'd fix it. I warned them again about this one.
Hopefully the next COG release will fix this once and for all.
I'm still not sure how to fix this. In addition, I'm a bit reluctant to add one-off string fixes to the Bakta code for each wrong COG annotation. Who knows how many of them will occur next?
Maybe some sed/grep magic will help here? I'm sorry that I cannot provide more help in this case. But, I do hope, the next COG version will be released, soon ;-)
Clearly understandable! Thank you for looking into it anyway. Hopefully they'll release soon and it'll be fixed.
In the meantime, we'll have to filter out the weird stuff.
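For anyone who needs to do that filtering on many files, here is a minimal Python sketch of one way to go about it. It is not part of PPanGGOLiN or Bakta. The names `QUOTE_LIKE`, `to_ascii` and `sanitize_file` are made up, the list of quote look-alikes is a guess based on the cases seen in this thread, and the input is assumed to be UTF-8 (pass `encoding="latin-1"` otherwise).

```python
import unicodedata

# Look-alikes of the ASCII apostrophe; extend as new cases turn up.
QUOTE_LIKE = str.maketrans({
    "\u00b4": "'",  # acute accent
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u2032": "'",  # prime
})


def to_ascii(text):
    """Replace quote look-alikes, strip accents, drop any other non-ASCII character."""
    text = text.translate(QUOTE_LIKE)
    # NFKD splits accented letters into base letter + combining mark;
    # the combining mark is then dropped by the ASCII encoding step.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", errors="ignore").decode("ascii")


def sanitize_file(source, destination, encoding="utf-8"):
    """Write an ASCII-only copy of `source` to `destination`, line by line."""
    with open(source, encoding=encoding) as src, \
         open(destination, "w", encoding="ascii") as dst:
        for line in src:
            dst.write(to_ascii(line))
```

Writing to a separate `destination` keeps the original annotation untouched, so the cleaned copy can be compared against it before running the pangenome build. Characters that are neither in the mapping nor decomposable are silently dropped, which matches the "remove those characters" advice above but may not be what you want for every field.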