PPanGGOLiN Error in WriteBinaries.py

While trying to run the program on a dataset of 200 annotated genomes (gbff, bakta annotated) I got this error:

Traceback (most recent call last):
  File "tables/tableextension.pyx", line 1596, in tables.tableextension.Row.__setitem__
TypeError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/bostosed/miniconda3/bin/ppanggolin", line 10, in <module>
    sys.exit(main())
  File "/home/bostosed/miniconda3/lib/python3.8/site-packages/ppanggolin/main.py", line 225, in main
    ppanggolin.workflow.workflow.launch(args)
  File "/home/bostosed/miniconda3/lib/python3.8/site-packages/ppanggolin/workflow/workflow.py", line 31, in launch
    writePangenome(pangenome, filename, args.force, disable_bar=args.disable_prog_bar)
  File "/home/bostosed/miniconda3/lib/python3.8/site-packages/ppanggolin/formats/writeBinaries.py", line 650, in writePangenome
    writeAnnotations(pangenome, h5f, disable_bar=disable_bar)
  File "/home/bostosed/miniconda3/lib/python3.8/site-packages/ppanggolin/formats/writeBinaries.py", line 102, in writeAnnotations
    geneRow["gene/product"] = gene.product
  File "tables/tableextension.pyx", line 1601, in tables.tableextension.Row.__setitem__
TypeError: invalid type (<class 'str'>) for column gene/product
Closing remaining open files:/home/bostosed/data1/out_ppan/pangenome.h5...done

I re-ran it with just 10 genomes and the program ran fine, which tells me that maybe one of the genomes has something wrong, but is there a way to flag that genome without having to check them manually?

Sep 24 '22 12:09 EdderDaniel

Hi, my guess is that one of the genomes has a very weird character in its product field, and somewhere along the line that affects the type of gene.product in Python. Though that's a bit unexpected.

Which version did you use?

Adelme

Sep 27 '22 08:09 axbazin

Hi! I used version 1.2.74

Sep 27 '22 08:09 EdderDaniel

Hi, Can you try to find the gff files containing non-ASCII characters?

LC_ALL=C grep -n -P '[\x80-\xFF]' *.gff
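If grep -P is not available, the same check can be sketched in Python. This is only a rough equivalent of the command above, not part of PPanGGOLiN; the function name find_non_ascii_lines and the "*.gff*" file pattern are assumptions made for the example.

```python
from pathlib import Path


def find_non_ascii_lines(path):
    """Return (line_number, line) pairs for lines holding bytes >= 0x80."""
    hits = []
    with open(path, "rb") as handle:  # read raw bytes so decoding cannot fail
        for number, raw in enumerate(handle, start=1):
            if any(byte >= 0x80 for byte in raw):
                text = raw.decode("utf-8", errors="replace").rstrip("\n")
                hits.append((number, text))
    return hits


if __name__ == "__main__":
    # Assumption: the annotation files sit in the current directory.
    for gff in sorted(Path(".").glob("*.gff*")):
        for number, line in find_non_ascii_lines(gff):
            print(f"{gff}:{number}:{line}")
```

Each reported line names the file and line number, so the offending genome can be flagged without opening the files by hand.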

David

Sep 27 '22 08:09 dvallenet

Ah, yeah. I found these. I think it's the 5´-3´, right?

Aulosira_sp_FACHB-615.gff3:4266:contig_32 Prodigal CDS 2873 5122 . + 0 ID=BFEIHL_20900;Name=ATPase/5��-3�� helicase helicase subunit RecD of the DNA repair enzyme RecBCD (exonuclease V);locus_tag=BFEIHL_20900;product=ATPase/5��-3�� helicase helicase subunit RecD of the DNA repair enzyme RecBCD (exonuclease V);Dbxref=COG:COG0507,COG:L,EC:3.1.11.5,KEGG:K03581,RefSeq:WP_190873607.1,SO:0001217,UniParc:UPI0016820F02,UniRef:UniRef100_UPI0016820F02,UniRef:UniRef50_D6U4N4,UniRef:UniRef90_A0A2Z6D3L5;gene=recD

Sep 27 '22 09:09 EdderDaniel

It looks like it. Can you remove those characters from the gff (or gbff, whichever you are using) and try again?

Sep 27 '22 09:09 axbazin

Yes! It is running now. Thanks!

Sep 27 '22 10:09 EdderDaniel

Closing as this was solved!

Feb 15 '23 17:02 axbazin

Hello,

I have the same problem. I read the solution you offered, but I'm wondering: the single-quote character is supposed to be an ASCII character [https://www.ascii-code.com/39], so why does this error occur?

The other thing is that the single-quote character is present at other places in my gff files, where it's not reported as a non-ASCII character by the command David suggested. Why is that?

Thanks, C.

Apr 17 '23 14:04 cmonat

Hello,

If I'm not mistaken, the character here is not a single quote but an acute accent, which is the heart of the problem. The acute accent is only present in some character sets of the "extended ASCII" codes, where it is usually this: https://www.ascii-code.com//180

Though, as you can see in the previous link, in some sets it's not "acute accent" but some other character.

So, as we're storing strings in 7-bit ASCII encoding, extended ASCII is not included in any case, and reading it fails.

As for why you have extended ASCII characters in your files, it's hard to tell. The most likely reason is that the original encoding of the file is either utf-8 or latin1. In theory, if only original ASCII characters are used, it's not a problem, but here it's not the case.
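To make the difference concrete, here is a small Python illustration. It is not PPanGGOLiN's code, only a sketch of why the plain single quote passes and the acute accent does not; the example product string is shortened from the one quoted earlier in the thread.

```python
acute = "\u00b4"      # ACUTE ACCENT, the character in "5´-3´"
apostrophe = "'"      # APOSTROPHE, code 39, plain 7-bit ASCII

print(apostrophe.isascii())      # True
print(acute.isascii())           # False
print(acute.encode("utf-8"))     # b'\xc2\xb4' (two bytes, both >= 0x80)
print(acute.encode("latin-1"))   # b'\xb4' (one byte, value 180)

try:
    "ATPase/5\u00b4-3\u00b4 helicase".encode("ascii")
except UnicodeEncodeError as err:
    print("cannot be stored as ASCII:", err.reason)
```

The two-byte UTF-8 form is also why the grep output shows two unreadable symbols for each accent, while ordinary single quotes elsewhere in the file are not reported.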

Adelme

Apr 17 '23 15:04 axbazin

Ok, thanks a lot for this explanation! 😃 C.

Apr 17 '23 15:04 cmonat

I get the same problem with Bakta annotation:

ID=XXXX_18970;Name=ATPase/5��-3�� helicase helicase subunit RecD of the DNA repair enzyme RecBCD (exonuclease V);locus_tag=XXXX_18970;product=ATPase/5��-3�� helicase helicase subunit RecD of the DNA repair enzyme RecBCD (exonuclease V);Dbxref=COG:COG0507,COG:L,RefSeq:WP_112548299.1,SO:0001217,UniParc:UPI000D91DB82,UniRef:UniRef100_UPI000D91DB82,UniRef:UniRef50_A0A064E8F1,UniRef:UniRef90_UPI000B7CE3AD;gene=recD

May 25 '23 15:05 Nilad

It's always and only this damn helicase.

I will write an issue to the Bakta repository and hope for the best.

In the meantime I guess you can always replace the acute accent with a single quote character.
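A minimal sketch of that replacement in Python, assuming the files are UTF-8 encoded (pass encoding="latin-1" otherwise). The function names and file paths are hypothetical, chosen only for this example.

```python
def replace_acute_accents(text):
    """Swap ACUTE ACCENT (U+00B4) for a plain ASCII single quote."""
    return text.replace("\u00b4", "'")


def fix_file(src, dst, encoding="utf-8"):
    # Assumption: src is readable with the given encoding.
    with open(src, encoding=encoding) as fin, \
            open(dst, "w", encoding=encoding) as fout:
        for line in fin:
            fout.write(replace_acute_accents(line))
```

For instance, fix_file("genome.gff3", "genome.fixed.gff3") writes a cleaned copy and leaves the original file untouched.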

May 25 '23 20:05 axbazin

Oliver will fix this in the next Bakta version! So I'm closing this again. In the meantime, if it happens to someone, the best current solution is to remove those characters from the gff3 files.

May 26 '23 11:05 axbazin

The new release of Bakta, v1.8.0, fixed the problem! That should not happen anymore. If it still does while using Bakta, please upgrade to the latest version and it should be fixed.

May 31 '23 07:05 axbazin

Not yet...

I tried with Bakta 1.8.1 and still get this in file.gff3:

contig_279	Prodigal	CDS	408	764	.	-	0	ID=PMIR_17920;Name=putative DNA-binding protein with ���double-wing��� structural motif%2C MmcQ/YjbR family;locus_tag=PMIR_17920;product=putative DNA-binding protein with ���double-wing��� structural motif%2C MmcQ/YjbR family;Dbxref=COG:COG2315,COG:K,RefSeq:WP_012367822.1,SO:0001217,UniParc:UPI00017AFEA6,UniRef:UniRef100_A0A5F0RUA5,UniRef:UniRef50_A0A1S1HTR5,UniRef:UniRef90_A0A1Z1SX38;gene=mmcQ

I want to make a new cgMLST with 934 assemblies. I got this annotation on 894 of them.... :(

Jun 19 '23 14:06 Nilad

Thanks for the heads-up: https://www.ncbi.nlm.nih.gov/research/cog/cog/COG2315/ -> K - COG2315 - Predicted DNA-binding protein with Ã¢â‚¬Ëœdouble-wingÃ¢â‚¬â„¢ structural motif, MmcQ/YjbR family what the heck...

Jun 19 '23 15:06 oschwengers

When I warned the COG team about the previous family, they said they'd fix it. I warned them again about this one.

Hopefully the next COG release will fix this once and for all.

Jun 19 '23 18:06 axbazin

I'm still not sure how to fix this. In addition, I'm a bit reluctant to add unique string fixes to the Bakta code for each wrong COG annotation. Who knows how many of them will occur next?

Maybe some sed/grep magic will help here? I'm sorry that I cannot provide more help in this case. But I do hope the next COG version will be released soon ;-)

Aug 09 '23 08:08 oschwengers

Clearly understandable! Thank you for looking into it anyway. Hopefully they'll release soon and it'll be fixed.

In the meantime, we'll have to filter out the weird stuff.
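One way to do that filtering, sketched in Python under a few assumptions: the REPLACEMENTS mapping is hypothetical and only covers the characters seen in this thread plus typographic double quotes, and anything else that is not ASCII is transliterated where possible and dropped otherwise.

```python
import unicodedata

# Hypothetical mapping; extend it as new offenders show up.
REPLACEMENTS = {
    "\u00b4": "'",  # acute accent, as in the RecD helicase product
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
}
TABLE = str.maketrans(REPLACEMENTS)


def to_ascii(text):
    """Return an ASCII-only version of text."""
    text = text.translate(TABLE)                # known substitutions first
    text = unicodedata.normalize("NFKD", text)  # split accented letters
    # Drop whatever still falls outside 7-bit ASCII (e.g. combining marks).
    return text.encode("ascii", errors="ignore").decode("ascii")
```

Applying to_ascii to every line of a decoded gff3 file before running PPanGGOLiN should leave only 7-bit ASCII in the product fields, at the cost of silently losing any character that has no mapping.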

Aug 09 '23 11:08 axbazin
