
Error in WriteBinaries.py

Open EdderDaniel opened this issue 2 years ago • 6 comments

While trying to run the program on a dataset of 200 annotated genomes (gbff, bakta annotated) I got this error:

Traceback (most recent call last):
  File "tables/tableextension.pyx", line 1596, in tables.tableextension.Row.__setitem__
TypeError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/bostosed/miniconda3/bin/ppanggolin", line 10, in <module>
    sys.exit(main())
  File "/home/bostosed/miniconda3/lib/python3.8/site-packages/ppanggolin/main.py", line 225, in main
    ppanggolin.workflow.workflow.launch(args)
  File "/home/bostosed/miniconda3/lib/python3.8/site-packages/ppanggolin/workflow/workflow.py", line 31, in launch
    writePangenome(pangenome, filename, args.force, disable_bar=args.disable_prog_bar)
  File "/home/bostosed/miniconda3/lib/python3.8/site-packages/ppanggolin/formats/writeBinaries.py", line 650, in writePangenome
    writeAnnotations(pangenome, h5f, disable_bar=disable_bar)
  File "/home/bostosed/miniconda3/lib/python3.8/site-packages/ppanggolin/formats/writeBinaries.py", line 102, in writeAnnotations
    geneRow["gene/product"] = gene.product
  File "tables/tableextension.pyx", line 1601, in tables.tableextension.Row.__setitem__
TypeError: invalid type (<class 'str'>) for column gene/product
Closing remaining open files:/home/bostosed/data1/out_ppan/pangenome.h5...done

I re-ran it with just 10 genomes and the program ran fine, which tells me that maybe one of the genomes has something wrong, but is there a way to flag that genome without having to check them manually?

EdderDaniel avatar Sep 24 '22 12:09 EdderDaniel

Hi, my guess is that one of the genomes has a very unusual character in its product field, and somewhere along the line that affects the type of gene.product in Python. Though that's a bit unexpected.

Which version did you use?

Adelme

axbazin avatar Sep 27 '22 08:09 axbazin

Hi! I used version 1.2.74

EdderDaniel avatar Sep 27 '22 08:09 EdderDaniel

Hi, Can you try to find the gff files containing non-ASCII characters?

LC_ALL=C grep -n -P '[\x80-\xFF]' *.gff

David

dvallenet avatar Sep 27 '22 08:09 dvallenet
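
For anyone who prefers a script to the grep one-liner, the same check can be sketched in Python. This is only a minimal sketch, not something PPanGGOLiN provides: the function name `find_non_ascii_lines` and the `*.gff3` file pattern are made up for illustration. It reads each file as raw bytes, so it does not depend on the file's encoding, and it reports the file name, the line number and the offending bytes.

```python
import glob
import re

# Any run of bytes outside the 7-bit ASCII range (0x00-0x7F).
NON_ASCII = re.compile(rb"[\x80-\xff]+")


def find_non_ascii_lines(path):
    """Yield (line_number, offending_byte_runs) for each line with non-ASCII bytes.

    Hypothetical helper for illustration; not part of PPanGGOLiN.
    """
    with open(path, "rb") as handle:  # bytes mode: nothing is decoded
        for line_number, line in enumerate(handle, start=1):
            runs = NON_ASCII.findall(line)
            if runs:
                yield line_number, runs


if __name__ == "__main__":
    # The "*.gff3" pattern is an assumption; adjust it to your file names.
    for path in sorted(glob.glob("*.gff3")):
        for line_number, runs in find_non_ascii_lines(path):
            print(f"{path}:{line_number}: {runs}")
```

Run from the directory that holds the annotation files, this lists every line that would need attention, so the genomes do not have to be checked by hand.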

Ah, yeah, I found these. I think it's the 5´-3´, right?

Aulosira_sp_FACHB-615.gff3:4266:contig_32 Prodigal CDS 2873 5122 . + 0 ID=BFEIHL_20900;Name=ATPase/5���-3��� helicase helicase subunit RecD of the DNA repair enzyme RecBCD (exonuclease V);locus_tag=BFEIHL_20900;product=ATPase/5���-3��� helicase helicase subunit RecD of the DNA repair enzyme RecBCD (exonuclease V);Dbxref=COG:COG0507,COG:L,EC:3.1.11.5,KEGG:K03581,RefSeq:WP_190873607.1,SO:0001217,UniParc:UPI0016820F02,UniRef:UniRef100_UPI0016820F02,UniRef:UniRef50_D6U4N4,UniRef:UniRef90_A0A2Z6D3L5;gene=recD

EdderDaniel avatar Sep 27 '22 09:09 EdderDaniel

It looks like it. Can you remove those characters from the gff (or gbff, whichever you are using) and try again?

axbazin avatar Sep 27 '22 09:09 axbazin

Yes! It is running now. Thanks!

EdderDaniel avatar Sep 27 '22 10:09 EdderDaniel

Closing as this was solved!

axbazin avatar Feb 15 '23 17:02 axbazin

Hello,

I have the same problem. I read the solution you offer, but I'm wondering: the single-quote character is supposed to be an ASCII character [https://www.ascii-code.com/39], so why does this error occur?

The other thing is that the single-quote character is present in other places in my gff files, where it's not reported as a non-ASCII character by the command David suggested. Why is that? Thanks C.

cmonat avatar Apr 17 '23 14:04 cmonat

Hello,

If I'm not mistaken here the character is not a single quote, but an acute accent, which is the heart of the problem. Acute accent is only present in some character sets of the "extended ASCII" codes and is usually this: https://www.ascii-code.com//180

Though, as you can see in the previous link, in some sets it's not "acute accent" but some other character.

So, since we store strings in 7-bit ASCII encoding, no extended ASCII character is supported, and writing such a string to the pangenome file fails.

As for why you have extended ASCII characters in your files, it's hard to tell. The most likely reason is that the original encoding of the file is either utf-8 or latin1. In theory, if only original ASCII characters are used, it's not a problem, but here it's not the case.

Adelme

axbazin avatar Apr 17 '23 15:04 axbazin
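
The difference between these characters can be checked directly in Python. This is a small sketch under an assumption: the three candidates compared here (the ASCII apostrophe U+0027, the acute accent U+00B4 and the right single quotation mark U+2019) are illustrative, and which one is actually present in a given file has to be checked on that file. The helper name `ascii_report` is made up for this example.

```python
def ascii_report(char):
    """Return (is_ascii, utf8_bytes, latin1_bytes_or_None) for one character."""
    utf8 = char.encode("utf-8")
    try:
        latin1 = char.encode("latin-1")
    except UnicodeEncodeError:
        latin1 = None  # not representable in latin1 at all
    return char.isascii(), utf8, latin1


candidates = [
    ("apostrophe", "'"),
    ("acute accent", "\u00b4"),
    ("right single quotation mark", "\u2019"),
]
for name, char in candidates:
    print(name, ascii_report(char))
```

Only the apostrophe fits in 7 bits. The other two need at least one byte in the 0x80-0xFF range, whatever the encoding, which is why plain single quotes elsewhere in a file are not reported by the grep command while these look-alikes are.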

Ok, thanks a lot for this explanation! 😃 C.

cmonat avatar Apr 17 '23 15:04 cmonat

I get the same problem with a Bakta annotation:

ID=XXXX_18970;Name=ATPase/5���-3��� helicase helicase subunit RecD of the DNA repair enzyme RecBCD (exonuclease V);locus_tag=XXXX_18970;product=ATPase/5���-3��� helicase helicase subunit RecD of the DNA repair enzyme RecBCD (exonuclease V);Dbxref=COG:COG0507,COG:L,RefSeq:WP_112548299.1,SO:0001217,UniParc:UPI000D91DB82,UniRef:UniRef100_UPI000D91DB82,UniRef:UniRef50_A0A064E8F1,UniRef:UniRef90_UPI000B7CE3AD;gene=recD

Nilad avatar May 25 '23 15:05 Nilad

It's always and only this damn helicase.

I will write an issue to the Bakta repository and hope for the best.

In the meantime I guess you can always replace the acute accent with a single quote character.

axbazin avatar May 25 '23 20:05 axbazin

Oliver will fix this in the next Bakta version! So I'm closing this again. In the meantime, if it happens to someone, the best current solution is to remove those characters from the gff3 files.

axbazin avatar May 26 '23 11:05 axbazin
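
The workaround of replacing or removing those characters can be sketched in Python. This is a minimal sketch, assuming the files are UTF-8 encoded and that swapping typographic quotes for plain ASCII ones is acceptable for your analysis. The replacement table and the names `to_ascii` and `clean_file` are illustrative choices, not something Bakta or PPanGGOLiN provides.

```python
# Typographic characters mapped to their closest ASCII equivalent.
# This table is an illustrative assumption; extend it as needed.
REPLACEMENTS = str.maketrans({
    "\u00b4": "'",  # acute accent
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u2032": "'",  # prime
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})


def to_ascii(text):
    """Replace known typographic characters, then drop any other non-ASCII ones."""
    replaced = text.translate(REPLACEMENTS)
    return replaced.encode("ascii", errors="ignore").decode("ascii")


def clean_file(src, dst):
    """Write an ASCII-only copy of src to dst (hypothetical helper)."""
    # newline="" keeps the original line endings untouched on both sides.
    with open(src, encoding="utf-8", errors="replace", newline="") as fin, \
            open(dst, "w", encoding="ascii", newline="") as fout:
        for line in fin:
            fout.write(to_ascii(line))
```

Writing to a separate output file keeps the original annotation intact, so the cleaned copy can be compared with it before running the analysis again.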

The new release, Bakta v1.8.0, fixes the problem! It should not happen anymore. If it still does while using Bakta, please upgrade to the latest version, which should fix it.

axbazin avatar May 31 '23 07:05 axbazin

Not yet...

I tried with Bakta 1.8.1 and still get this in file.gff3:

contig_279	Prodigal	CDS	408	764	.	-	0	ID=PMIR_17920;Name=putative DNA-binding protein with ���double-wing��� structural motif%2C MmcQ/YjbR family;locus_tag=PMIR_17920;product=putative DNA-binding protein with ���double-wing��� structural motif%2C MmcQ/YjbR family;Dbxref=COG:COG2315,COG:K,RefSeq:WP_012367822.1,SO:0001217,UniParc:UPI00017AFEA6,UniRef:UniRef100_A0A5F0RUA5,UniRef:UniRef50_A0A1S1HTR5,UniRef:UniRef90_A0A1Z1SX38;gene=mmcQ

I want to make a new cgMLST with 934 assemblies. I got this annotation on 894 of them.... :(

Nilad avatar Jun 19 '23 14:06 Nilad

Thanks for the heads-up: https://www.ncbi.nlm.nih.gov/research/cog/cog/COG2315/

-> K - COG2315 - Predicted DNA-binding protein with ‘double-wing’ structural motif, MmcQ/YjbR family

What the heck...

oschwengers avatar Jun 19 '23 15:06 oschwengers

When I warned the COG team about the previous family, they said they'd fix it. I warned them again about this one.

Hopefully the next COG release will fix this once and for all.

axbazin avatar Jun 19 '23 18:06 axbazin

I'm still not sure how to fix this. In addition, I'm a bit reluctant to add one-off string fixes to the Bakta code for each wrong COG annotation. Who knows how many more of them will turn up?

Maybe some sed/grep magic will help here? I'm sorry that I cannot provide more help in this case, but I do hope the next COG version will be released soon ;-)

oschwengers avatar Aug 09 '23 08:08 oschwengers

Completely understandable! Thank you for looking into it anyway. Hopefully they'll release soon and it'll be fixed.

In the meantime, we'll have to filter out the weird stuff.

axbazin avatar Aug 09 '23 11:08 axbazin