
Error in funannotate annotate

Open jimie0311 opened this issue 3 years ago • 9 comments

Hi, I am using funannotate 1.8.7 to annotate a fungal genome. Before gene annotation, InterProScan, Eggnog-mapper, antiSMASH and SignalP 5.0 were run outside of funannotate:

funannotate iprscan -i /fun/update_results/species.proteins.fa -o iprscan.xml -m local --iprscan_path /opt/biosoft/interproscan-5.45-80.0/interproscan.sh --cpus 20

antismash /hap1_masked.fas --genefinding-gff3 /fun/update_results/species.gff3

emapper.py -i /fun/update_results/species.proteins.fa --output eggnog_diamond -m diamond --cpu 50

signalp -batch 30000 -org euk -fasta /fun/update_results/species.proteins.fa -gff3 -mature

However, when I run the command below, I get an error:

funannotate annotate -i fun --eggnog ./eggnogout/eggnog_diamond.emapper.annotations --iprscan iprscan.xml --antismash ./antismashout/hap1_masked.gbk --signalp ./signalpout/species.proteins_summary.signalp5 --busco_db /data/database/BUSCO/basidiomycota_odb9 --cpus 20 --strain "species" --isolate GD1913

ERROR

nohup: ignoring input
[Aug 26 12:45 PM]: OS: Ubuntu 18.04, 160 cores, ~ 1056 GB RAM. Python: 3.7.10
[Aug 26 12:45 PM]: Running 1.8.7
[Aug 26 12:45 PM]: No NCBI SBT file given, will use default, however if you plan to submit to NCBI, create one and pass it here '--
[Aug 26 12:45 PM]: Found existing output directory fun. Warning, will re-use any intermediate files found.
[Aug 26 12:45 PM]: Parsing input files
[Aug 26 12:45 PM]: Existing tbl found: fun/update_results/ustilago_maydis.tbl
[Aug 26 12:49 PM]: Adding Functional Annotation to ustilago_maydis, NCBI accession: None
[Aug 26 12:49 PM]: Annotation consists of: 13,839 gene models
[Aug 26 12:49 PM]: 12,882 protein records loaded
[Aug 26 12:49 PM]: Existing Pfam-A results found: fun/annotate_misc/annotations.pfam.txt
[Aug 26 12:49 PM]: 9,033 annotations added
[Aug 26 12:49 PM]: Running Diamond blastp search of UniProt DB version 2021_03
[Aug 26 12:49 PM]: 410 valid gene/product annotations from 544 total
[Aug 26 12:49 PM]: Existing Eggnog-mapper results found: fun/annotate_misc/eggnog.emapper.annotations
[Aug 26 12:49 PM]: Parsing EggNog Annotations
[Aug 26 12:49 PM]: Combining UniProt/EggNog gene and product names using Gene2Product version 1.70
[Aug 26 12:49 PM]: 410 gene name and product description annotations added
[Aug 26 12:49 PM]: Existing MEROPS results found: fun/annotate_misc/annotations.merops.txt
[Aug 26 12:49 PM]: 280 annotations added
[Aug 26 12:49 PM]: Existing CAZYme results found: fun/annotate_misc/annotations.dbCAN.txt
[Aug 26 12:49 PM]: 255 annotations added
[Aug 26 12:49 PM]: Existing BUSCO2 results found: fun/annotate_misc/annotations.busco.txt
[Aug 26 12:49 PM]: 1,262 annotations added
[Aug 26 12:49 PM]: Existing Phobius results found: fun/annotate_misc/phobius.results.txt
[Aug 26 12:49 PM]: Existing SignalP results found: fun/annotate_misc/signalp.results.txt
Traceback (most recent call last):
  File "/data/Liangjunmin/opt/biosoft/miniconda3_for_pb-assembly/envs/funannotate/bin/funannotate", line 10, in
    sys.exit(main())
  File "/data/Liangjunmin/opt/biosoft/miniconda3_for_pb-assembly/envs/funannotate/lib/python3.7/site-packages/funannotate/funannota
    mod.main(arguments)
  File "/data/Liangjunmin/opt/biosoft/miniconda3_for_pb-assembly/envs/funannotate/lib/python3.7/site-packages/funannotate/annotate.
    phobius_out, signalp_out, membrane_out, secreted_out)
  File "/data/Liangjunmin/opt/biosoft/miniconda3_for_pb-assembly/envs/funannotate/lib/python3.7/site-packages/funannotate/library.p
    if int(cols[1]) > 0: # then found TM domain
ValueError: invalid literal for int() with base 10: 'Organism: euk'

Thanks

jimie0311 avatar Aug 27 '21 12:08 jimie0311
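
The final line of the traceback can be reproduced in isolation. The following is a minimal sketch, not funannotate's code: it assumes the first line of the SignalP 5 file is tab-delimited (the thread does not confirm this) and that the failing step splits a line on tabs and passes the second column to int(). The `header` string and the `second_column_as_int` helper are illustrative names.

```python
# Hypothetical reconstruction of the failing step; not funannotate's actual code.
# Assumption: the SignalP 5 header line is tab-delimited.
header = "# SignalP-5.0\tOrganism: euk\tTimestamp: 20210825205549"


def second_column_as_int(line):
    """Split a line on tabs and convert the second column to an integer."""
    cols = line.rstrip("\n").split("\t")
    return int(cols[1])


try:
    second_column_as_int(header)
    error_message = None
except ValueError as exc:
    # The header's second column is text, so int() rejects it.
    error_message = str(exc)

print(error_message)
```

Under that assumption the message matches the one in the log, which suggests the parser reached a SignalP 5 header line it did not expect rather than a malformed data row.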

You should not run antiSMASH with GFF -- use the GBK file from predict_results and turn off gene-finding. Then run signalP like this: signalp -stdout -org euk -format short -fasta proteins.genome > signalp.out. But it's much easier to just let funannotate run signalP, as it will run it with multiprocessing by splitting the input and then ensure your format is correct.

nextgenusfs avatar Aug 27 '21 13:08 nextgenusfs

Hi Jon, thanks for your quick reply. Sorry, I cannot use the gbk file to run antiSMASH, since there are several alternatively spliced transcripts for each gene and the error "multiple CDS features have the same name for mapping" comes up. In the end I used "antismash genome.fasta --genefinding-gff3 /fun/update_results/species.gff3".

For signalp, how do I run signalp within funannotate? Sorry, I didn't find the related documentation. I tried signalp as you suggested. It seems the results are the same as those obtained with "signalp -batch 30000 -org euk -fasta /fun/update_results/species.proteins.fa -gff3 -mature":

# SignalP-5.0 Organism: euk Timestamp: 20210825205549
# ID Prediction SP(Sec/SPI) OTHER CS Position
FUN_000002-T1 OTHER 0.018607 0.981393
FUN_000003-T1 OTHER 0.000467 0.999533
FUN_000004-T1 OTHER 0.000502 0.999498
FUN_000004-T2 OTHER 0.000440 0.999560
FUN_000007-T1 OTHER 0.001157 0.998843
FUN_000009-T1 OTHER 0.001537 0.998463
FUN_000012-T1 OTHER 0.000786 0.999214
FUN_000012-T2 OTHER 0.000786 0.999214
FUN_000012-T3 OTHER 0.000786 0.999214
FUN_000013-T1 OTHER 0.000579 0.999421
FUN_000015-T1 OTHER 0.002556 0.997444
FUN_000017-T1 OTHER 0.002832 0.997168
FUN_000019-T1 OTHER 0.001105 0.998895
FUN_000020-T1 OTHER 0.000495 0.999505
FUN_000025-T1 OTHER 0.001763 0.998237
FUN_000028-T1 OTHER 0.007864 0.992136
FUN_000029-T1 OTHER 0.001680 0.998320
FUN_000030-T1 OTHER 0.001432 0.998568
FUN_000031-T1 OTHER 0.010161 0.989839
FUN_000032-T1 OTHER 0.004603 0.995397
FUN_000033-T1 OTHER 0.001979 0.998021
FUN_000034-T1 OTHER 0.001697 0.998303
FUN_000035-T1 OTHER 0.001207 0.998793
FUN_000036-T1 OTHER 0.014808 0.985192
FUN_000037-T1 OTHER 0.003916 0.996084
FUN_000038-T1 OTHER 0.002448 0.997552
......

Obtained your way:

# SignalP-5.0 Organism: euk Timestamp: 20210830075525
# ID Prediction SP(Sec/SPI) OTHER CS Position
FUN_000002-T1 OTHER 0.018607 0.981393
FUN_000003-T1 OTHER 0.000467 0.999533
FUN_000004-T1 OTHER 0.000502 0.999498
FUN_000004-T2 OTHER 0.000440 0.999560
FUN_000007-T1 OTHER 0.001157 0.998843
FUN_000009-T1 OTHER 0.001537 0.998463
FUN_000012-T1 OTHER 0.000786 0.999214
FUN_000012-T2 OTHER 0.000786 0.999214
FUN_000012-T3 OTHER 0.000786 0.999214
FUN_000013-T1 OTHER 0.000579 0.999421
FUN_000015-T1 OTHER 0.002556 0.997444
FUN_000017-T1 OTHER 0.002832 0.997168
FUN_000019-T1 OTHER 0.001105 0.998895
FUN_000020-T1 OTHER 0.000495 0.999505
FUN_000025-T1 OTHER 0.001763 0.998237
FUN_000028-T1 OTHER 0.007864 0.992136
FUN_000029-T1 OTHER 0.001680 0.998320
FUN_000030-T1 OTHER 0.001432 0.998568
FUN_000031-T1 OTHER 0.010161 0.989839
FUN_000032-T1 OTHER 0.004603 0.995397
FUN_000033-T1 OTHER 0.001979 0.998021
FUN_000034-T1 OTHER 0.001697 0.998303
FUN_000035-T1 OTHER 0.001207 0.998793
FUN_000036-T1 OTHER 0.014808 0.985192
FUN_000037-T1 OTHER 0.003916 0.996084
FUN_000038-T1 OTHER 0.002448 0.997552

Thanks.

jimie0311 avatar Aug 30 '21 01:08 jimie0311

The signalP parser is expecting the results to be tab-delimited -- are they not tab-delimited for some reason? I do not have a license for signalP > 4.1, so I've never had an actual copy in my hands, but I wrote the parser based on user feedback.

antismash error -- interesting, I've definitely run multi-transcript genomes through with GenBank, so perhaps it's a new version. But great if the GFF now works with the newest antismash.

nextgenusfs avatar Aug 30 '21 01:08 nextgenusfs

Hi Jon, could you give me an example of the signalp result file used for funannotate annotate?

jimie0311 avatar Aug 30 '21 01:08 jimie0311

I can only give an example from v4.1, but the output looks correct for v5 except that the parser is expecting tab-delimited input. So if your file has spaces and not tabs, that is the problem, and I'll need to update the parser to just split on spaces if tabs aren't found.

nextgenusfs avatar Aug 30 '21 02:08 nextgenusfs
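
The fallback described above (split on tabs, and on spaces when no tab is found) could look roughly like the sketch below. This is an illustration only, not funannotate's parser: `split_columns` and `parse_signalp5_short` are hypothetical names, and the column layout (ID first, Prediction second, header lines starting with `#`) is taken from the output pasted earlier in this thread.

```python
def split_columns(line):
    """Split on tabs when the line contains one, otherwise on runs of whitespace."""
    line = line.rstrip("\n")
    if "\t" in line:
        return line.split("\t")
    return line.split()


def parse_signalp5_short(lines):
    """Return {protein_id: prediction}, skipping blank lines and '#' header lines."""
    results = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = split_columns(line)
        results[cols[0]] = cols[1]
    return results


# The same record, once tab-delimited and once space-delimited.
tab_rows = [
    "# SignalP-5.0\tOrganism: euk\n",
    "FUN_000002-T1\tOTHER\t0.018607\t0.981393\n",
]
space_rows = [
    "# SignalP-5.0 Organism: euk\n",
    "FUN_000002-T1 OTHER 0.018607 0.981393\n",
]

tab_result = parse_signalp5_short(tab_rows)
space_result = parse_signalp5_short(space_rows)
print(tab_result)
print(space_result)
```

Skipping lines that start with `#` before any numeric conversion also keeps the header out of the data path, whichever delimiter the file uses.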

In the end I could not work out how to resolve it. I just removed --signalp from the funannotate annotate command and will try to merge the signalp results manually. Thanks.

jimie0311 avatar Sep 03 '21 01:09 jimie0311

So you were unable to tell if signalP 5.0 on your computer was generating tab delimited or space delimited output? All data I had seen was tab delimited-- my guess is your particular version is outputting space delimited output which is why the parser is failing. I can fix it, I just need the answer to that question.

nextgenusfs avatar Sep 03 '21 02:09 nextgenusfs

Sorry, I may have misunderstood what you meant. Please see the attachment for the SignalP 5 output, which was generated separately, outside funannotate.

The command I used for signalp5 was signalp -batch 30000 -org euk -format short -fasta proteins.fa -gff3 -mature

Thanks for your generous help.

Junmin Liang State Key Laboratory of Mycology, Institute of Microbiology Chinese Academy of Sciences

No.1 Beichen West Road, Chaoyang District, Beijing, P. R. China 100101

jimie0311 avatar Sep 03 '21 02:09 jimie0311

I cannot tell how the columns in your file are delimited when you paste it into GitHub. Either attach the signalP output file, or open it in a text editor and check whether the columns are separated by spaces or tabs.

nextgenusfs avatar Sep 03 '21 04:09 nextgenusfs
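
For anyone who prefers a script to eyeballing the file in an editor, the check can be sketched as follows. `delimiter_report` is a hypothetical helper, not part of funannotate or SignalP; the demo writes a small temporary file with one tab-delimited and one space-delimited row so the example runs on its own.

```python
import os
import tempfile


def delimiter_report(path):
    """Count data lines (not blank, not starting with '#') by delimiter type."""
    counts = {"tab": 0, "space_only": 0}
    with open(path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue
            if "\t" in line:
                counts["tab"] += 1
            elif " " in line.strip():
                counts["space_only"] += 1
    return counts


# Self-contained demo: one tab-delimited row and one space-delimited row.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("# ID\tPrediction\n")
    tmp.write("FUN_000002-T1\tOTHER\t0.018607\t0.981393\n")
    tmp.write("FUN_000003-T1 OTHER 0.000467 0.999533\n")
    demo_path = tmp.name

demo_counts = delimiter_report(demo_path)
os.remove(demo_path)
print(demo_counts)
```

To check a real result, pass the path of the SignalP output instead of the temporary file, for example the `species.proteins_summary.signalp5` file given to `--signalp` in the original command. A non-zero `space_only` count would point to space-delimited rows.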