funannotate
funannotate copied to clipboard
Update fails : no ID
Are you using the latest release? I am using version 1.8.7
Describe the bug Funannotate update fails at the end while comparing annotations. 2 models have no ID. I ran train and predict, both ran successfully and the output is in a directory called 'fun'
What command did you issue? funannotate update -i fun --cpus 8
Logfiles I had a very similar issue to issue #506, but I couldn't find the solution there. Here's the check-version, log files and tbl file. I have two loci without any ID: FUN_004038, FUN_012373
Here's what the error looked like:
[Jul 31 04:10 PM]: Writing 12,932 loci to TBL format: dropped 0 overlapping, 2 too short, and 0 frameshift gene models [Jul 31 04:10 PM]: Converting to Genbank format [Jul 31 04:14 PM]: Collecting final annotation files [Jul 31 04:15 PM]: Parsing GenBank files...comparing annotation [Jul 31 04:15 PM]: putative transcript from ncbi:FUN_004038-T1 has no ID (ncbi:FUN_004038-T1 None ncbi:FUN_004038-T1) [Jul 31 04:15 PM]: putative transcript from ncbi:FUN_012373-T1 has no ID (ncbi:FUN_012373-T1 None ncbi:FUN_012373-T1) Traceback (most recent call last): File "/local/cluster/funannotate/bin/funannotate", line 10, in sys.exit(main()) File "/local/cluster/funannotate-1.8.7/lib/python3.7/site-packages/funannotate/funannotate.py", line 705, in main mod.main(arguments) File "/local/cluster/funannotate-1.8.7/lib/python3.7/site-packages/funannotate/update.py", line 2432, in main compareAnnotations2(GBK, final_gbk, Changes, args=args) File "/local/cluster/funannotate-1.8.7/lib/python3.7/site-packages/funannotate/update.py", line 1424, in compareAnnotations2 newGenes[gene[2]]['CDS'], hitInfo['CDS']) File "/local/cluster/funannotate-1.8.7/lib/python3.7/site-packages/funannotate/update.py", line 1559, in pairwiseAED splitAED = [pAED[i:i+len(query)] for i in range(0, len(pAED), len(query))] ValueError: range() arg 3 must not be zero
fun_check_version.txt Rhizopogon_salebrosus.tbl.txt update_log_7_30.e9819512.txt
I don't see anything wrong in the tbl file for these two models, other than they are partial. In this function the script is actually parsing the genbank file -- can you post what the gene models look like in genbank format? (you can also just post the genbank file and I can look).
I'm having difficulties uploading the genbank file so I will email you it for now, but here's the gene models from that file ///////////////////////////////////////////////// first model
gene 173086..>173862
/locus_tag="FUN_004038"
mRNA join(173086..173143,173244..173498,173554..173670,
173722..>173862)
/locus_tag="FUN_004038"
/product="hypothetical protein"
CDS join(173315..173498,173554..173670,173722..>173867)
/codon_start=1
/product="hypothetical protein"
/protein_id="ncbi:FUN_004038-T1"
/translation="MDITQITQNSQSAQSSQSAKLNKCAAQRRYRKKEGQGFLQLRQA
LKEVSKDPRSKQDTLRRASSELRRLADENKLLLQAIIEAWNNTGEFDKDDHLRPASDE
DPLLCMEWWSHDTTKMADITRSDMARHCVLSASVLPTINMFTSPEXX"
assembly_gap 173863..174515
/estimated_length=653
/gap_type="within scaffold"
/linkage_evidence="paired-ends"
ORIGIN .(etc)
//////////////////////////////////////////////////////////////////////// next model
LOCUS scaffold_1074 2585 bp DNA linear PLN 31-JUL-2021
DEFINITION Rhizopogon salebrosus.
ACCESSION
VERSION
KEYWORDS .
SOURCE Rhizopogon salebrosus
ORGANISM Rhizopogon salebrosus
Eukaryota; Fungi; Dikarya; Basidiomycota; Agaricomycotina;
Agaricomycetes; Agaricomycetidae; Boletales; Suillineae;
Rhizopogonaceae; Rhizopogon.
REFERENCE 1 (bases 1 to 2585)
AUTHORS Palmer,J.M.
TITLE Direct Submission
JOURNAL Submitted (31-JUL-2021) CFMR, USDA Forest Service, 1 Gifford
Pinchot Drive, Madison, WI 53726, USA
COMMENT 'Annotated using 1.8.7'.
FEATURES Location/Qualifiers
source 1..2585
/organism="Rhizopogon salebrosus"
/mol_type="genomic DNA"
/db_xref="taxon:176626"
gene 1..1042
/locus_tag="FUN_012372"
mRNA join(1..234,281..337,395..419,478..606,664..786,834..1042)
/locus_tag="FUN_012372"
/product="hypothetical protein"
CDS join(63..234,281..337,395..419,478..606,664..786,834..924)
/locus_tag="FUN_012372"
/codon_start=1
/product="hypothetical protein"
/protein_id="ncbi:FUN_012372-T1"
/translation="MFTLIMTNNPEIQERAQSEIDKVVGFNRLPNFNDLPALPYVGAL
IREMKRWHPVGPLGMAHSNMEDDFYKGYFIPKGASIVPNIWAMAHNPKKYPEPECFRP
DRFLKPDGTLNDDTLPWIFGFGRRRCPGTHVADASLWCAITCILATFTIKKLVGQEPE
IKWASGLSSYPLPFPCQFVPRKARDAQALTNLVQNSLL"
gene 1156..>1683
/locus_tag="FUN_012373"
mRNA join(1156..1231,1285..1421,1479..>1683)
/locus_tag="FUN_012373"
/product="hypothetical protein"
CDS join(1208..1231,1285..1421,1479..>1685)
/codon_start=1
/product="hypothetical protein"
/protein_id="ncbi:FUN_012373-T1"
/translation="MVMLVHAEHPPSPFIPRDQPEVVVSKHVKRHNDSVASKRSPTVG
ALDDHDGVYLKPNPKLCVTFVGMGSQHPAMGRQLADKYPSFLASIKENDRILMEVYGQ
PSLLDRTGLFVPGAKCDLSP"
assembly_gap 1684..1698
/estimated_length=15
/gap_type="within scaffold"
/linkage_evidence="paired-ends"
ORIGIN
oops didnt mean to close it :)
Okay, I see the problem now. Neither of those two models have locus_tags in the CDS feature. Both of which seem to be because they abut an assembly gap and for whatever reason NCBI's tbl2asn
didn't write a locus_tag for the feature. I'm not sure why.... The code complains that the feature is not valid because it does not have a locus_tag -- without a locus_tag it is unable to assign the CDS to a gene model. NCBI GenBank format is odd in that there is actually nothing required to link the gene, mRNA, and CDS -- to make it worse, sometimes tbl2asn
doesn't write features in the same order.
When I take a closer look at the tbl file you sent I think I see the problem. The CDS feature is actually outside of the gene and mRNA feature, for example:
173086 >173862 gene
locus_tag FUN_004038
173086 173143 mRNA
173244 173498
173554 173670
173722 >173862
product hypothetical protein
transcript_id gnl|ncbi|FUN_004038-T1_mrna
protein_id gnl|ncbi|FUN_004038-T1
173315 173498 CDS
173554 173670
173722 >173867
codon_start 1
product hypothetical protein
transcript_id gnl|ncbi|FUN_004038-T1_mrna
protein_id gnl|ncbi|FUN_004038-T1
So I think this means there must be a problem with the adjustment of this gene model in relation to the sequencing gap. The end coordinate of the CDS model should likely read >173862 or the gene and mRNA features should also extend to 173867.
The same is true of the other model:
1156 >1683 gene
locus_tag FUN_012373
1156 1231 mRNA
1285 1421
1479 >1683
product hypothetical protein
transcript_id gnl|ncbi|FUN_012373-T1_mrna
protein_id gnl|ncbi|FUN_012373-T1
1208 1231 CDS
1285 1421
1479 >1685
codon_start 1
product hypothetical protein
transcript_id gnl|ncbi|FUN_012373-T1_mrna
protein_id gnl|ncbi|FUN_012373-T1
I need to see if I can find where this might be happening, ie if it is something that funannotate is doing or if it just comes from PASA directly and is parsed incorrectly.
@skylerhar1 can you pull out the tbl format for these gene model from funannotate predict
, ie in the predict_results/*.tbl file? That will help us determine if PASA modified these to be problematic of if they were that way from predict. Thanks.
Here it is! (converted to a txt file for github) Rhizopogon_salebrosus_fun_predict.txt
Okay, so PASA is making these changes as the model is different -- and then introducing these "errors". I should be able to find a way to fix them in the update script.
173315 >173808 gene
locus_tag FUN_004038
173315 173498 mRNA
173554 173670
173722 >173808
product hypothetical protein
transcript_id gnl|ncbi|FUN_004038-T1_mrna
protein_id gnl|ncbi|FUN_004038-T1
173315 173498 CDS
173554 173670
173722 >173808
codon_start 1
product hypothetical protein
transcript_id gnl|ncbi|FUN_004038-T1_mrna
protein_id gnl|ncbi|FUN_004038-T1
1208 >1421 gene
locus_tag FUN_012373
1208 1231 mRNA
1285 >1421
product hypothetical protein
transcript_id gnl|ncbi|FUN_012373-T1_mrna
protein_id gnl|ncbi|FUN_012373-T1
1208 1231 CDS
1285 >1421
codon_start 1
product hypothetical protein
transcript_id gnl|ncbi|FUN_012373-T1_mrna
protein_id gnl|ncbi|FUN_012373-T1
sounds good, let me know if there are any updates
Hi @skylerhar1 sorry for the delay. I'm unable to test this locally because I don't have a dataset that exhibits this behavior. So I pushed some updated code to a new branch and I will need your help to test.
You can install the updated code with this command from your funannotate environment:
python -m pip install git+https://github.com/nextgenusfs/funannotate.git@pasa_update_fix
And then if you could re-run your update command and see what the behavior is, ie does it error out? And then furthermore if you could get me these two gene models FUN_004038
and FUN_012373
I need to see if their coordinates are now fixed in relation to the sequence gap adjacent to each model.
Should I delete the update_results and update_misc files from the directory and then rerun update, or should I leave them in.
On Sat, Sep 11, 2021 at 11:39 AM Jon Palmer @.***> wrote:
[This email originated from outside of OSU. Use caution with links and attachments.]
Hi @skylerhar1 https://github.com/skylerhar1 sorry for the delay. I'm unable to test this locally because I don't have a dataset that exhibits this behavior. So I pushed some updated code to a new branch and I will need your help to test.
You can install the updated code with this command from your fun annotate environment:
python -m pip install @.***_update_fix
And then if you could re-run your update command and see what the behavior is, ie does it error out? And then furthermore if you could get me these two gene models FUN_004038 and FUN_012373 I need to see if their coordinates are now fixed in relation to the sequence gap adjacent to each model.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/nextgenusfs/funannotate/issues/624#issuecomment-917454214, or unsubscribe https://github.com/notifications/unsubscribe-auth/AQD3U45T4O6IWTO7LFAXUQDUBOO6PANCNFSM5BUPP7ZQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.
-- Sincerely, Skyler Har
If you could rename the update_misc and update_results folder that would be good. Then it will run from beginning and we can compare results from old versus new.
I'm a little confused, the coordinates are now different by quite a bit with these gene models. You sure this is the same input from funannotate predict
? It seems like PASA this time either didn't call gene models in these regions or they were substantially changed?
Thats my bad, I will update you with the correct tbl results soon
Sorry about that, I accidentally used data from a different trial run. Heres the correct tbl file:
173086 >173862 gene locus_tag FUN_004038 173086 173143 mRNA 173244 173498 173554 173670 173722 >173862 product hypothetical protein transcript_id gnl|ncbi|FUN_004038-T1_mrna protein_id gnl|ncbi|FUN_004038-T1 173315 173498 CDS 173554 173670 173722 >173086 codon_start 1 product hypothetical protein transcript_id gnl|ncbi|FUN_004038-T1_mrna protein_id gnl|ncbi|FUN_004038-T1
1156 >1683 gene locus_tag FUN_012373 1156 1231 mRNA 1285 1421 1479 >1683 product hypothetical protein transcript_id gnl|ncbi|FUN_012373-T1_mrna protein_id gnl|ncbi|FUN_012373-T1 1208 1231 CDS 1285 1421 1479 >1156 codon_start 1 product hypothetical protein transcript_id gnl|ncbi|FUN_012373-T1_mrna protein_id gnl|ncbi|FUN_012373-T1
additionally it seems those two gene models were special because while there were 44 gene models that needed correcting, these two had special errors: FUN_000096 Feature begins or ends in gap starting at 214116 FUN_000427 Feature begins or ends in gap starting at 191884 FUN_001052 Feature begins or ends in gap starting at 541846 FUN_001326 Feature begins or ends in gap starting at 278278 FUN_001490 Feature begins or ends in gap starting at 278911 FUN_002558 Feature begins or ends in gap starting at 288775 FUN_003011 Feature begins or ends in gap starting at 131505 FUN_003085 Feature begins or ends in gap starting at 64330 FUN_003209 Feature begins or ends in gap starting at 160428 FUN_003251 Feature begins or ends in gap starting at 33414 FUN_003713 Feature begins or ends in gap starting at 55898 FUN_003721 Feature begins or ends in gap starting at 83050 FUN_004038 Location: Mixed strands in SeqLoc [(lcl|scaffold_33:173315-173498, 173554-173670, c173722-<173086)] FUN_004054 Feature begins or ends in gap starting at 40758 FUN_004391 Feature begins or ends in gap starting at 96178 FUN_005024 Feature begins or ends in gap starting at 22438 FUN_005046 Feature begins or ends in gap starting at 80164 FUN_005685 Feature begins or ends in gap starting at 47866 FUN_006193 Feature begins or ends in gap starting at 35322 FUN_006237 Feature begins or ends in gap starting at 95310 FUN_006594 Feature begins or ends in gap starting at 11874 FUN_006755 Feature begins or ends in gap starting at 29046 FUN_007256 Feature begins or ends in gap starting at 61500 FUN_007718 Feature begins or ends in gap starting at 41587 FUN_008205 Feature begins or ends in gap starting at 6217 FUN_008884 Feature begins or ends in gap starting at 17110 FUN_008915 Feature begins or ends in gap starting at 7657 FUN_008932 Feature begins or ends in gap starting at 4013 FUN_009350 Feature begins or ends in gap starting at 6175 FUN_009476 Feature begins or ends in gap starting at 19797 FUN_009798 Feature begins or ends in gap starting at 15724 FUN_010163 Feature begins or ends in gap starting at 15695 FUN_010720 Feature begins or ends in gap starting at 8141 FUN_011473 Feature begins or ends in gap starting at 5125 FUN_011964 Feature begins or ends in gap starting at 2283 FUN_012373 Location: Mixed strands in SeqLoc [(lcl|scaffold_1074:1208-1231, 1285-1421, c1479-<1156)] FUN_012865 Feature begins or ends in gap starting at 581860 FUN_012869 Feature begins or ends in gap starting at 163870