funannotate icon indicating copy to clipboard operation
funannotate copied to clipboard

Update fails : no ID

Open skylerhar1 opened this issue 3 years ago • 15 comments

Are you using the latest release? I am using version 1.8.7

Describe the bug Funannotate update fails at the end while comparing annotations. 2 models have no ID. I ran train and predict, both ran successfully and the output is in a directory called 'fun'

What command did you issue? funannotate update -i fun --cpus 8

Logfiles I had a very similar issue to issue #506, but I couldn't find the solution there. Here's the check-version, log files and tbl file. I have two loci without any ID: FUN_004038, FUN_012373

Here's what the error looked like:

[Jul 31 04:10 PM]: Writing 12,932 loci to TBL format: dropped 0 overlapping, 2 too short, and 0 frameshift gene models [Jul 31 04:10 PM]: Converting to Genbank format [Jul 31 04:14 PM]: Collecting final annotation files [Jul 31 04:15 PM]: Parsing GenBank files...comparing annotation [Jul 31 04:15 PM]: putative transcript from ncbi:FUN_004038-T1 has no ID (ncbi:FUN_004038-T1 None ncbi:FUN_004038-T1) [Jul 31 04:15 PM]: putative transcript from ncbi:FUN_012373-T1 has no ID (ncbi:FUN_012373-T1 None ncbi:FUN_012373-T1) Traceback (most recent call last): File "/local/cluster/funannotate/bin/funannotate", line 10, in sys.exit(main()) File "/local/cluster/funannotate-1.8.7/lib/python3.7/site-packages/funannotate/funannotate.py", line 705, in main mod.main(arguments) File "/local/cluster/funannotate-1.8.7/lib/python3.7/site-packages/funannotate/update.py", line 2432, in main compareAnnotations2(GBK, final_gbk, Changes, args=args) File "/local/cluster/funannotate-1.8.7/lib/python3.7/site-packages/funannotate/update.py", line 1424, in compareAnnotations2 newGenes[gene[2]]['CDS'], hitInfo['CDS']) File "/local/cluster/funannotate-1.8.7/lib/python3.7/site-packages/funannotate/update.py", line 1559, in pairwiseAED splitAED = [pAED[i:i+len(query)] for i in range(0, len(pAED), len(query))] ValueError: range() arg 3 must not be zero

fun_check_version.txt Rhizopogon_salebrosus.tbl.txt update_log_7_30.e9819512.txt

skylerhar1 avatar Aug 05 '21 19:08 skylerhar1

I don't see anything wrong in the tbl file for these two models, other than they are partial. In this function the script is actually parsing the genbank file -- can you post what the gene models look like in genbank format? (you can also just post the genbank file and I can look).

nextgenusfs avatar Aug 12 '21 02:08 nextgenusfs

I'm having difficulties uploading the genbank file so I will email you it for now, but here's the gene models from that file ///////////////////////////////////////////////// first model

 gene            173086..>173862
                     /locus_tag="FUN_004038"
     mRNA            join(173086..173143,173244..173498,173554..173670,
                     173722..>173862)
                     /locus_tag="FUN_004038"
                     /product="hypothetical protein"
     CDS             join(173315..173498,173554..173670,173722..>173867)
                     /codon_start=1
                     /product="hypothetical protein"
                     /protein_id="ncbi:FUN_004038-T1"
                     /translation="MDITQITQNSQSAQSSQSAKLNKCAAQRRYRKKEGQGFLQLRQA
                     LKEVSKDPRSKQDTLRRASSELRRLADENKLLLQAIIEAWNNTGEFDKDDHLRPASDE
                     DPLLCMEWWSHDTTKMADITRSDMARHCVLSASVLPTINMFTSPEXX"
     assembly_gap    173863..174515
                     /estimated_length=653
                     /gap_type="within scaffold"
                     /linkage_evidence="paired-ends"
ORIGIN      .(etc)

//////////////////////////////////////////////////////////////////////// next model

LOCUS       scaffold_1074           2585 bp    DNA     linear   PLN 31-JUL-2021
DEFINITION  Rhizopogon salebrosus.
ACCESSION   
VERSION
KEYWORDS    .
SOURCE      Rhizopogon salebrosus
  ORGANISM  Rhizopogon salebrosus
            Eukaryota; Fungi; Dikarya; Basidiomycota; Agaricomycotina;
            Agaricomycetes; Agaricomycetidae; Boletales; Suillineae;
            Rhizopogonaceae; Rhizopogon.
REFERENCE   1  (bases 1 to 2585)
  AUTHORS   Palmer,J.M.
  TITLE     Direct Submission
  JOURNAL   Submitted (31-JUL-2021) CFMR, USDA Forest Service, 1 Gifford
            Pinchot Drive, Madison, WI 53726, USA
COMMENT     'Annotated using 1.8.7'.
FEATURES             Location/Qualifiers
     source          1..2585
                     /organism="Rhizopogon salebrosus"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:176626"
     gene            1..1042
                     /locus_tag="FUN_012372"
     mRNA            join(1..234,281..337,395..419,478..606,664..786,834..1042)
                     /locus_tag="FUN_012372"
                     /product="hypothetical protein"
     CDS             join(63..234,281..337,395..419,478..606,664..786,834..924)
                     /locus_tag="FUN_012372"
                     /codon_start=1
                     /product="hypothetical protein"
                     /protein_id="ncbi:FUN_012372-T1"
                     /translation="MFTLIMTNNPEIQERAQSEIDKVVGFNRLPNFNDLPALPYVGAL
                     IREMKRWHPVGPLGMAHSNMEDDFYKGYFIPKGASIVPNIWAMAHNPKKYPEPECFRP
                     DRFLKPDGTLNDDTLPWIFGFGRRRCPGTHVADASLWCAITCILATFTIKKLVGQEPE
                     IKWASGLSSYPLPFPCQFVPRKARDAQALTNLVQNSLL"
     gene            1156..>1683
                     /locus_tag="FUN_012373"
     mRNA            join(1156..1231,1285..1421,1479..>1683)
                     /locus_tag="FUN_012373"
                     /product="hypothetical protein"
     CDS             join(1208..1231,1285..1421,1479..>1685)
                     /codon_start=1
                     /product="hypothetical protein"
                     /protein_id="ncbi:FUN_012373-T1"
                     /translation="MVMLVHAEHPPSPFIPRDQPEVVVSKHVKRHNDSVASKRSPTVG
                     ALDDHDGVYLKPNPKLCVTFVGMGSQHPAMGRQLADKYPSFLASIKENDRILMEVYGQ
                     PSLLDRTGLFVPGAKCDLSP"
     assembly_gap    1684..1698
                     /estimated_length=15
                     /gap_type="within scaffold"
                     /linkage_evidence="paired-ends"
ORIGIN      

skylerhar1 avatar Aug 12 '21 18:08 skylerhar1

oops didnt mean to close it :)

skylerhar1 avatar Aug 12 '21 18:08 skylerhar1

Okay, I see the problem now. Neither of those two models have locus_tags in the CDS feature. Both of which seem to be because they abut an assembly gap and for whatever reason NCBI's tbl2asn didn't write a locus_tag for the feature. I'm not sure why.... The code complains that the feature is not valid because it does not have a locus_tag -- without a locus_tag it is unable to assign the CDS to a gene model. NCBI GenBank format is odd in that there is actually nothing required to link the gene, mRNA, and CDS -- to make it worse, sometimes tbl2asn doesn't write features in the same order.

When I take a closer look at the tbl file you sent I think I see the problem. The CDS feature is actually outside of the gene and mRNA feature, for example:

173086	>173862	gene
			locus_tag	FUN_004038
173086	173143	mRNA
173244	173498
173554	173670
173722	>173862
			product	hypothetical protein
			transcript_id	gnl|ncbi|FUN_004038-T1_mrna
			protein_id	gnl|ncbi|FUN_004038-T1
173315	173498	CDS
173554	173670
173722	>173867
			codon_start	1
			product	hypothetical protein
			transcript_id	gnl|ncbi|FUN_004038-T1_mrna
			protein_id	gnl|ncbi|FUN_004038-T1

So I think this means there must be a problem with the adjustment of this gene model in relation to the sequencing gap. The end coordinate of the CDS model should likely read >173862 or the gene and mRNA features should also extend to 173867.

The same is true of the other model:

1156	>1683	gene
			locus_tag	FUN_012373
1156	1231	mRNA
1285	1421
1479	>1683
			product	hypothetical protein
			transcript_id	gnl|ncbi|FUN_012373-T1_mrna
			protein_id	gnl|ncbi|FUN_012373-T1
1208	1231	CDS
1285	1421
1479	>1685
			codon_start	1
			product	hypothetical protein
			transcript_id	gnl|ncbi|FUN_012373-T1_mrna
			protein_id	gnl|ncbi|FUN_012373-T1

I need to see if I can find where this might be happening, ie if it is something that funannotate is doing or if it just comes from PASA directly and is parsed incorrectly.

nextgenusfs avatar Aug 25 '21 00:08 nextgenusfs

@skylerhar1 can you pull out the tbl format for these gene model from funannotate predict, ie in the predict_results/*.tbl file? That will help us determine if PASA modified these to be problematic of if they were that way from predict. Thanks.

nextgenusfs avatar Aug 25 '21 01:08 nextgenusfs

Here it is! (converted to a txt file for github) Rhizopogon_salebrosus_fun_predict.txt

skylerhar1 avatar Aug 25 '21 03:08 skylerhar1

Okay, so PASA is making these changes as the model is different -- and then introducing these "errors". I should be able to find a way to fix them in the update script.

173315	>173808	gene
			locus_tag	FUN_004038
173315	173498	mRNA
173554	173670
173722	>173808
			product	hypothetical protein
			transcript_id	gnl|ncbi|FUN_004038-T1_mrna
			protein_id	gnl|ncbi|FUN_004038-T1
173315	173498	CDS
173554	173670
173722	>173808
			codon_start	1
			product	hypothetical protein
			transcript_id	gnl|ncbi|FUN_004038-T1_mrna
			protein_id	gnl|ncbi|FUN_004038-T1
1208	>1421	gene
			locus_tag	FUN_012373
1208	1231	mRNA
1285	>1421
			product	hypothetical protein
			transcript_id	gnl|ncbi|FUN_012373-T1_mrna
			protein_id	gnl|ncbi|FUN_012373-T1
1208	1231	CDS
1285	>1421
			codon_start	1
			product	hypothetical protein
			transcript_id	gnl|ncbi|FUN_012373-T1_mrna
			protein_id	gnl|ncbi|FUN_012373-T1

nextgenusfs avatar Aug 25 '21 15:08 nextgenusfs

sounds good, let me know if there are any updates

skylerhar1 avatar Sep 07 '21 21:09 skylerhar1

Hi @skylerhar1 sorry for the delay. I'm unable to test this locally because I don't have a dataset that exhibits this behavior. So I pushed some updated code to a new branch and I will need your help to test.

You can install the updated code with this command from your funannotate environment:

python -m pip install git+https://github.com/nextgenusfs/funannotate.git@pasa_update_fix

And then if you could re-run your update command and see what the behavior is, ie does it error out? And then furthermore if you could get me these two gene models FUN_004038 and FUN_012373 I need to see if their coordinates are now fixed in relation to the sequence gap adjacent to each model.

nextgenusfs avatar Sep 11 '21 18:09 nextgenusfs

Should I delete the update_results and update_misc files from the directory and then rerun update, or should I leave them in.

On Sat, Sep 11, 2021 at 11:39 AM Jon Palmer @.***> wrote:

[This email originated from outside of OSU. Use caution with links and attachments.]

Hi @skylerhar1 https://github.com/skylerhar1 sorry for the delay. I'm unable to test this locally because I don't have a dataset that exhibits this behavior. So I pushed some updated code to a new branch and I will need your help to test.

You can install the updated code with this command from your fun annotate environment:

python -m pip install @.***_update_fix

And then if you could re-run your update command and see what the behavior is, ie does it error out? And then furthermore if you could get me these two gene models FUN_004038 and FUN_012373 I need to see if their coordinates are now fixed in relation to the sequence gap adjacent to each model.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/nextgenusfs/funannotate/issues/624#issuecomment-917454214, or unsubscribe https://github.com/notifications/unsubscribe-auth/AQD3U45T4O6IWTO7LFAXUQDUBOO6PANCNFSM5BUPP7ZQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

-- Sincerely, Skyler Har

skylerhar1 avatar Sep 12 '21 19:09 skylerhar1

If you could rename the update_misc and update_results folder that would be good. Then it will run from beginning and we can compare results from old versus new.

nextgenusfs avatar Sep 12 '21 19:09 nextgenusfs

I'm a little confused, the coordinates are now different by quite a bit with these gene models. You sure this is the same input from funannotate predict? It seems like PASA this time either didn't call gene models in these regions or they were substantially changed?

nextgenusfs avatar Sep 27 '21 02:09 nextgenusfs

Thats my bad, I will update you with the correct tbl results soon

skylerhar1 avatar Sep 27 '21 18:09 skylerhar1

Sorry about that, I accidentally used data from a different trial run. Heres the correct tbl file:

173086 >173862 gene locus_tag FUN_004038 173086 173143 mRNA 173244 173498 173554 173670 173722 >173862 product hypothetical protein transcript_id gnl|ncbi|FUN_004038-T1_mrna protein_id gnl|ncbi|FUN_004038-T1 173315 173498 CDS 173554 173670 173722 >173086 codon_start 1 product hypothetical protein transcript_id gnl|ncbi|FUN_004038-T1_mrna protein_id gnl|ncbi|FUN_004038-T1


1156 >1683 gene locus_tag FUN_012373 1156 1231 mRNA 1285 1421 1479 >1683 product hypothetical protein transcript_id gnl|ncbi|FUN_012373-T1_mrna protein_id gnl|ncbi|FUN_012373-T1 1208 1231 CDS 1285 1421 1479 >1156 codon_start 1 product hypothetical protein transcript_id gnl|ncbi|FUN_012373-T1_mrna protein_id gnl|ncbi|FUN_012373-T1

skylerhar1 avatar Sep 30 '21 03:09 skylerhar1

additionally it seems those two gene models were special because while there were 44 gene models that needed correcting, these two had special errors: FUN_000096 Feature begins or ends in gap starting at 214116 FUN_000427 Feature begins or ends in gap starting at 191884 FUN_001052 Feature begins or ends in gap starting at 541846 FUN_001326 Feature begins or ends in gap starting at 278278 FUN_001490 Feature begins or ends in gap starting at 278911 FUN_002558 Feature begins or ends in gap starting at 288775 FUN_003011 Feature begins or ends in gap starting at 131505 FUN_003085 Feature begins or ends in gap starting at 64330 FUN_003209 Feature begins or ends in gap starting at 160428 FUN_003251 Feature begins or ends in gap starting at 33414 FUN_003713 Feature begins or ends in gap starting at 55898 FUN_003721 Feature begins or ends in gap starting at 83050 FUN_004038 Location: Mixed strands in SeqLoc [(lcl|scaffold_33:173315-173498, 173554-173670, c173722-<173086)] FUN_004054 Feature begins or ends in gap starting at 40758 FUN_004391 Feature begins or ends in gap starting at 96178 FUN_005024 Feature begins or ends in gap starting at 22438 FUN_005046 Feature begins or ends in gap starting at 80164 FUN_005685 Feature begins or ends in gap starting at 47866 FUN_006193 Feature begins or ends in gap starting at 35322 FUN_006237 Feature begins or ends in gap starting at 95310 FUN_006594 Feature begins or ends in gap starting at 11874 FUN_006755 Feature begins or ends in gap starting at 29046 FUN_007256 Feature begins or ends in gap starting at 61500 FUN_007718 Feature begins or ends in gap starting at 41587 FUN_008205 Feature begins or ends in gap starting at 6217 FUN_008884 Feature begins or ends in gap starting at 17110 FUN_008915 Feature begins or ends in gap starting at 7657 FUN_008932 Feature begins or ends in gap starting at 4013 FUN_009350 Feature begins or ends in gap starting at 6175 FUN_009476 Feature begins or ends in gap starting at 19797 FUN_009798 Feature begins or ends in gap starting at 15724 FUN_010163 Feature begins or ends in gap starting at 15695 FUN_010720 Feature begins or ends in gap starting at 8141 FUN_011473 Feature begins or ends in gap starting at 5125 FUN_011964 Feature begins or ends in gap starting at 2283 FUN_012373 Location: Mixed strands in SeqLoc [(lcl|scaffold_1074:1208-1231, 1285-1421, c1479-<1156)] FUN_012865 Feature begins or ends in gap starting at 581860 FUN_012869 Feature begins or ends in gap starting at 163870

skylerhar1 avatar Sep 30 '21 04:09 skylerhar1