funannotate Busco scores of predicted proteins much lower than busco of genome

Busco scores of predicted proteins much lower than busco of genome

Open jguhlin opened this issue 2 years ago • 12 comments

Are you using the latest release? Yes

Describe the bug Genome is assembled and gets a high busco score with 5.0 comes out to ~95% with our genomes. But after annotation and running busco in protein mode on the predicted proteins from funannotate (with rna-seq) comes out significantly less (~60% - 75% depending on which run and genome).

Any ideas what would be causing that?

Feb 15 '22 01:02 jguhlin

Are you running BUSCO 5.0 with metaeuk or Augustus? I'm not sure the metaeuk method is generating complete/proper gene models, at least metaeuk doesn't do this on its own. Note I've not looked into comparing the two different BUSCO backends.

Does the genome have a lot of repeats?

Feb 15 '22 01:02 nextgenusfs

I'd say average repeats, it's 3 different genomes here, all insects, and another one with a collaborator (so I haven't seen it directly). BUSCO 5 with metaeuk, but I'll try to switch the backend and see if that works and report back.

Feb 15 '22 01:02 jguhlin

Are the number of genes being called by funannotate reasonable or is it low?

Feb 15 '22 02:02 nextgenusfs

Reasonable, a little higher than related species.

Feb 15 '22 02:02 jguhlin

BUSCO with Augustus backend also came up really high.

nice busco -i masked.fasta --auto-lineage-euk -m genome -o genome_only_busco --cpu 96 --augustus --augustus_species fly

         --------------------------------------------------
        |Results from generic domain eukaryota_odb10      |
        --------------------------------------------------
        |C:97.6%[S:97.6%,D:0.0%],F:1.2%,M:1.2%,n:255      |
        |249    Complete BUSCOs (C)                       |
        |249    Complete and single-copy BUSCOs (S)       |
        |0      Complete and duplicated BUSCOs (D)        |
        |3      Fragmented BUSCOs (F)                     |
        |3      Missing BUSCOs (M)                        |
        |255    Total BUSCO groups searched               |
        --------------------------------------------------

        --------------------------------------------------
        |Results from dataset hymenoptera_odb10           |
        --------------------------------------------------
        |C:95.7%[S:95.5%,D:0.2%],F:1.5%,M:2.8%,n:5991     |
        |5733   Complete BUSCOs (C)                       |
        |5724   Complete and single-copy BUSCOs (S)       |
        |9      Complete and duplicated BUSCOs (D)        |
        |89     Fragmented BUSCOs (F)                     |
        |169    Missing BUSCOs (M)                        |
        |5991   Total BUSCO groups searched               |
        --------------------------------------------------

vs.
❯ nice busco -i round1/update_results/Polistes_dominula.proteins.fa --auto-lineage-euk -m prot -o prot_busco --cpu 96
        --------------------------------------------------
        |Results from generic domain eukaryota_odb10      |
        --------------------------------------------------
        |C:73.8%[S:67.5%,D:6.3%],F:3.5%,M:22.7%,n:255     |
        |188    Complete BUSCOs (C)                       |
        |172    Complete and single-copy BUSCOs (S)       |
        |16     Complete and duplicated BUSCOs (D)        |
        |9      Fragmented BUSCOs (F)                     |
        |58     Missing BUSCOs (M)                        |
        |255    Total BUSCO groups searched               |
        --------------------------------------------------

        --------------------------------------------------
        |Results from dataset hymenoptera_odb10           |
        --------------------------------------------------
        |C:64.6%[S:52.4%,D:12.2%],F:3.7%,M:31.7%,n:5991   |
        |3875   Complete BUSCOs (C)                       |
        |3142   Complete and single-copy BUSCOs (S)       |
        |733    Complete and duplicated BUSCOs (D)        |
        |221    Fragmented BUSCOs (F)                     |
        |1895   Missing BUSCOs (M)                        |
        |5991   Total BUSCO groups searched               |
        --------------------------------------------------

My train and predict scripts were pretty simple, but I can take a look at tweaking a bit more. Version is v1.8.9

❯ cat 4_train.sh
nice nice funannotate train -i masked.fasta \
  -o round1 \
  --cpus 128 \
  -l left.fq.gz \
  -r right.fq.gz \
  --max_intronlen 50000

<wd>/Polistes_annotation/Pdom biochemcompute3 in via  funannotate
❯ cat 5_predict.sh
nice nice funannotate predict -i masked.fasta \
  -o round1 -s "Polistes dominula" --organism other --optimize_augustus \
  --busco_db hymenoptera \
  --cpus 32 \
  --busco_seed_species vvul_final

If I run getAnnoFasta.pl on the augustus.gff in predict_misc and then busco on that output, my numbers go up to 80%.

        --------------------------------------------------
        |Results from dataset hymenoptera_odb10           |
        --------------------------------------------------
        |C:80.7%[S:80.4%,D:0.3%],F:4.7%,M:14.6%,n:5991    |
        |4831   Complete BUSCOs (C)                       |
        |4816   Complete and single-copy BUSCOs (S)       |
        |15     Complete and duplicated BUSCOs (D)        |
        |280    Fragmented BUSCOs (F)                     |
        |880    Missing BUSCOs (M)                        |
        |5991   Total BUSCO groups searched               |
        --------------------------------------------------

I'm gonna poke around a bit more and see where I end up, but about to stop work for the day. Cheers!

Feb 15 '22 04:02 jguhlin

Nice summary @jguhlin. So a few things.

How many gene models came out of PASA? So when you run funannotate train it will only use PASA models for training of the ab initio predictors. So I can imagine that if RNA-seq isn't great and this results in poorer PASA models that training with BUSCO might actually be better. I also need to check the code and ensure that the --busco_seed_species is being respected as the starting point for Augustus training. But that would only effect Augustus, not any of the other programs.
I don't find that --optimize_augustus is typically very helpful, if anything I've seen model scores get worse in some instances.
Are you able to pull out the coordinates of the BUSCO models from genome run say in a bed format and then slice the gene_predictions.gff3 file? That would tell you if there are any ab initio models in these regions, ie I'm trying to get at whether there were no predictions in these regions that BUSCO found genes or if there were predictions passed to EVM, but EVM didn't call them genes for some reason.

So other thing to try if you have the compute resources, would be to try to just run BUSCO mediated training, ie something like this:

funannotate predict -i masked.fasta \
  -o busco-only -s "Polistes dominula" --isolate busco --organism other  \
  --busco_db hymenoptera \
  --cpus 32 \
  --busco_seed_species fly

The above will run BUSCO to generate the models to use for training, it will start the process using the pre-built fly parameters.

If the BUSCO mediated training yields improved results, then you could use that training set for another run and then add the PASA and RNA-seq alignments to refine the models.

Feb 15 '22 19:02 nextgenusfs

funannotate predict -i masked.fasta \
  -o busco-only -s "Polistes dominula" --isolate busco --organism other  \
  --busco_db hymenoptera \
  --cpus 32 \
  --busco_seed_species fly

Gave me:

busco -i busco-only/predict_results/Polistes_dominula_busco.proteins.fa --lineage hymenoptera -m prot -o prot_busco_only_training --cpu 128

2022-02-23 09:55:24 INFO:	***** Start a BUSCO v5.2.2 analysis, current time: 02/23/2022 09:55:24 *****
.....
	--------------------------------------------------
	|Results from dataset hymenoptera_odb10           |
	--------------------------------------------------
	|C:0.2%[S:0.2%,D:0.0%],F:0.1%,M:99.7%,n:5991      |
	|12	Complete BUSCOs (C)                       |
	|12	Complete and single-copy BUSCOs (S)       |
	|0	Complete and duplicated BUSCOs (D)        |
	|5	Fragmented BUSCOs (F)                     |
	|5974	Missing BUSCOs (M)                        |
	|5991	Total BUSCO groups searched               |
	--------------------------------------------------

Looks like Augustus failed on this, but GlimmerHMM came up with a ton of predictions:

  Source         Weight   Count
  Augustus       1        1778
  Augustus HiQ   2        91
  GlimmerHMM     1        38541
  snap           1        257
  Total          -        40667

Which is really odd. Going to dig into it.

Update: Somehow Nextpolish or something masked 99.7% of the genome, so I've remasked it with funannotate (and a perl script to capitalize everything). Now it's 20.22% masked and giving it a re-run. Also doing another run with optimize-augustus as the only difference (on the correct masked one). Will report back.

Feb 22 '22 20:02 jguhlin

Yep, that was causing some problems. Not all of the genomes have that problem just the two wasps.

Busco only trained, fixed masking

	--------------------------------------------------
	|Results from dataset hymenoptera_odb10           |
	--------------------------------------------------
	|C:78.8%[S:78.6%,D:0.2%],F:5.3%,M:15.9%,n:5991    |
	|4721	Complete BUSCOs (C)                       |
	|4707	Complete and single-copy BUSCOs (S)       |
	|14	Complete and duplicated BUSCOs (D)        |
	|315	Fragmented BUSCOs (F)                     |
	|955	Missing BUSCOs (M)                        |
	|5991	Total BUSCO groups searched               |
	--------------------------------------------------

Busco only trained, fixed masking, optimize augustus (small decrease)

	--------------------------------------------------
	|Results from dataset hymenoptera_odb10           |
	--------------------------------------------------
	|C:78.0%[S:77.7%,D:0.3%],F:9.1%,M:12.9%,n:5991    |
	|4671	Complete BUSCOs (C)                       |
	|4656	Complete and single-copy BUSCOs (S)       |
	|15	Complete and duplicated BUSCOs (D)        |
	|546	Fragmented BUSCOs (F)                     |
	|774	Missing BUSCOs (M)                        |
	|5991	Total BUSCO groups searched               |
	--------------------------------------------------

I'm going to check into PASA and such yet, as the next step (your first 3 suggestions) now that this is working properly.

Feb 23 '22 19:02 jguhlin

It looks the snap failed here. see https://github.com/nextgenusfs/funannotate/issues/386 for the fixing. Actually I also failed running snap before, and the protein busco value turns out very low (1/2 of genome level busco). But after fixing the snap issue, the busco value reached to almost the same of the genome level. Hope it helps.

Oct 25 '22 15:10 leegene1992

Hi @jguhlin and @nextgenusfs, I have a similar issue where using RNA-seq data as training for PASA seems to give very poor gene models (BUSCO completeness of ~20%) despite the genome assembly having a BUSCO completeness of ~95% for tetrapoda db. To verify if the RNA-seq data was poor, I ran BUSCO on trinity and stringtie-derived transcriptome models and I can confirm the data is of low quality so probably best to avoid this for funannotate.

@jguhlin did you re-run funannotate train using busco db only? How did you de-prioritize/avoid the PASA training step and then feed that output into funannotate predict?

I'm guessing the best approach is to use funannotate predict without the RNA-seq data but my question is if it is worth training on BUSCO DB first and then predict?

Thanks for the help!

Jun 14 '23 16:06 kushalsuryamohan

@kushalsuryamohan Hey, what I did was use the intermediate output (temp or work directory, look for augustus.gff3 file) and use that, it has much higher BUSCO scores.

As I'm not working on fungal genomes, but rather other eukaryotes, I've actually switched to GALBA + BRAKER3 (when I have RNA-Seq, which is rare). Hope this helps! Let me know if you can't find the intermediate augustus files and I can try to get the exact file name.

Jun 19 '23 22:06 jguhlin

Hi @jguhlin, thanks for getting back to me. Is this the directory you refer to that contains the augustus.gff3?

-bash-4.2$ ls -lrt busco_only_training/predict_misc/ total 2740908 -rw-rw-r-- 1 kushal.s kushal.s 116191432 Jun 14 14:19 repeatmasker.bed -rw-rw-r-- 1 kushal.s kushal.s 338630 Jun 14 14:19 assembly-gaps.bed -rw-rw-r-- 1 kushal.s kushal.s 1789662422 Jun 14 14:19 genome.softmasked.fa -rw-rw-r-- 1 kushal.s kushal.s 212745361 Jun 14 14:20 proteins.combined.fa -rw-rw-r-- 1 kushal.s kushal.s 1852943 Jun 14 15:04 protein_alignments.gff3 -rw-rw-r-- 1 kushal.s kushal.s 25120123 Jun 14 15:04 p2g.diamond.out -rw-rw-r-- 1 kushal.s kushal.s 1929922 Jun 14 15:05 hints.P.gff -rw-rw-r-- 1 kushal.s kushal.s 1929922 Jun 14 15:05 hints.all.tmp -rw-rw-r-- 1 kushal.s kushal.s 1929922 Jun 14 15:05 hints.all.sort.tmp -rw-rw-r-- 1 kushal.s kushal.s 205716 Jun 14 15:05 hints.ALL.gff -rw-rw-r-- 1 kushal.s kushal.s 2178428 Jun 15 01:25 gmhmm.mod drwxrwxr-x 6 kushal.s kushal.s 159 Jun 15 03:28 genemark -rw-rw-r-- 1 kushal.s kushal.s 118537051 Jun 15 03:28 genemark.gff -rw-rw-r-- 1 kushal.s kushal.s 106049608 Jun 15 03:29 genemark.temp.gff -rw-rw-r-- 1 kushal.s kushal.s 106049608 Jun 15 03:29 genemark.evm.gff3.bak -rw-rw-r-- 1 kushal.s kushal.s 106049608 Jun 15 03:29 genemark.evm.gff3 drwxrwxr-x 3 kushal.s kushal.s 78 Jun 15 03:29 ab_initio_parameters drwxrwxr-x 4 kushal.s kushal.s 64 Jun 15 03:30 busco -rw-rw-r-- 1 kushal.s kushal.s 84516 Jun 20 04:49 buscos.bed -rw-rw-r-- 1 kushal.s kushal.s 6864440 Jun 20 04:51 busco_augustus.tmp -rw-rw-r-- 1 kushal.s kushal.s 6291530 Jun 20 04:51 busco_augustus.tmp.intermediate -rw-rw-r-- 1 kushal.s kushal.s 6388540 Jun 20 04:51 busco_augustus.gff3 -rw-rw-r-- 1 kushal.s kushal.s 1081838 Jun 20 04:51 busco_augustus.proteins.fasta drwxrwxr-x 4 kushal.s kushal.s 64 Jun 20 04:52 busco_proteins -rw-rw-r-- 1 kushal.s kushal.s 6282686 Jun 20 04:55 busco.final.gff3 -rw-rw-r-- 1 kushal.s kushal.s 93542838 Jun 20 04:55 augustus.training.busco.gb -rw-rw-r-- 1 kushal.s kushal.s 83861835 Jun 20 04:55 augustus.training.busco.gb.train -rw-rw-r-- 1 kushal.s kushal.s 9681003 Jun 20 04:55 augustus.training.busco.gb.test -rw-rw-r-- 1 kushal.s kushal.s 1779896 Jun 20 05:04 augustus.initial.training.txt

The busco-only approach seems to be doing worse. See below the log:

[Jun 14 02:08 PM]: OS: CentOS Linux 7, 104 cores, ~ 791 GB RAM. Python: 3.8.15 [Jun 14 02:08 PM]: Running funannotate v1.8.15 [Jun 14 02:08 PM]: Skipping CodingQuarry as no --rna_bam passed [Jun 14 02:08 PM]: Parsed training data, run ab-initio gene predictors as follows: ESC[4mProgram Training-MethodESC[0m augustus busco
genemark selftraining
glimmerhmm busco
snap busco
[Jun 14 02:17 PM]: Loading genome assembly and parsing soft-masked repetitive sequences [Jun 14 02:19 PM]: Genome loaded: 3,111 scaffolds; 1,767,527,460 bp; 53.75% repeats masked [Jun 14 02:20 PM]: Mapping 556,668 proteins to genome using diamond and exonerate [Jun 14 02:47 PM]: Found 184,478 preliminary alignments with diamond in 0:25:22 --> generated FASTA files for exonerate in 0:02:12 [Jun 14 03:04 PM]: Exonerate finished in 0:16:56: found 5,231 alignments [Jun 14 03:05 PM]: Running GeneMark-ES on assembly [Jun 15 03:29 AM]: 62,857 predictions from GeneMark [Jun 15 03:29 AM]: Running BUSCO to find conserved gene models for training ab-initio predictors [Jun 20 04:49 AM]: 2,136 valid BUSCO predictions found, validating protein sequences [Jun 20 04:55 AM]: 2,111 BUSCO predictions validated [Jun 20 04:55 AM]: Training Augustus using BUSCO gene models [Jun 20 05:04 AM]: Augustus initial training results: ESC[4mFeature Specificity SensitivityESC[0m nucleotides 69.9% 40.8%
exons 31.3% 36.8%
genes 2.8% 1.2%
[Jun 20 05:04 AM]: Accuracy seems low, you can try to improve by passing the --optimize_augustus option. [Jun 20 05:04 AM]: Running Augustus gene prediction using dispholidus_typus_busco parameters

Should I try optimizing augustus? Any suggestions @nextgenusfs?

Jun 20 '23 16:06 kushalsuryamohan

funannotate funannotate copied to clipboard

Busco scores of predicted proteins much lower than busco of genome

funannotate
funannotate copied to clipboard