Not enough models to train augustus

Open • aberaslop opened this issue 3 years ago • 1 comment

Hi Jon!

I have encountered a problem when running a new funannotate analysis with a new installation. As always, thank you so much for your help!

L.

Are you using the latest release?

I think so: funannotate v1.8.9, installed in April 2022.

Describe the bug

The analysis stops after checking for training parameters because there are not enough models to train augustus (199/200). A previous run without ESTs as evidence (only RNA data from funannotate_train) stopped at the same point. Once I added the EST data, I also got a Trinity error. This could perhaps be because the RNA-seq data come from a different species within the same genus. I saw that other people with a similar augustus problem were advised to downgrade the program, but I already have augustus 3.3.3, so I guess that is not the problem. I have not yet tried --min_training_models; tweaking this number may solve the problem, but I would like to know what is causing it in the first place.

What command did you issue?

funannotate predict -i FSPLA_clean_sorted_masked.fasta --transcript_evidence evidence/FusoxFo47_2_EST_20201023_cluster_consensi.fasta evidence/Fusso1_EST_20180401_cluste_consensi.fasta evidence/Necha2_ESTs.fasta evidence/FoxII5_EST_20201108_cluster_consensi.fasta -o funannotate_train -s "Fusarium sp" --cpus 15 --name FSPLA

Logfiles

Please provide relevant log files of the error.

funannotate-predict.log, script.output.txt

OS/Install Information

  • output of funannotate check --show-versions

Checking dependencies for 1.8.9

You are running Python v 3.8.12. Now checking python packages...
biopython: 1.77
goatools: 1.2.3
matplotlib: 3.4.3
natsort: 8.1.0
numpy: 1.22.3
pandas: 1.4.1
psutil: 5.9.0
requests: 2.27.1
scikit-learn: 1.0.2
scipy: 1.8.0
seaborn: 0.11.2
All 11 python packages installed

You are running Perl v b'5.026002'. Now checking perl modules...
Carp: 1.38
Clone: 0.42
DBD::SQLite: 1.64
DBD::mysql: 4.046
DBI: 1.642
DB_File: 1.855
Data::Dumper: 2.173
File::Basename: 2.85
File::Which: 1.23
Getopt::Long: 2.5
Hash::Merge: 0.300
JSON: 4.02
LWP::UserAgent: 6.39
Logger::Simple: 2.0
POSIX: 1.76
Parallel::ForkManager: 2.02
Pod::Usage: 1.69
Scalar::Util::Numeric: 0.40
Storable: 3.15
Text::Soundex: 3.05
Thread::Queue: 3.12
Tie::File: 1.02
URI::Escape: 3.31
YAML: 1.29
threads: 2.15
threads::shared: 1.56
ERROR: Bio::Perl not installed, install with cpanm Bio::Perl

Checking Environmental Variables...
$FUNANNOTATE_DB=/vm/share/aylin/software/funannotateDB
$PASAHOME=/vm/share/aylin/anaconda/envs/funannotate/opt/pasa-2.4.1
$TRINITY_HOME=/vm/share/aylin/anaconda/envs/funannotate/opt/trinity-2.8.5
$EVM_HOME=/vm/share/aylin/anaconda/envs/funannotate/opt/evidencemodeler-1.1.1
$AUGUSTUS_CONFIG_PATH=/vm/share/aylin/anaconda/envs/funannotate/config/
$GENEMARK_PATH=/vm/share/aylin/software/gmes_linux_64_4
All 6 environmental variables are set

Checking external dependencies...
PASA: 2.4.1
CodingQuarry: 2.0
Trinity: 2.8.5
augustus: 3.3.3
bamtools: bamtools 2.5.1
bedtools: bedtools v2.30.0
blat: BLAT v36
diamond: 2.0.14
emapper.py: 2.1.7
ete3: 3.1.2
exonerate: exonerate 2.4.0
fasta: no way to determine
glimmerhmm: 3.0.4
gmap: 2017-11-15
gmes_petap.pl: 4.69_lic
hisat2: 2.2.1
hmmscan: HMMER 3.3.2 (Nov 2020)
hmmsearch: HMMER 3.3.2 (Nov 2020)
java: 11.0.13
kallisto: 0.46.1
mafft: v7.490 (2021/Oct/30)
makeblastdb: makeblastdb 2.2.31+
minimap2: 2.24-r1122
proteinortho: 6.0.33
pslCDnaFilter: no way to determine
salmon: salmon 0.14.1
samtools: samtools 1.12
signalp: 5.0b
snap: 2006-07-28
stringtie: 2.2.1
tRNAscan-SE: 2.0.9 (July 2021)
tantan: tantan 26
tbl2asn: no way to determine, likely 25.X
tblastn: tblastn 2.2.31+
trimal: trimAl v1.4.rev15 build[2013-12-17]
trimmomatic: 0.39
All 36 external dependencies are installed

aberaslop, May 25 '22 14:05

You can set a lower bound for the number of training models with --min_training_models 150, but I think this refers to BUSCO-based training. If training with RNA-seq is not generating many models, that suggests the RNA does not really match your genome. The Trinity errors should be fixed first, but I wonder whether they occur because your reads are not aligning to the genome.
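As a rough sketch of the first suggestion, the command from the report could be rerun with the threshold lowered. Everything except --min_training_models 150 is copied from the original command; the run helper and the DRY_RUN switch are illustrative additions, not part of funannotate, and by default the command is only printed.

```shell
# Sketch only. DRY_RUN=1 (default) prints the command; DRY_RUN=0 executes it.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = 0 ]; then "$@"; fi
}

predict_lower_threshold() {
  # Same arguments as the reported command, plus the lowered training threshold.
  # The reported run stopped at 199/200 models, so 150 would let it continue.
  run funannotate predict -i FSPLA_clean_sorted_masked.fasta \
    --transcript_evidence \
      evidence/FusoxFo47_2_EST_20201023_cluster_consensi.fasta \
      evidence/Fusso1_EST_20180401_cluste_consensi.fasta \
      evidence/Necha2_ESTs.fasta \
      evidence/FoxII5_EST_20201108_cluster_consensi.fasta \
    -o funannotate_train -s "Fusarium sp" --cpus 15 --name FSPLA \
    --min_training_models 150
}

predict_lower_threshold
```

This only gets past the check; it does not explain why so few models were produced, which is what the rest of this reply is about.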

a) Say you are training with cross-species RNA: genome-guided Trinity (Trinity-GG) fails because too few reads match the genome, so you get no alignments and therefore no Trinity transcripts.

b) You could assemble transcripts de novo with Trinity and provide them as input. However, train will not really work without RNA-seq reads to help the system choose the top expressed genes (@nextgenusfs, we may want to talk about whether there is a path to run transcript-based training without raw RNA-seq).
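One way to test the "does not match the genome" hypothesis outside funannotate is to align one of the transcript evidence files to the assembly and look at the mapped fraction. The sketch below uses minimap2 in spliced mode and samtools flagstat, both of which appear in the dependency list above. It is an assumption that this is a useful proxy: it checks the EST consensus files rather than the raw reads (which would need a short-read aligner such as hisat2), and the output file name, the run helper, and the DRY_RUN switch are hypothetical additions.

```shell
# Sketch only: a mapping check run by hand, not part of the funannotate workflow.
# DRY_RUN=1 (default) prints the commands; DRY_RUN=0 executes them.
GENOME=FSPLA_clean_sorted_masked.fasta

run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = 0 ]; then "$@"; fi
}

check_mapping() {
  est=$1
  # Hypothetical output name derived from the input file name.
  sam="$(basename "$est" .fasta).vs_genome.sam"
  # minimap2: -ax splice = spliced alignment with SAM output,
  # -t = threads, -o = output file.
  # samtools flagstat then summarises how many records are mapped.
  run minimap2 -ax splice -t 15 -o "$sam" "$GENOME" "$est" &&
    run samtools flagstat "$sam"
}

check_mapping evidence/Necha2_ESTs.fasta
```

A low mapped fraction in the flagstat summary would be consistent with the cross-species explanation; a high one would point back at the Trinity error itself.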

If your RNA-seq is not helping, you could remove the training folder and run predict with an assembled transcript file given explicitly (--transcript_evidence). This would trigger BUSCO-based training only, and the transcripts would then be used as supporting evidence for the augustus gene models.
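A minimal sketch of that last route, reusing the command from the report. It assumes the folder written by funannotate train is funannotate_train/training; check your own output directory before running anything. The folder is moved aside rather than deleted so the step can be undone, and the run helper, the DRY_RUN switch, and the training.bak name are illustrative additions.

```shell
# Sketch only. ASSUMPTION: the training folder is $OUT/training.
# DRY_RUN=1 (default) prints the commands; DRY_RUN=0 executes them.
OUT=funannotate_train
EVIDENCE="evidence/FusoxFo47_2_EST_20201023_cluster_consensi.fasta evidence/Fusso1_EST_20180401_cluste_consensi.fasta evidence/Necha2_ESTs.fasta evidence/FoxII5_EST_20201108_cluster_consensi.fasta"

run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = 0 ]; then "$@"; fi
}

rerun_without_training() {
  # Move the folder aside instead of deleting it; predict runs only if mv succeeds.
  # $EVIDENCE is left unquoted on purpose so it splits into one argument per file.
  run mv "$OUT/training" "$OUT/training.bak" &&
    run funannotate predict -i FSPLA_clean_sorted_masked.fasta \
      --transcript_evidence $EVIDENCE \
      -o "$OUT" -s "Fusarium sp" --cpus 15 --name FSPLA
}

rerun_without_training
```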

hyphaltip, May 25 '22 17:05