Funannotate Run time + Perl error

Open ChuChuChaddy opened this issue 1 year ago • 5 comments

This is not so much a bug as a series of small questions. I am a toxicologist who was thrown into the deep end of the bioinformatics pool and I am learning as I go. I am using Funannotate on a gastropod. I may have received some questionable advice from an older graduate student when building the protein database. I was told to include many proteins from eukaryotes, molluscs and gastropods. This made sense to me at the time... until funannotate predict had this in the log:

[Jun 06 09:18 PM]: Mapping 32,394,181 proteins to genome using diamond and exonerate
[Jun 07 04:24 PM]: Found 18,235,649 preliminary alignments with diamond in 8:10:40 --> generated FASTA files for exonerate in 10:54:57

It has been running for about a week now. The admin for our cluster says that it is still running and doing 'work' but it's not writing files, adding to logs, or anything. I totally get that dealing with 18 million alignments is an absurd amount of work, but should it be adding data to the funannotate output directory?

A follow-up question: is it worth including all proteins when I'm running funannotate with the --optimize_augustus option?

I've been following the Funannotate documentation on Read the Docs. I've done funannotate train on my transcriptome and I was able to include that in the functional annotation for the genome. For the genome, I did the genome preparation, and moved on to prediction. I've run InterProScan and eggNOG in preparation for the annotation step, and I'm waiting on funannotate predict to wrap up.

Command:
apptainer exec -B /mydir /mydir/funannotate_latest.sif funannotate predict \
    --cpus 20 -i gastropod.masked.fa --optimize_augustus \
    -d /mydir/funannotate-1.8.15/fundb \
    --rna_bam ./training/trinity.alignments.bam \
    --transcript_evidence ./training/trinity.fasta \
    --protein_evidence ./Euk_Gas_Moll.fa \
    --AUGUSTUS_CONFIG_PATH="/mydir/Augustus/config" \
    --out predict1 --species dikarya --busco_db="/mydir/busco/mollusca_odb10"

*I had to install genemark and augustus on our cluster (which was tough) because if I didn't, funannotate threw a perl error about @INC that I couldn't figure out how to fix.
**I had to go with the default species because I kept getting errors when I would change it.

Perl error:
[06/05/23 09:22:00]: Can't locate Data/Dumper.pm in @INC (you may need to install the Data::Dumper module) (@INC contains: /mydir/perl5/lib/perl5 /etc/perl /usr/local/lib/x86_64-linux-gnu/perl/5.28.1 /usr/local/share/perl/5.28.1 /usr/lib/x86_64-linux-gnu/perl5/5.28 /usr/share/perl5 /usr/lib/x86_64-linux-gnu/perl/5.28 /usr/share/perl/5.28 /usr/local/lib/site_perl /usr/lib/x86_64-linux-gnu/perl-base) at /mydir/software/gmes_linux_64_4/gmes_petap.pl line 83.
BEGIN failed--compilation aborted at /mydir/software/gmes_linux_64_4/gmes_petap.pl line 83.

I installed Data/Dumper and all pre-reqs, but @INC would not include my perl directory and I couldn't figure out how to check the perl included in the docker container.

Thanks for your time! Any help would be appreciated.

ChuChuChaddy avatar Jun 14 '23 17:06 ChuChuChaddy

You can check whether the first line of /mydir/software/gmes_linux_64_4/gmes_petap.pl is #!/usr/bin/perl or #!/usr/bin/env perl. If you have a conda env for your perl with the modules installed, you need to make sure the script is using the loaded perl. Having #!/usr/bin/env perl in the header does this, instead of forcing the system perl install.
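The shebang check described above can be sketched in shell as follows. This is only an illustration under stated assumptions: `show_shebang` and `use_env_perl` are hypothetical helper names, and the demo runs on a throwaway file standing in for gmes_petap.pl rather than on the real script.

```shell
#!/usr/bin/env bash
# Sketch: inspect and optionally rewrite the interpreter line of a Perl script.
# show_shebang and use_env_perl are hypothetical helper names.

show_shebang() {
    # Print the first line of the script, which names the interpreter.
    head -n 1 "$1"
}

use_env_perl() {
    # Replace whatever perl interpreter line is on line 1 with
    # "#!/usr/bin/env perl", keeping a .bak copy of the original file.
    # Switches such as -w are dropped, because a Linux shebang passes
    # everything after the interpreter path to env as a single argument.
    sed -i.bak '1s|^#!.*perl.*$|#!/usr/bin/env perl|' "$1"
}

# Demo on a temporary file standing in for gmes_petap.pl.
demo=$(mktemp)
printf '#!/usr/bin/perl -w\nuse strict;\n' > "$demo"
show_shebang "$demo"   # shows the hard-coded system perl
use_env_perl "$demo"
show_shebang "$demo"   # now resolves perl through PATH
```

To see whether a given perl can load the module, `perl -MData::Dumper -e 'print "ok\n"'` is a quick test; prefixing it with `apptainer exec /mydir/funannotate_latest.sif` should run the same test against the perl inside the container.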

hyphaltip avatar Jun 14 '23 19:06 hyphaltip

Generally we build a conda env for installation, and then all the dependencies are part of that environment, but if you are using an external tool like genemark it will cause that issue. You can also skip genemark by setting its weight to 0 in the predict step; I believe adding --weights genemark:0 will have it skip it altogether.

It seems like you have a lot of proteins (32M) in ./Euk_Gas_Moll.fa. Have you considered reducing that with cd-hit, clustering at 90% or even 75% identity, to see if that reduces the number of proteins? Otherwise that's 18M putative gene models exonerate is going to screen through, which will take a lot of time. We often just use Swiss-Prot as a starting point for this in some fungal datasets; it will certainly run faster, and you can see the performance before you decide to take the longer wait on the large alignment set.
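A rough sketch of the clustering suggestion, assuming cd-hit is installed and on PATH. `count_seqs` is a hypothetical helper, the output file name is made up for illustration, and the thread count simply mirrors the --cpus 20 from the original command. The cd-hit call is skipped when the tool or the input file is missing.

```shell
#!/usr/bin/env bash
# Sketch: count FASTA records, then cluster the protein evidence with cd-hit.
# count_seqs is a hypothetical helper; each FASTA record starts with ">".

count_seqs() {
    grep -c '^>' "$1"
}

# Demo of the counter on a tiny temporary FASTA file.
demo=$(mktemp)
printf '>p1\nMKT\n>p2\nMKV\n>p3\nMAA\n' > "$demo"
count_seqs "$demo"

proteins=./Euk_Gas_Moll.fa
clustered=./Euk_Gas_Moll.cdhit90.fa   # hypothetical output name

# Only run when cd-hit is available and the evidence file is non-empty.
if command -v cd-hit >/dev/null 2>&1 && [ -s "$proteins" ]; then
    # -c  sequence identity threshold (0.9 = 90%)
    # -n  word size (5 is the setting for thresholds between 0.7 and 1.0)
    # -T  number of threads
    # -M  memory limit in MB (0 = unlimited)
    cd-hit -i "$proteins" -o "$clustered" -c 0.9 -n 5 -T 20 -M 0
    count_seqs "$clustered"
fi
```

Comparing the record count before and after clustering shows how much the evidence set shrank; the clustered file would then be passed to --protein_evidence in place of the original.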

hyphaltip avatar Jun 14 '23 19:06 hyphaltip

Not sure what you mean by "I had to go with the default species because I kept getting errors when I would change it." Can you clarify where you had trouble with species? I don't see where you used any default here.

hyphaltip avatar Jun 14 '23 19:06 hyphaltip

not sure what **I had to go with the default species because I kept getting errors when I would change it. - can you clarify where you had trouble with species - I don't see where you used any default here?

I apologize. I remembered that being one of the defaults. I may have misunderstood.

It is in funannotate outgroups:
-b, --busco_db    BUSCO db to use. Default: dikarya

ChuChuChaddy avatar Jun 14 '23 19:06 ChuChuChaddy

Can you restate your problem? I don't understand what you are stuck on now. I don't think the long run time is unexpected for a db of 35M protein alignments to sort through, depending on how many CPUs you throw at this.
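On the earlier question of whether the run is still writing anything, one generic way to check is to list files under the output directory that changed recently. This is a minimal sketch, not a funannotate feature: `recent_files` is a hypothetical helper and the demo uses a temporary directory in place of predict1.

```shell
#!/usr/bin/env bash
# Sketch: list files under a directory modified within the last N minutes.
# recent_files is a hypothetical helper name.

recent_files() {
    # $1 = directory to scan, $2 = age window in minutes
    find "$1" -type f -mmin "-$2"
}

# Demo on a temporary directory standing in for the predict1 output folder.
demo=$(mktemp -d)
touch "$demo/new.txt"
recent_files "$demo" 60
```

Running `recent_files predict1 60` from the working directory would show anything written in the last hour. An empty result does not prove the job is stuck, since a long alignment step may hold its results until it finishes.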

What's the question about BUSCO? You need to provide the name of a BUSCO db and download that BUSCO db to the local folder where you are running funannotate, or put it in the funannotate_db.

Jason Stajich

hyphaltip avatar Jun 22 '23 03:06 hyphaltip