funannotate
Final GFF is listing human gene names with random underscores at the end
Are you using the latest release? Yes.
Describe the bug: I've obtained the final GFF after running 'funannotate annotate'. However, I've noticed that a lot of the gene names have underscore suffixes that shouldn't be there. For example, "PNRC2_1", "PNRC2_2", "ACO1_1", "ACO1_2", when the actual gene names should be listed as "PNRC2" and "ACO1". These are human genes, for reference. Can someone please explain what's happening, and how I can get it to print out the gene names properly? Thank you.
What command did you issue? #funannotate annotate -i ./Output --busco_db mammalia --cpus 16 --eggnog ./test.emapper.annotations
OS/Install Information
Checking dependencies for 1.8.15
You are running Python v 3.8.15. Now checking python packages...
biopython: 1.76
goatools: 1.3.1
matplotlib: 3.4.3
natsort: 8.4.0
numpy: 1.24.2
pandas: 2.0.0
psutil: 5.9.5
requests: 2.31.0
scikit-learn: 1.3.0
scipy: 1.10.1
seaborn: 0.12.2
All 11 python packages installed

You are running Perl v b'5.032001'. Now checking perl modules...
Carp: 1.50
Clone: 0.46
DBD::SQLite: 1.72
DBD::mysql: 4.046
DBI: 1.643
DB_File: 1.858
Data::Dumper: 2.183
File::Basename: 2.85
File::Which: 1.24
Getopt::Long: 2.54
Hash::Merge: 0.302
JSON: 4.10
LWP::UserAgent: 6.67
Logger::Simple: 2.0
POSIX: 1.94
Parallel::ForkManager: 2.02
Pod::Usage: 1.69
Scalar::Util::Numeric: 0.40
Storable: 3.15
Text::Soundex: 3.05
Thread::Queue: 3.14
Tie::File: 1.06
URI::Escape: 5.17
YAML: 1.30
local::lib: 2.000029
threads: 2.25
threads::shared: 1.61
All 27 Perl modules installed

Checking Environmental Variables...
$FUNANNOTATE_DB=/home/xxx/funannotate_db
$PASAHOME=/home/xxx/miniconda3/envs/funannotate/opt/pasa-2.5.3
$TRINITY_HOME=/home/xxx/miniconda3/envs/funannotate/opt/trinity-2.8.5
$EVM_HOME=/home/xxx/miniconda3/envs/funannotate/opt/evidencemodeler-1.1.1
$AUGUSTUS_CONFIG_PATH=/home/xxx/miniconda3/envs/funannotate/config/
ERROR: GENEMARK_PATH not set. export GENEMARK_PATH=/path/to/dir

Checking external dependencies...
PASA: 2.5.3
CodingQuarry: 2.0
Trinity: 2.8.5
augustus: 3.5.0
bamtools: bamtools 2.5.1
bedtools: bedtools v2.31.0
blat: BLAT v37x1
diamond: 2.1.8
emapper.py: 2.1.10
ete3: 3.1.3
exonerate: exonerate 2.4.0
fasta: 36.3.8g
glimmerhmm: 3.0.4
gmap: 2023-07-20
hisat2: 2.2.1
hmmscan: HMMER 3.3.2 (Nov 2020)
hmmsearch: HMMER 3.3.2 (Nov 2020)
java: 17.0.3-internal
kallisto: 0.46.1
mafft: v7.520 (2023/Mar/22)
makeblastdb: makeblastdb 2.14.1+
minimap2: 2.26-r1175
pigz: 2.6
proteinortho: 6.3.0
pslCDnaFilter: no way to determine
salmon: salmon 0.14.1
samtools: samtools 1.17
snap: 2006-07-28
stringtie: 2.2.1
tRNAscan-SE: 2.0.12 (Nov 2022)
tantan: tantan 40
tbl2asn: 25.8
tblastn: tblastn 2.14.1+
trimal: trimAl v1.4.rev15 build[2013-12-17]
trimmomatic: 0.39
ERROR: gmes_petap.pl not installed
ERROR: signalp not installed
The scripts will do this when there are two homologs present, i.e., more than one gene model is assigned the same name. Do you have a haploid assembly, or are there two copies of some genes?
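To illustrate the behaviour described above, here is a minimal Python sketch of how a numeric suffix can be appended when the same gene name is assigned to more than one gene model. This is not funannotate's actual code; the function name `suffix_duplicate_names` and the exact numbering scheme are assumptions made only to mirror the "PNRC2_1" / "PNRC2_2" pattern reported in this issue.

```python
from collections import Counter


def suffix_duplicate_names(names):
    """Append _1, _2, ... to every name that occurs more than once.

    Hypothetical helper for illustration only; names that occur exactly
    once are returned unchanged.
    """
    totals = Counter(names)   # how many gene models share each name
    seen = Counter()          # running copy number per name
    result = []
    for name in names:
        if totals[name] > 1:
            seen[name] += 1
            result.append(f"{name}_{seen[name]}")
        else:
            result.append(name)
    return result


# Two gene models were both assigned PNRC2 and ACO1; TP53 is unique.
print(suffix_duplicate_names(["PNRC2", "ACO1", "PNRC2", "TP53", "ACO1"]))
# prints ['PNRC2_1', 'ACO1_1', 'PNRC2_2', 'TP53', 'ACO1_2']
```

Under this scheme the suffix keeps every gene name unique, so the underscores mark gene models that received the same name rather than being random.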
Hi Jon, thanks for the reply!
This is a previously uncharacterized diploid mammal species. The transcript support came from a de novo Trinity transcriptome reconstruction, and the protein support was a concatenated, non-redundant FASTA of Swiss-Prot protein sequences from other mammals. I'm not sure what you mean by two copies of some genes. How would this arise? From the genome assembly?
I greatly appreciate your help, thank you.
Is your assembly diploid or haploid was all I was asking. If it's diploid (two copies of each gene) that would explain the behavior.
Yes, diploid. How can I circumvent this issue though? Can I just rename the genes, removing the underscores?
I wouldn't call it an "issue": I don't think you can have two identical gene names under NCBI rules. Assemblies were traditionally haploidized, but since I don't work on diploids, I don't know what the current rules are now that technology allows for phased assemblies.
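Before deciding how to handle the suffixed names, it may help to see which base names are affected and how many copies each has. The following Python sketch groups suffixed gene names from GFF3 lines. It assumes the gene name is stored in a `Name=` attribute on `gene` features and that the suffix is always an underscore followed by digits; both are assumptions about the output file, and `group_suffixed_gene_names` is a hypothetical helper, not part of funannotate.

```python
import re
from collections import defaultdict

# Assumed suffix pattern: base name, underscore, copy number (e.g. PNRC2_1).
SUFFIX = re.compile(r"^(?P<base>.+)_(?P<copy>\d+)$")


def group_suffixed_gene_names(gff_lines):
    """Map each base gene name to the suffixed names found on gene features."""
    groups = defaultdict(list)
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives, and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue  # only look at well-formed gene features
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        match = SUFFIX.match(attrs.get("Name", ""))
        if match:
            groups[match.group("base")].append(attrs["Name"])
    return dict(groups)


# Made-up example lines; real input would come from the final GFF3 file.
sample = [
    "##gff-version 3",
    "scaffold_1\tfunannotate\tgene\t100\t900\t.\t+\t.\tID=G_000001;Name=PNRC2_1;",
    "scaffold_7\tfunannotate\tgene\t300\t1200\t.\t-\t.\tID=G_000002;Name=PNRC2_2;",
    "scaffold_2\tfunannotate\tgene\t50\t700\t.\t+\t.\tID=G_000003;Name=TP53;",
]
print(group_suffixed_gene_names(sample))
```

Note that a pattern like this would also match any gene whose official name genuinely ends in an underscore and digits, so the result is a list to review by hand. Stripping the suffixes would bring back the identical gene names that the suffixes were added to avoid.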