funannotate Final GFF is listing human gene names with random underscores at the end

Are you using the latest release? Yes.

Describe the bug

I've obtained the final GFF after running 'funannotate annotate'. However, I've noticed that a lot of the gene names have underscores in them when they shouldn't. For example, "PNRC2_1", "PNRC2_2", "ACO1_1", "ACO1_2", when the actual gene names should be listed as "PNRC2" and "ACO1". These are human genes, for reference. Can someone please explain what's happening, and how I can get it to print out the gene names properly? Thank you.

What command did you issue?

#funannotate annotate -i ./Output --busco_db mammalia --cpus 16 --eggnog ./test.emapper.annotations

OS/Install Information

Checking dependencies for 1.8.15

You are running Python v 3.8.15. Now checking python packages...
  biopython: 1.76
  goatools: 1.3.1
  matplotlib: 3.4.3
  natsort: 8.4.0
  numpy: 1.24.2
  pandas: 2.0.0
  psutil: 5.9.5
  requests: 2.31.0
  scikit-learn: 1.3.0
  scipy: 1.10.1
  seaborn: 0.12.2
All 11 python packages installed

You are running Perl v b'5.032001'. Now checking perl modules...
  Carp: 1.50
  Clone: 0.46
  DBD::SQLite: 1.72
  DBD::mysql: 4.046
  DBI: 1.643
  DB_File: 1.858
  Data::Dumper: 2.183
  File::Basename: 2.85
  File::Which: 1.24
  Getopt::Long: 2.54
  Hash::Merge: 0.302
  JSON: 4.10
  LWP::UserAgent: 6.67
  Logger::Simple: 2.0
  POSIX: 1.94
  Parallel::ForkManager: 2.02
  Pod::Usage: 1.69
  Scalar::Util::Numeric: 0.40
  Storable: 3.15
  Text::Soundex: 3.05
  Thread::Queue: 3.14
  Tie::File: 1.06
  URI::Escape: 5.17
  YAML: 1.30
  local::lib: 2.000029
  threads: 2.25
  threads::shared: 1.61
All 27 Perl modules installed

Checking Environmental Variables...
  $FUNANNOTATE_DB=/home/xxx/funannotate_db
  $PASAHOME=/home/xxx/miniconda3/envs/funannotate/opt/pasa-2.5.3
  $TRINITY_HOME=/home/xxx/miniconda3/envs/funannotate/opt/trinity-2.8.5
  $EVM_HOME=/home/xxx/miniconda3/envs/funannotate/opt/evidencemodeler-1.1.1
  $AUGUSTUS_CONFIG_PATH=/home/xxx/miniconda3/envs/funannotate/config/
  ERROR: GENEMARK_PATH not set. export GENEMARK_PATH=/path/to/dir

Checking external dependencies...
  PASA: 2.5.3
  CodingQuarry: 2.0
  Trinity: 2.8.5
  augustus: 3.5.0
  bamtools: bamtools 2.5.1
  bedtools: bedtools v2.31.0
  blat: BLAT v37x1
  diamond: 2.1.8
  emapper.py: 2.1.10
  ete3: 3.1.3
  exonerate: exonerate 2.4.0
  fasta: 36.3.8g
  glimmerhmm: 3.0.4
  gmap: 2023-07-20
  hisat2: 2.2.1
  hmmscan: HMMER 3.3.2 (Nov 2020)
  hmmsearch: HMMER 3.3.2 (Nov 2020)
  java: 17.0.3-internal
  kallisto: 0.46.1
  mafft: v7.520 (2023/Mar/22)
  makeblastdb: makeblastdb 2.14.1+
  minimap2: 2.26-r1175
  pigz: 2.6
  proteinortho: 6.3.0
  pslCDnaFilter: no way to determine
  salmon: salmon 0.14.1
  samtools: samtools 1.17
  snap: 2006-07-28
  stringtie: 2.2.1
  tRNAscan-SE: 2.0.12 (Nov 2022)
  tantan: tantan 40
  tbl2asn: 25.8
  tblastn: tblastn 2.14.1+
  trimal: trimAl v1.4.rev15 build[2013-12-17]
  trimmomatic: 0.39
  ERROR: gmes_petap.pl not installed
  ERROR: signalp not installed

Sep 25 '23 16:09 jdee3

The scripts will do this when there are two homologs present. So, do you have a haploid assembly, or are there two copies of some genes?

Oct 24 '23 15:10 nextgenusfs

Hi Jon, thanks for the reply!

This is a previously uncharacterized diploid mammal species. The transcript support came from a de novo Trinity transcriptome reconstruction, and the protein support was a concatenated, non-redundant FASTA of Swiss-Prot protein sequences from other mammals. I'm not sure what you mean by two copies of some genes. How would this arise? From the genome assembly?

I greatly appreciate your help, thank you.

Oct 30 '23 07:10 jdee3

Is your assembly diploid or haploid was all I was asking. If it's diploid (two copies of each gene) that would explain the behavior.

Oct 30 '23 15:10 nextgenusfs

Yes, diploid. How can I circumvent this issue though? Can I just rename the genes, removing the underscores?

Nov 06 '23 08:11 jdee3

I wouldn't call it an "issue". I don't think you can have two identical gene names under NCBI rules. Assemblies were traditionally haploidized, but since I don't work on diploids I don't know what the current rules are now that technology allows for phased assemblies.

Nov 06 '23 15:11 nextgenusfs
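
Before deciding whether to rename anything, it may help to see which names received a numeric suffix and where their copies sit in the assembly. The following is a minimal sketch, not part of funannotate: it assumes the gene name is stored in a `Name=` attribute on `gene` rows of the GFF3 (check your own file and change `name_key` if it differs), and the function names and the example records are hypothetical.

```python
import re
from collections import defaultdict

# Matches names that end in "_<digits>", e.g. "PNRC2_1" -> base "PNRC2".
SUFFIX = re.compile(r"^(?P<base>.+)_(?P<n>\d+)$")


def parse_attributes(field):
    """Split a GFF3 attribute column ("key=value;key=value") into a dict."""
    attrs = {}
    for part in field.strip().split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def group_suffixed_genes(gff_lines, name_key="Name"):
    """Group gene rows whose name ends in _<number> by their base name.

    name_key="Name" is an assumption about where the gene name is stored.
    Returns {base_name: [(name, seqid, start, end, strand), ...]}.
    """
    groups = defaultdict(list)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue  # only look at well-formed gene rows
        name = parse_attributes(cols[8]).get(name_key)
        if name is None:
            continue
        match = SUFFIX.match(name)
        if match:
            groups[match.group("base")].append(
                (name, cols[0], int(cols[3]), int(cols[4]), cols[6])
            )
    return dict(groups)


# Made-up records for illustration only.
example = [
    "##gff-version 3",
    "scaffold_1\tfunannotate\tgene\t100\t900\t.\t+\t.\tID=FUN_000001;Name=PNRC2_1;",
    "scaffold_7\tfunannotate\tgene\t50\t850\t.\t-\t.\tID=FUN_000412;Name=PNRC2_2;",
    "scaffold_2\tfunannotate\tgene\t10\t400\t.\t+\t.\tID=FUN_000090;Name=TP53;",
]

groups = group_suffixed_genes(example)
for base, copies in sorted(groups.items()):
    seqids = sorted({copy[1] for copy in copies})
    print(base, len(copies), "copies on", ", ".join(seqids))
```

To run it on a real annotation, pass an open file handle instead of the `example` list. Note that the location of the copies does not by itself say whether they are two haplotype copies of one gene or true paralogs, and that a legitimate gene name ending in `_<number>` would also be picked up by this pattern.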
