clusterProfiler icon indicating copy to clipboard operation
clusterProfiler copied to clipboard

Mitochondrial Genes and Annotations missing from EnrichGO

Open Dragonmasterx87 opened this issue 1 year ago • 4 comments

Hi Great tool thanks for maintaining it.

i have a question that relates to the lack of mitochondrial genome encoded genes and their corresponding ontologies.

In the following code:

hgnc <- c("MT-ATP6", "MT-ATP8", "MT-CO1", "MT-CO2", "MT-CO3", "MT-CYB", "MT-ND1", "MT-ND2", "MT-ND3", "MT-ND4", "MT-ND4L", "MT-ND5", "MT-ND6", "MT-RNR1", "MT-RNR2")
  
  # Run GO enrichment analysis genes up
  GO.up <- enrichGO(gene = hgnc , 
                    universe = all_genes, 
                    keyType = "SYMBOL", #keytypes(org.Hs.eg.db)
                    OrgDb = org.Hs.eg.db, 
                    ont = c("ALL"), 
                    pAdjustMethod = "BH", 
                    pvalueCutoff = 1, 
                    qvalueCutoff = 1, #if not set default is at 0.05
                    readable = TRUE)

I get a NULL, which is impossible because all these terms are part of multiple mitochondrial/ETC component annotations. I noticed that this could be a symbol issue, but upon using shorthand symbols (which is incorrect they should be HNGC) like ATP6, ATP8 etc, I found that the answer was still NULL.

However this could be a GO 2023 problem. If so, can one utilize older GO like from 2019 or 2018? If so how?

Thanks a lot.!

🐉

Dragonmasterx87 avatar Oct 21 '24 20:10 Dragonmasterx87

I believe the issue is that the hgnc symbols are not recognized c.q. compatible with the org.Hs.eg.db annotation, which is NCBI-based.

If the corresponding ENTREZID are used, GO terms are returned...

> library(org.Hs.eg.db)
> 
> hgnc <- c("MT-ATP6", "MT-ATP8", "MT-CO1", "MT-CO2", "MT-CO3", "MT-CYB", "MT-ND1", "MT-ND2",
+           "MT-ND3", "MT-ND4", "MT-ND4L", "MT-ND5", "MT-ND6", "MT-RNR1", "MT-RNR2")
> 
> ## nothing is found
> AnnotationDbi:::select(org.Hs.eg.db, keys = hgnc, keytype = "SYMBOL",
+                columns = c("ENTREZID", "SYMBOL", "GENENAME") )
Error in .testForValidKeys(x, keys, keytype, fks) : 
  None of the keys entered are valid keys for 'SYMBOL'. Please use the keys method to see a listing of valid arguments.
> 
> ## try ALIAS
> ## almost nothing is found
> AnnotationDbi:::select(org.Hs.eg.db, keys = hgnc, keytype = "ALIAS",
+                columns = c("ENTREZID", "SYMBOL", "GENENAME") )
'select()' returned 1:1 mapping between keys and columns
     ALIAS  ENTREZID   SYMBOL             GENENAME
1  MT-ATP6      <NA>     <NA>                 <NA>
2  MT-ATP8      <NA>     <NA>                 <NA>
3   MT-CO1      <NA>     <NA>                 <NA>
4   MT-CO2 107075310 MTCO2P12 MT-CO2 pseudogene 12
5   MT-CO3      <NA>     <NA>                 <NA>
6   MT-CYB      <NA>     <NA>                 <NA>
7   MT-ND1      <NA>     <NA>                 <NA>
8   MT-ND2      <NA>     <NA>                 <NA>
9   MT-ND3      <NA>     <NA>                 <NA>
10  MT-ND4      <NA>     <NA>                 <NA>
11 MT-ND4L      <NA>     <NA>                 <NA>
12  MT-ND5      <NA>     <NA>                 <NA>
13  MT-ND6      <NA>     <NA>                 <NA>
14 MT-RNR1      <NA>     <NA>                 <NA>
15 MT-RNR2      <NA>     <NA>                 <NA>
> 
> 
> ## manually look up your first 2 entries:
> ## MT-ATP6 has ENTREZID 4508 (https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:7414)
> ## MT-ATP8 has ENTREZID 4509 (https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:7415)
> 
> 
> AnnotationDbi:::select(org.Hs.eg.db, keys = c("4508","4509"), keytype = "ENTREZID",
+                columns = c("ENTREZID", "SYMBOL", "ALIAS", "GENENAME", "GOALL") ) [c(1:4, 1808:1812) ,]
'select()' returned 1:many mapping between keys and columns
     ENTREZID SYMBOL   ALIAS                  GENENAME      GOALL EVIDENCEALL
1        4508   ATP6 ATPase6 ATP synthase F0 subunit 6 GO:0003674         IBA
2        4508   ATP6 ATPase6 ATP synthase F0 subunit 6 GO:0003674         IDA
3        4508   ATP6 ATPase6 ATP synthase F0 subunit 6 GO:0003674         IPI
4        4508   ATP6 ATPase6 ATP synthase F0 subunit 6 GO:0003824         IBA
1808     4509   ATP8    ATP8 ATP synthase F0 subunit 8 GO:1902600         IEA
1809     4509   ATP8    ATP8 ATP synthase F0 subunit 8 GO:1904949         IBA
1810     4509   ATP8    ATP8 ATP synthase F0 subunit 8 GO:1904949         IDA
1811     4509   ATP8    ATP8 ATP synthase F0 subunit 8 GO:1904949         IEA
1812     4509   ATP8    ATP8 ATP synthase F0 subunit 8 GO:1904949         NAS
     ONTOLOGYALL
1             MF
2             MF
3             MF
4             MF
1808          BP
1809          CC
1810          CC
1811          CC
1812          CC
> 
> 
> packageVersion("org.Hs.eg.db")
[1] ‘3.19.1’
> 

LATER ADDED: Yet, when you check the NCBI gene page for the first entry (4508) (here) the official symbol is MT-ATP6!

Mmm...?? Maybe something to report on the Bioconductor support forum?

guidohooiveld avatar Oct 22 '24 14:10 guidohooiveld

For the archive: I have reported this issue at the Bioconductor support site, and it turns out that in the OrgDb the 'default' names and symbols are used, and not the 'official' ones... This will be rectified, but when exactly this is not known at the moment.

See: https://support.bioconductor.org/p/9160426/

guidohooiveld avatar Nov 06 '24 16:11 guidohooiveld

It seems like NCBI wont be making the change any time soon though, but at least we now know what the problem is. I transitioned to g:GOSt which doesn't have this problem.

Dragonmasterx87 avatar Nov 06 '24 18:11 Dragonmasterx87

For the archive: since Bioconductor release 3.21 the official symbols are used in org.Hs.eg.db. The code in the 1st post thus works.

> library(org.Hs.eg.db)
> library(clusterProfiler)
> library(org.Hs.eg.db)
> hgnc <- c("MT-ATP6", "MT-ATP8", "MT-CO1", "MT-CO2", "MT-CO3", "MT-CYB",
          "MT-ND1", "MT-ND2", "MT-ND3", "MT-ND4", "MT-ND4L", "MT-ND5", "MT-ND6", 
          "MT-RNR1", "MT-RNR2")
> GO.up <- enrichGO(gene = hgnc , 
                  keyType = "SYMBOL",
                  OrgDb = org.Hs.eg.db, 
                  ont = c("ALL"), 
                  pAdjustMethod = "BH", 
                  pvalueCutoff = 1, 
                  qvalueCutoff = 1,
                  readable = TRUE)
> GO.up
#
#...@organism    Homo sapiens 
#...@ontology    GOALL 
#...@keytype     SYMBOL 
#...@gene        chr [1:15] "MT-ATP6" "MT-ATP8" "MT-CO1" "MT-CO2" "MT-CO3" "MT-CYB" ...
#...pvalues adjusted by 'BH' with cutoff <1 
#...125 enriched terms found
'data.frame':   125 obs. of  13 variables:
 $ ONTOLOGY      : chr  "BP" "BP" "BP" "BP" ...
 $ ID            : chr  "GO:0006119" "GO:1902600" "GO:0009060" "GO:0045333" ...
 $ Description   : chr  "oxidative phosphorylation" "proton transmembrane transport" "aerobic respiration" "cellular respiration" ...
 $ GeneRatio     : chr  "13/13" "13/13" "13/13" "13/13" ...
 $ BgRatio       : chr  "151/18805" "190/18805" "201/18805" "245/18805" ...
 $ RichFactor    : num  0.0861 0.0684 0.0647 0.0531 0.1183 ...
 $ FoldEnrichment: num  124.5 99 93.6 76.8 171.1 ...
 $ zScore        : num  40.1 35.7 34.7 31.4 43.2 ...
 $ pvalue        : num  3.41e-28 7.55e-27 1.61e-26 2.26e-25 1.82e-24 ...
 $ p.adjust      : num  3.06e-26 3.40e-25 4.82e-25 5.09e-24 3.27e-23 ...
 $ qvalue        : num  6.81e-27 7.55e-26 1.07e-25 1.13e-24 7.26e-24 ...
 $ geneID        : chr  "MT-ATP6/MT-ATP8/MT-CO1/MT-CO2/MT-CO3/MT-CYB/MT-ND1/MT-ND2/MT-ND3/MT-ND4/MT-ND4L/MT-ND5/MT-ND6" "MT-ATP6/MT-ATP8/MT-CO1/MT-CO2/MT-CO3/MT-CYB/MT-ND1/MT-ND2/MT-ND3/MT-ND4/MT-ND4L/MT-ND5/MT-ND6" "MT-ATP6/MT-ATP8/MT-CO1/MT-CO2/MT-CO3/MT-CYB/MT-ND1/MT-ND2/MT-ND3/MT-ND4/MT-ND4L/MT-ND5/MT-ND6" "MT-ATP6/MT-ATP8/MT-CO1/MT-CO2/MT-CO3/MT-CYB/MT-ND1/MT-ND2/MT-ND3/MT-ND4/MT-ND4L/MT-ND5/MT-ND6" ...
 $ Count         : int  13 13 13 13 11 11 11 11 13 11 ...
#...Citation
S Xu, E Hu, Y Cai, Z Xie, X Luo, L Zhan, W Tang, Q Wang, B Liu, R Wang, W Xie, T Wu, L Xie, G Yu. Using clusterProfiler to characterize multiomics data. Nature Protocols. 2024, 19(11):3292-3320 

> sessionInfo()
R version 4.5.0 (2025-04-11)
Platform: x86_64-redhat-linux-gnu
Running under: Fedora Linux 42 (Adams)

Matrix products: default
BLAS/LAPACK: FlexiBLAS OPENBLAS-OPENMP;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Amsterdam
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] org.Hs.eg.db_3.21.0    AnnotationDbi_1.70.0   IRanges_2.42.0        
[4] S4Vectors_0.46.0       Biobase_2.68.0         BiocGenerics_0.54.0   
[7] generics_0.1.4         clusterProfiler_4.16.0

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1        dplyr_1.1.4             farver_2.1.2           
 [4] blob_1.2.4              R.utils_2.13.0          Biostrings_2.76.0      
 [7] fastmap_1.2.0           lazyeval_0.2.2          digest_0.6.37          
[10] lifecycle_1.0.4         KEGGREST_1.48.0         tidytree_0.4.6         
[13] RSQLite_2.3.11          magrittr_2.0.3          compiler_4.5.0         
[16] rlang_1.1.6             tools_4.5.0             igraph_2.1.4           
[19] data.table_1.17.2       ggtangle_0.0.6          bit_4.6.0              
[22] gson_0.1.0              plyr_1.8.9              RColorBrewer_1.1-3     
[25] aplot_0.2.5             BiocParallel_1.42.0     purrr_1.0.4            
[28] R.oo_1.27.1             grid_4.5.0              GOSemSim_2.34.0        
[31] enrichplot_1.28.2       GO.db_3.21.0            ggplot2_3.5.2          
[34] scales_1.4.0            dichromat_2.0-0.1       cli_3.6.5              
[37] crayon_1.5.3            treeio_1.32.0           ggtree_3.16.0          
[40] httr_1.4.7              reshape2_1.4.4          DBI_1.2.3              
[43] qvalue_2.40.0           ape_5.8-1               cachem_1.1.0           
[46] DOSE_4.2.0              stringr_1.5.1           splines_4.5.0          
[49] parallel_4.5.0          ggplotify_0.1.2         XVector_0.48.0         
[52] yulab.utils_0.2.0       vctrs_0.6.5             Matrix_1.7-3           
[55] jsonlite_2.0.0          gridGraphics_0.5-1      patchwork_1.3.0        
[58] bit64_4.6.0-1           ggrepel_0.9.6           tidyr_1.3.1            
[61] glue_1.8.0              codetools_0.2-20        cowplot_1.1.3          
[64] stringi_1.8.7           gtable_0.3.6            GenomeInfoDb_1.44.0    
[67] UCSC.utils_1.4.0        tibble_3.2.1            pillar_1.10.2          
[70] fgsea_1.34.0            GenomeInfoDbData_1.2.14 R6_2.6.1               
[73] lattice_0.22-7          R.methodsS3_1.8.2       png_0.1-8              
[76] memoise_2.0.1           ggfun_0.1.8             Rcpp_1.0.14            
[79] fastmatch_1.1-6         nlme_3.1-168            fs_1.6.6               
[82] pkgconfig_2.0.3        
> 

guidohooiveld avatar May 13 '25 19:05 guidohooiveld