clusterProfiler
clusterProfiler copied to clipboard
Mitochondrial Genes and Annotations missing from EnrichGO
Hi Great tool thanks for maintaining it.
i have a question that relates to the lack of mitochondrial genome encoded genes and their corresponding ontologies.
In the following code:
hgnc <- c("MT-ATP6", "MT-ATP8", "MT-CO1", "MT-CO2", "MT-CO3", "MT-CYB", "MT-ND1", "MT-ND2", "MT-ND3", "MT-ND4", "MT-ND4L", "MT-ND5", "MT-ND6", "MT-RNR1", "MT-RNR2")
# Run GO enrichment analysis genes up
GO.up <- enrichGO(gene = hgnc ,
universe = all_genes,
keyType = "SYMBOL", #keytypes(org.Hs.eg.db)
OrgDb = org.Hs.eg.db,
ont = c("ALL"),
pAdjustMethod = "BH",
pvalueCutoff = 1,
qvalueCutoff = 1, #if not set default is at 0.05
readable = TRUE)
I get a NULL, which is impossible because all these terms are part of multiple mitochondrial/ETC component annotations. I noticed that this could be a symbol issue, but upon using shorthand symbols (which is incorrect they should be HNGC) like ATP6, ATP8 etc, I found that the answer was still NULL.
However this could be a GO 2023 problem. If so, can one utilize older GO like from 2019 or 2018? If so how?
Thanks a lot.!
🐉
I believe the issue is that the hgnc symbols are not recognized c.q. compatible with the org.Hs.eg.db annotation, which is NCBI-based.
If the corresponding ENTREZID are used, GO terms are returned...
> library(org.Hs.eg.db)
>
> hgnc <- c("MT-ATP6", "MT-ATP8", "MT-CO1", "MT-CO2", "MT-CO3", "MT-CYB", "MT-ND1", "MT-ND2",
+ "MT-ND3", "MT-ND4", "MT-ND4L", "MT-ND5", "MT-ND6", "MT-RNR1", "MT-RNR2")
>
> ## nothing is found
> AnnotationDbi:::select(org.Hs.eg.db, keys = hgnc, keytype = "SYMBOL",
+ columns = c("ENTREZID", "SYMBOL", "GENENAME") )
Error in .testForValidKeys(x, keys, keytype, fks) :
None of the keys entered are valid keys for 'SYMBOL'. Please use the keys method to see a listing of valid arguments.
>
> ## try ALIAS
> ## almost nothing is found
> AnnotationDbi:::select(org.Hs.eg.db, keys = hgnc, keytype = "ALIAS",
+ columns = c("ENTREZID", "SYMBOL", "GENENAME") )
'select()' returned 1:1 mapping between keys and columns
ALIAS ENTREZID SYMBOL GENENAME
1 MT-ATP6 <NA> <NA> <NA>
2 MT-ATP8 <NA> <NA> <NA>
3 MT-CO1 <NA> <NA> <NA>
4 MT-CO2 107075310 MTCO2P12 MT-CO2 pseudogene 12
5 MT-CO3 <NA> <NA> <NA>
6 MT-CYB <NA> <NA> <NA>
7 MT-ND1 <NA> <NA> <NA>
8 MT-ND2 <NA> <NA> <NA>
9 MT-ND3 <NA> <NA> <NA>
10 MT-ND4 <NA> <NA> <NA>
11 MT-ND4L <NA> <NA> <NA>
12 MT-ND5 <NA> <NA> <NA>
13 MT-ND6 <NA> <NA> <NA>
14 MT-RNR1 <NA> <NA> <NA>
15 MT-RNR2 <NA> <NA> <NA>
>
>
> ## manually look up your first 2 entries:
> ## MT-ATP6 has ENTREZID 4508 (https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:7414)
> ## MT-ATP8 has ENTREZID 4509 (https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:7415)
>
>
> AnnotationDbi:::select(org.Hs.eg.db, keys = c("4508","4509"), keytype = "ENTREZID",
+ columns = c("ENTREZID", "SYMBOL", "ALIAS", "GENENAME", "GOALL") ) [c(1:4, 1808:1812) ,]
'select()' returned 1:many mapping between keys and columns
ENTREZID SYMBOL ALIAS GENENAME GOALL EVIDENCEALL
1 4508 ATP6 ATPase6 ATP synthase F0 subunit 6 GO:0003674 IBA
2 4508 ATP6 ATPase6 ATP synthase F0 subunit 6 GO:0003674 IDA
3 4508 ATP6 ATPase6 ATP synthase F0 subunit 6 GO:0003674 IPI
4 4508 ATP6 ATPase6 ATP synthase F0 subunit 6 GO:0003824 IBA
1808 4509 ATP8 ATP8 ATP synthase F0 subunit 8 GO:1902600 IEA
1809 4509 ATP8 ATP8 ATP synthase F0 subunit 8 GO:1904949 IBA
1810 4509 ATP8 ATP8 ATP synthase F0 subunit 8 GO:1904949 IDA
1811 4509 ATP8 ATP8 ATP synthase F0 subunit 8 GO:1904949 IEA
1812 4509 ATP8 ATP8 ATP synthase F0 subunit 8 GO:1904949 NAS
ONTOLOGYALL
1 MF
2 MF
3 MF
4 MF
1808 BP
1809 CC
1810 CC
1811 CC
1812 CC
>
>
> packageVersion("org.Hs.eg.db")
[1] ‘3.19.1’
>
LATER ADDED:
Yet, when you check the NCBI gene page for the first entry (4508) (here) the official symbol is MT-ATP6!
Mmm...?? Maybe something to report on the Bioconductor support forum?
For the archive: I have reported this issue at the Bioconductor support site, and it turns out that in the OrgDb the 'default' names and symbols are used, and not the 'official' ones... This will be rectified, but when exactly this is not known at the moment.
See: https://support.bioconductor.org/p/9160426/
It seems like NCBI wont be making the change any time soon though, but at least we now know what the problem is. I transitioned to g:GOSt which doesn't have this problem.
For the archive: since Bioconductor release 3.21 the official symbols are used in org.Hs.eg.db.
The code in the 1st post thus works.
> library(org.Hs.eg.db)
> library(clusterProfiler)
> library(org.Hs.eg.db)
> hgnc <- c("MT-ATP6", "MT-ATP8", "MT-CO1", "MT-CO2", "MT-CO3", "MT-CYB",
"MT-ND1", "MT-ND2", "MT-ND3", "MT-ND4", "MT-ND4L", "MT-ND5", "MT-ND6",
"MT-RNR1", "MT-RNR2")
> GO.up <- enrichGO(gene = hgnc ,
keyType = "SYMBOL",
OrgDb = org.Hs.eg.db,
ont = c("ALL"),
pAdjustMethod = "BH",
pvalueCutoff = 1,
qvalueCutoff = 1,
readable = TRUE)
> GO.up
#
#...@organism Homo sapiens
#...@ontology GOALL
#...@keytype SYMBOL
#...@gene chr [1:15] "MT-ATP6" "MT-ATP8" "MT-CO1" "MT-CO2" "MT-CO3" "MT-CYB" ...
#...pvalues adjusted by 'BH' with cutoff <1
#...125 enriched terms found
'data.frame': 125 obs. of 13 variables:
$ ONTOLOGY : chr "BP" "BP" "BP" "BP" ...
$ ID : chr "GO:0006119" "GO:1902600" "GO:0009060" "GO:0045333" ...
$ Description : chr "oxidative phosphorylation" "proton transmembrane transport" "aerobic respiration" "cellular respiration" ...
$ GeneRatio : chr "13/13" "13/13" "13/13" "13/13" ...
$ BgRatio : chr "151/18805" "190/18805" "201/18805" "245/18805" ...
$ RichFactor : num 0.0861 0.0684 0.0647 0.0531 0.1183 ...
$ FoldEnrichment: num 124.5 99 93.6 76.8 171.1 ...
$ zScore : num 40.1 35.7 34.7 31.4 43.2 ...
$ pvalue : num 3.41e-28 7.55e-27 1.61e-26 2.26e-25 1.82e-24 ...
$ p.adjust : num 3.06e-26 3.40e-25 4.82e-25 5.09e-24 3.27e-23 ...
$ qvalue : num 6.81e-27 7.55e-26 1.07e-25 1.13e-24 7.26e-24 ...
$ geneID : chr "MT-ATP6/MT-ATP8/MT-CO1/MT-CO2/MT-CO3/MT-CYB/MT-ND1/MT-ND2/MT-ND3/MT-ND4/MT-ND4L/MT-ND5/MT-ND6" "MT-ATP6/MT-ATP8/MT-CO1/MT-CO2/MT-CO3/MT-CYB/MT-ND1/MT-ND2/MT-ND3/MT-ND4/MT-ND4L/MT-ND5/MT-ND6" "MT-ATP6/MT-ATP8/MT-CO1/MT-CO2/MT-CO3/MT-CYB/MT-ND1/MT-ND2/MT-ND3/MT-ND4/MT-ND4L/MT-ND5/MT-ND6" "MT-ATP6/MT-ATP8/MT-CO1/MT-CO2/MT-CO3/MT-CYB/MT-ND1/MT-ND2/MT-ND3/MT-ND4/MT-ND4L/MT-ND5/MT-ND6" ...
$ Count : int 13 13 13 13 11 11 11 11 13 11 ...
#...Citation
S Xu, E Hu, Y Cai, Z Xie, X Luo, L Zhan, W Tang, Q Wang, B Liu, R Wang, W Xie, T Wu, L Xie, G Yu. Using clusterProfiler to characterize multiomics data. Nature Protocols. 2024, 19(11):3292-3320
> sessionInfo()
R version 4.5.0 (2025-04-11)
Platform: x86_64-redhat-linux-gnu
Running under: Fedora Linux 42 (Adams)
Matrix products: default
BLAS/LAPACK: FlexiBLAS OPENBLAS-OPENMP; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/Amsterdam
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] org.Hs.eg.db_3.21.0 AnnotationDbi_1.70.0 IRanges_2.42.0
[4] S4Vectors_0.46.0 Biobase_2.68.0 BiocGenerics_0.54.0
[7] generics_0.1.4 clusterProfiler_4.16.0
loaded via a namespace (and not attached):
[1] tidyselect_1.2.1 dplyr_1.1.4 farver_2.1.2
[4] blob_1.2.4 R.utils_2.13.0 Biostrings_2.76.0
[7] fastmap_1.2.0 lazyeval_0.2.2 digest_0.6.37
[10] lifecycle_1.0.4 KEGGREST_1.48.0 tidytree_0.4.6
[13] RSQLite_2.3.11 magrittr_2.0.3 compiler_4.5.0
[16] rlang_1.1.6 tools_4.5.0 igraph_2.1.4
[19] data.table_1.17.2 ggtangle_0.0.6 bit_4.6.0
[22] gson_0.1.0 plyr_1.8.9 RColorBrewer_1.1-3
[25] aplot_0.2.5 BiocParallel_1.42.0 purrr_1.0.4
[28] R.oo_1.27.1 grid_4.5.0 GOSemSim_2.34.0
[31] enrichplot_1.28.2 GO.db_3.21.0 ggplot2_3.5.2
[34] scales_1.4.0 dichromat_2.0-0.1 cli_3.6.5
[37] crayon_1.5.3 treeio_1.32.0 ggtree_3.16.0
[40] httr_1.4.7 reshape2_1.4.4 DBI_1.2.3
[43] qvalue_2.40.0 ape_5.8-1 cachem_1.1.0
[46] DOSE_4.2.0 stringr_1.5.1 splines_4.5.0
[49] parallel_4.5.0 ggplotify_0.1.2 XVector_0.48.0
[52] yulab.utils_0.2.0 vctrs_0.6.5 Matrix_1.7-3
[55] jsonlite_2.0.0 gridGraphics_0.5-1 patchwork_1.3.0
[58] bit64_4.6.0-1 ggrepel_0.9.6 tidyr_1.3.1
[61] glue_1.8.0 codetools_0.2-20 cowplot_1.1.3
[64] stringi_1.8.7 gtable_0.3.6 GenomeInfoDb_1.44.0
[67] UCSC.utils_1.4.0 tibble_3.2.1 pillar_1.10.2
[70] fgsea_1.34.0 GenomeInfoDbData_1.2.14 R6_2.6.1
[73] lattice_0.22-7 R.methodsS3_1.8.2 png_0.1-8
[76] memoise_2.0.1 ggfun_0.1.8 Rcpp_1.0.14
[79] fastmatch_1.1-6 nlme_3.1-168 fs_1.6.6
[82] pkgconfig_2.0.3
>