clusterProfiler
clusterProfiler copied to clipboard
Differences in gseGO and compareCluster (clusterProfiler) results
Dear all,
I have been using the gseGO function from clusterProfiler to perform gene set enrichment analysis with a ranked list of genes. However, when I try to reproduce the analysis using compareCluster, passing the same gene list as a single cluster and specifying fun = "gseGO", I obtain similar but not identical results. Some Gene Ontology terms appear in one result but not in the other, and vice versa.
Here is the code I used: Using gseGO
gse <- gseGO(geneList = gene_list,
ont = "ALL",
keyType = "ENSEMBL",
minGSSize = 3,
maxGSSize = 800,
pvalueCutoff = 0.05,
verbose = TRUE,
OrgDb = "org.Hs.eg.db",
pAdjustMethod = "fdr")
Using compareCluster
compareCluster(geneClusters = list(a1 = gene_list),
pvalueCutoff = 0.05,
ont = "ALL",
keyType = "ENSEMBL",
OrgDb = "org.Hs.eg.db",
fun = "gseGO",
minGSSize = 3,
maxGSSize = 800,
verbose = TRUE,
pAdjustMethod = "fdr")
Has anyone encountered this issue before? What could explain the discrepancies between the two approaches?
Thank you in advance for your insights!
Best regards,
Yes, it is expected that results between 2 runs are slightly different.
Note that clusterProfiler uses under the hood the function fgseaMultilevel (from the package fgsea) for GSEA-based analysis. The advantage of fgseaMultilevel is that it calculates exact p-values. Yet, within the function fgseaMultilevel some resampling of input data is performed to tune the function. Since this resampling is random, this will results in slightly differing results.
To prevent this from happening, thus to remove the randomness, a seed can be set and used.
Some code to illustrate this (using the example input data):
> library(clusterProfiler)
> library(org.Hs.eg.db)
> library(enrichplot)
>
> ## example list of genes
> data(geneList, package="DOSE")
>
>
> ## regular, random analysis
>
> res.gse <- gseGO(geneList = geneList,
+ ont = "ALL",
+ keyType = "ENTREZID",
+ eps = 0,
+ minGSSize = 3,
+ maxGSSize = 800,
+ pvalueCutoff = 0.05,
+ verbose = TRUE,
+ OrgDb = "org.Hs.eg.db",
+ pAdjustMethod = "fdr")
using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
preparing geneSet collections...
GSEA analysis...
leading edge analysis...
done...
>
> res.cc <- compareCluster(geneClusters = list(a1 = geneList),
+ fun = "gseGO",
+ ont = "ALL",
+ keyType = "ENTREZID",
+ eps = 0,
+ minGSSize = 3,
+ maxGSSize = 800,
+ pvalueCutoff = 0.05,
+ verbose = TRUE,
+ OrgDb = "org.Hs.eg.db",
+ pAdjustMethod = "fdr")
>
>
> ## compare results
> res.gse
#
# Gene Set Enrichment Analysis
#
#...@organism Homo sapiens
#...@setType GOALL
#...@keytype ENTREZID
#...@geneList Named num [1:12495] 4.57 4.51 4.42 4.14 3.88 ...
- attr(*, "names")= chr [1:12495] "4312" "8318" "10874" "55143" ...
#...nPerm
#...pvalues adjusted by 'fdr' with cutoff <0.05
#...890 enriched terms found
'data.frame': 890 obs. of 12 variables:
$ ONTOLOGY : chr "BP" "BP" "BP" "BP" ...
$ ID : chr "GO:0051276" "GO:0098813" "GO:0007059" "GO:1903047" ...
$ Description : chr "chromosome organization" "nuclear chromosome segregation" "chromosome segregation" "mitotic cell cycle process" ...
$ setSize : int 482 244 324 616 732 191 158 331 367 229 ...
$ enrichmentScore: num 0.521 0.629 0.583 0.469 0.443 ...
$ NES : num 2.57 2.89 2.74 2.36 2.25 ...
$ pvalue : num 3.95e-32 2.94e-30 2.16e-30 8.86e-29 1.11e-28 ...
$ p.adjust : num 5.55e-28 1.38e-26 1.38e-26 3.11e-25 3.12e-25 ...
$ qvalue : num 4.79e-28 1.19e-26 1.19e-26 2.69e-25 2.69e-25 ...
$ rank : num 1374 449 1374 1257 1264 ...
$ leading_edge : chr "tags=24%, list=11%, signal=22%" "tags=23%, list=4%, signal=22%" "tags=27%, list=11%, signal=25%" "tags=22%, list=10%, signal=21%" ...
$ core_enrichment: chr "8318/55143/991/9493/1062/4605/10403/7153/23397/9787/11065/55355/220134/51203/22974/10460/4751/55839/983/4085/98"| __truncated__ "55143/991/9493/1062/4605/9133/10403/7153/23397/259266/9787/11065/220134/51203/22974/10460/4751/983/4085/81930/8"| __truncated__ "55143/991/9493/1062/4605/9133/10403/7153/23397/259266/9787/11065/55355/220134/51203/22974/10460/4751/79019/5583"| __truncated__ "8318/55143/991/2305/9493/1062/4605/9833/9133/10403/23397/6241/55165/9787/11065/220134/51203/22974/10460/4751/27"| __truncated__ ...
#...Citation
S Xu, E Hu, Y Cai, Z Xie, X Luo, L Zhan, W Tang, Q Wang, B Liu, R Wang, W Xie, T Wu, L Xie, G Yu. Using clusterProfiler to characterize multiomics data. Nature Protocols. 2024, 19(11):3292-3320
>
> res.cc
#
# Result of Comparing 1 gene clusters
#
#.. @fun gseGO
#.. @geneClusters List of 1
$ a1: Named num [1:12495] 4.57 4.51 4.42 4.14 3.88 ...
..- attr(*, "names")= chr [1:12495] "4312" "8318" "10874" "55143" ...
#...Result 'data.frame': 875 obs. of 13 variables:
$ Cluster : Factor w/ 1 level "a1": 1 1 1 1 1 1 1 1 1 1 ...
$ ONTOLOGY : chr "BP" "BP" "BP" "BP" ...
$ ID : chr "GO:0051276" "GO:0007059" "GO:0098813" "GO:0000278" ...
$ Description : chr "chromosome organization" "chromosome segregation" "nuclear chromosome segregation" "mitotic cell cycle" ...
$ setSize : int 482 324 244 732 616 191 158 331 319 138 ...
$ enrichmentScore: num 0.521 0.583 0.629 0.443 0.469 ...
$ NES : num 2.57 2.77 2.88 2.25 2.34 ...
$ pvalue : num 3.57e-32 5.36e-31 2.08e-30 1.59e-28 2.01e-28 ...
$ p.adjust : num 5.02e-28 3.77e-27 9.76e-27 5.58e-25 5.66e-25 ...
$ qvalue : num 4.33e-28 3.25e-27 8.43e-27 4.82e-25 4.88e-25 ...
$ rank : num 1374 1374 449 1264 1257 ...
$ leading_edge : chr "tags=24%, list=11%, signal=22%" "tags=27%, list=11%, signal=25%" "tags=23%, list=4%, signal=22%" "tags=21%, list=10%, signal=20%" ...
$ core_enrichment: chr "8318/55143/991/9493/1062/4605/10403/7153/23397/9787/11065/55355/220134/51203/22974/10460/4751/55839/983/4085/98"| __truncated__ "55143/991/9493/1062/4605/9133/10403/7153/23397/259266/9787/11065/55355/220134/51203/22974/10460/4751/79019/5583"| __truncated__ "55143/991/9493/1062/4605/9133/10403/7153/23397/259266/9787/11065/220134/51203/22974/10460/4751/983/4085/81930/8"| __truncated__ "8318/55143/991/2305/9493/1062/4605/9833/9133/10403/23397/79733/6241/55165/9787/11065/220134/55872/51203/22974/1"| __truncated__ ...
#.. number of enriched terms found for each gene cluster:
#.. a1: 875
#
#...Citation
T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou,
W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu.
clusterProfiler 4.0: A universal enrichment tool for interpreting omics data.
The Innovation. 2021, 2(3):100141
>
> ## plot p-values
> ## note the small differences that exist
>
> merged.res <- merge( as.data.frame(res.gse), as.data.frame(res.cc), by.x="ID", by.y="ID")
> plot( merged.res$pvalue.x, merged.res$pvalue.y )
>
>
> ## redo, but using a seed for reproducible results!
> ## note the argument 'seed = TRUE' when both functions are run.
>
> set.seed(1234)
>
> res.gse2 <- gseGO(geneList = geneList,
+ ont = "ALL",
+ keyType = "ENTREZID",
+ eps = 0,
+ minGSSize = 3,
+ maxGSSize = 800,
+ pvalueCutoff = 0.05,
+ verbose = TRUE,
+ OrgDb = "org.Hs.eg.db",
+ pAdjustMethod = "fdr",
+ seed = TRUE)
using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
preparing geneSet collections...
GSEA analysis...
leading edge analysis...
done...
>
> res.cc2 <- compareCluster(geneClusters = list(a1 = geneList),
+ fun = "gseGO",
+ ont = "ALL",
+ keyType = "ENTREZID",
+ eps = 0,
+ minGSSize = 3,
+ maxGSSize = 800,
+ pvalueCutoff = 0.05,
+ verbose = TRUE,
+ OrgDb = "org.Hs.eg.db",
+ pAdjustMethod = "fdr",
+ seed = TRUE)
>
>
> res.gse2
#
# Gene Set Enrichment Analysis
#
#...@organism Homo sapiens
#...@setType GOALL
#...@keytype ENTREZID
#...@geneList Named num [1:12495] 4.57 4.51 4.42 4.14 3.88 ...
- attr(*, "names")= chr [1:12495] "4312" "8318" "10874" "55143" ...
#...nPerm
#...pvalues adjusted by 'fdr' with cutoff <0.05
#...890 enriched terms found
'data.frame': 890 obs. of 12 variables:
$ ONTOLOGY : chr "BP" "BP" "BP" "BP" ...
$ ID : chr "GO:0051276" "GO:0098813" "GO:0007059" "GO:1903047" ...
$ Description : chr "chromosome organization" "nuclear chromosome segregation" "chromosome segregation" "mitotic cell cycle process" ...
$ setSize : int 482 244 324 616 732 191 158 331 367 229 ...
$ enrichmentScore: num 0.521 0.629 0.583 0.469 0.443 ...
$ NES : num 2.57 2.89 2.74 2.36 2.25 ...
$ pvalue : num 3.95e-32 2.94e-30 2.16e-30 8.86e-29 1.11e-28 ...
$ p.adjust : num 5.55e-28 1.38e-26 1.38e-26 3.11e-25 3.12e-25 ...
$ qvalue : num 4.79e-28 1.19e-26 1.19e-26 2.69e-25 2.69e-25 ...
$ rank : num 1374 449 1374 1257 1264 ...
$ leading_edge : chr "tags=24%, list=11%, signal=22%" "tags=23%, list=4%, signal=22%" "tags=27%, list=11%, signal=25%" "tags=22%, list=10%, signal=21%" ...
$ core_enrichment: chr "8318/55143/991/9493/1062/4605/10403/7153/23397/9787/11065/55355/220134/51203/22974/10460/4751/55839/983/4085/98"| __truncated__ "55143/991/9493/1062/4605/9133/10403/7153/23397/259266/9787/11065/220134/51203/22974/10460/4751/983/4085/81930/8"| __truncated__ "55143/991/9493/1062/4605/9133/10403/7153/23397/259266/9787/11065/55355/220134/51203/22974/10460/4751/79019/5583"| __truncated__ "8318/55143/991/2305/9493/1062/4605/9833/9133/10403/23397/6241/55165/9787/11065/220134/51203/22974/10460/4751/27"| __truncated__ ...
#...Citation
S Xu, E Hu, Y Cai, Z Xie, X Luo, L Zhan, W Tang, Q Wang, B Liu, R Wang, W Xie, T Wu, L Xie, G Yu. Using clusterProfiler to characterize multiomics data. Nature Protocols. 2024, 19(11):3292-3320
>
> res.cc2
#
# Result of Comparing 1 gene clusters
#
#.. @fun gseGO
#.. @geneClusters List of 1
$ a1: Named num [1:12495] 4.57 4.51 4.42 4.14 3.88 ...
..- attr(*, "names")= chr [1:12495] "4312" "8318" "10874" "55143" ...
#...Result 'data.frame': 890 obs. of 13 variables:
$ Cluster : Factor w/ 1 level "a1": 1 1 1 1 1 1 1 1 1 1 ...
$ ONTOLOGY : chr "BP" "BP" "BP" "BP" ...
$ ID : chr "GO:0051276" "GO:0098813" "GO:0007059" "GO:1903047" ...
$ Description : chr "chromosome organization" "nuclear chromosome segregation" "chromosome segregation" "mitotic cell cycle process" ...
$ setSize : int 482 244 324 616 732 191 158 331 367 229 ...
$ enrichmentScore: num 0.521 0.629 0.583 0.469 0.443 ...
$ NES : num 2.57 2.89 2.74 2.36 2.25 ...
$ pvalue : num 3.95e-32 2.94e-30 2.16e-30 8.86e-29 1.11e-28 ...
$ p.adjust : num 5.55e-28 1.38e-26 1.38e-26 3.11e-25 3.12e-25 ...
$ qvalue : num 4.79e-28 1.19e-26 1.19e-26 2.69e-25 2.69e-25 ...
$ rank : num 1374 449 1374 1257 1264 ...
$ leading_edge : chr "tags=24%, list=11%, signal=22%" "tags=23%, list=4%, signal=22%" "tags=27%, list=11%, signal=25%" "tags=22%, list=10%, signal=21%" ...
$ core_enrichment: chr "8318/55143/991/9493/1062/4605/10403/7153/23397/9787/11065/55355/220134/51203/22974/10460/4751/55839/983/4085/98"| __truncated__ "55143/991/9493/1062/4605/9133/10403/7153/23397/259266/9787/11065/220134/51203/22974/10460/4751/983/4085/81930/8"| __truncated__ "55143/991/9493/1062/4605/9133/10403/7153/23397/259266/9787/11065/55355/220134/51203/22974/10460/4751/79019/5583"| __truncated__ "8318/55143/991/2305/9493/1062/4605/9833/9133/10403/23397/6241/55165/9787/11065/220134/51203/22974/10460/4751/27"| __truncated__ ...
#.. number of enriched terms found for each gene cluster:
#.. a1: 890
#
#...Citation
T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou,
W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu.
clusterProfiler 4.0: A universal enrichment tool for interpreting omics data.
The Innovation. 2021, 2(3):100141
>
> ## plot p-values
> ## note all values are identical!
>
> merged.res2 <- merge( as.data.frame(res.gse2), as.data.frame(res.cc2), by.x="ID", by.y="ID")
> plot( merged.res2$pvalue.x, merged.res2$pvalue.y )
>
>
>
For the archive; other threads on this topic: https://github.com/YuLab-SMU/clusterProfiler/issues/466 https://github.com/YuLab-SMU/clusterProfiler/issues/695
Thank you very much for your explanation. It has been very enlightening, and it will certainly help me add more weight to my results. I truly appreciate your time and insights. Alberto