clusterProfiler
clusterProfiler copied to clipboard
How to run enrichment analysis of metabolites using clusterProfier
I have a list of metabolites, In the form of metabolite names, and also in the form of KEGG Ids. How can i run enrichment analysis on this data ? Please advise
Example data:
name keggid
Pyruvate C00022
Acetyl-CoA C00024
2OG C00026
Glycine C00037
Succinate C00042
Aspartate C00049
A pointer is given in, for example, the protocol recently published in Nature Methods, See section 1.9 on page 16. https://doi.org/10.1038/s41596-024-01020-z
In essence you can use the enrichKEGG function (since you have a list of KEGG ids), and use it with organism = "cpd" (because the input are 'compounds'.
> library(clusterProfiler)
>
> input <- data.frame("name"=c("Pyruvate","Acetyl-CoA","2OG","Glycine","Succinate","Aspartate"),
+ "keggid"=c("C00022","C00024","C00026","C00037","C00042","C00049") )
>
> ## run over-representation analysis.
> ## note not cutoffs are applied for significanc and minimum number of compunds.
> cpd_enrich_result <- enrichKEGG(input[,"keggid"],
+ organism = "cpd",
+ minGSSize=1,
+ pvalueCutoff=1)
Reading KEGG annotation online: "https://rest.kegg.jp/link/cpd/pathway"...
Reading KEGG annotation online: "https://rest.kegg.jp/list/pathway"...
> cpd_enrich_result
#
# over-representation test
#
#...@organism KEGG Compound
#...@ontology KEGG
#...@keytype kegg_compound
#...@gene chr [1:6] "C00022" "C00024" "C00026" "C00037" "C00042" "C00049"
#...pvalues adjusted by 'BH' with cutoff <1
#...103 enriched terms found
'data.frame': 103 obs. of 14 variables:
$ category : chr "Human Diseases" "Metabolism" "Metabolism" "Metabolism" ...
$ subcategory : chr "Cancer: overview" "Global and overview maps" "Chemical structure transformation maps" "Carbohydrate metabolism" ...
$ ID : chr "map05230" "map01200" "map01060" "map00630" ...
$ Description : chr "Central carbon metabolism in cancer" "Carbon metabolism" "Biosynthesis of plant secondary metabolites" "Glyoxylate and dicarboxylate metabolism" ...
$ GeneRatio : chr "6/6" "6/6" "6/6" "5/6" ...
$ BgRatio : chr "37/6500" "112/6500" "141/6500" "64/6500" ...
$ RichFactor : num 0.1622 0.0536 0.0426 0.0781 0.0746 ...
$ FoldEnrichment: num 175.7 58 46.1 84.6 80.8 ...
$ zScore : num 32.4 18.5 16.5 20.4 20 ...
$ pvalue : num 2.22e-14 2.29e-11 9.38e-11 4.70e-10 5.95e-10 ...
$ p.adjust : num 2.29e-12 1.18e-09 3.22e-09 1.10e-08 1.10e-08 ...
$ qvalue : num 2.11e-13 1.08e-10 2.96e-10 1.01e-09 1.01e-09 ...
$ geneID : chr "C00022/C00024/C00026/C00037/C00042/C00049" "C00022/C00024/C00026/C00037/C00042/C00049" "C00022/C00024/C00026/C00037/C00042/C00049" "C00022/C00024/C00026/C00037/C00042" ...
$ Count : int 6 6 6 5 5 5 4 4 4 4 ...
#...Citation
S Xu, E Hu, Y Cai, Z Xie, X Luo, L Zhan, W Tang, Q Wang, B Liu, R Wang, W Xie, T Wu, L Xie, G Yu. Using clusterProfiler to characterize multiomics data. Nature Protocols. 2024, doi:10.1038/s41596-024-01020-z
>
> as.data.frame(cpd_enrich_result)[1:3, ]
category subcategory ID
map05230 Human Diseases Cancer: overview map05230
map01200 Metabolism Global and overview maps map01200
map01060 Metabolism Chemical structure transformation maps map01060
Description GeneRatio BgRatio
map05230 Central carbon metabolism in cancer 6/6 37/6500
map01200 Carbon metabolism 6/6 112/6500
map01060 Biosynthesis of plant secondary metabolites 6/6 141/6500
RichFactor FoldEnrichment zScore pvalue p.adjust
map05230 0.16216216 175.67568 32.38612 2.224526e-14 2.291262e-12
map01200 0.05357143 58.03571 18.50615 2.289233e-11 1.178955e-09
map01060 0.04255319 46.09929 16.45613 9.376114e-11 3.219132e-09
qvalue geneID Count
map05230 2.107446e-13 C00022/C00024/C00026/C00037/C00042/C00049 6
map01200 1.084374e-10 C00022/C00024/C00026/C00037/C00042/C00049 6
map01060 2.960878e-10 C00022/C00024/C00026/C00037/C00042/C00049 6
>
> dotplot(cpd_enrich_result)
>
I tried to run this using a set of compound IDs, and I got an error. I also get the same error using your example code. , but if I run enrichKEGG using gene ID, it works fine, suggesting that the KEGG website is okay, but the cpd IDs don't map.
keggChem [1:13] [1] "C00130" "C00946" "C02990" "C01762" "C01081" "C00315" "C00750" "C00882" [9] "C05385" "C16300" "C03406" "C00357" "C05282"
kk <- enrichKEGG(gene = keggChem[1:10], organism = 'cpd', pvalueCutoff = 0.05)
Error in content[, 1] : subscript out of bounds
got an error. I also get the same error using your example code. , but if I run enrichKEGG using gene ID, it works fine, suggesting that the KEGG website is okay, but the cpd IDs don't map.
Hello @lungboyz - thanks for sharing. Would it be possible to share an both - example with error , and the example that worked, so its easier to compare and understand ? Much appreciated
@lungboyz
When using your 13 input ids it is working for me.... For testing purposes I would set all significance cutoffs to 1 (like I do below; so that no filtering is occurring). You also may need to use the latest version of clusterProfiler.
> library(clusterProfiler)
>
> keggChem <- c("C00130","C00946","C02990","C01762","C01081","C00315","C00750",
+ "C00882","C05385","C16300","C03406","C00357","C05282")
>
> cpd_enrich_result <- enrichKEGG(keggChem ,
+ organism = "cpd",
+ minGSSize=1,
+ pvalueCutoff=1)
Reading KEGG annotation online: "https://rest.kegg.jp/link/cpd/pathway"...
Reading KEGG annotation online: "https://rest.kegg.jp/list/pathway"...
>
> cpd_enrich_result
#
# over-representation test
#
#...@organism KEGG Compound
#...@ontology KEGG
#...@keytype kegg_compound
#...@gene chr [1:13] "C00130" "C00946" "C02990" "C01762" "C01081" "C00315" "C00750" ...
#...pvalues adjusted by 'BH' with cutoff <1
#...29 enriched terms found
'data.frame': 29 obs. of 14 variables:
$ category : chr "Metabolism" "Metabolism" "Metabolism" "Metabolism" ...
$ subcategory : chr "Metabolism of cofactors and vitamins" "Metabolism of other amino acids" "Chemical structure transformation maps" "Global and overview maps" ...
$ ID : chr "map00770" "map00410" "map01065" "map01240" ...
$ Description : chr "Pantothenate and CoA biosynthesis" "beta-Alanine metabolism" "Biosynthesis of alkaloids derived from histidine and purine" "Biosynthesis of cofactors" ...
$ GeneRatio : chr "2/11" "2/11" "2/11" "4/11" ...
$ BgRatio : chr "30/6502" "32/6502" "35/6502" "328/6502" ...
$ RichFactor : num 0.0667 0.0625 0.0571 0.0122 0.0526 ...
$ FoldEnrichment: num 39.41 36.94 33.78 7.21 31.11 ...
$ zScore : num 8.68 8.39 8 4.75 7.66 ...
$ pvalue : num 0.0011 0.00126 0.0015 0.00158 0.00177 ...
$ p.adjust : num 0.0103 0.0103 0.0103 0.0103 0.0103 ...
$ qvalue : num 0.00447 0.00447 0.00447 0.00447 0.00447 ...
$ geneID : chr "C00750/C00882" "C00315/C00750" "C00130/C01762" "C00130/C01081/C00750/C00882" ...
$ Count : int 2 2 2 4 2 2 2 2 2 2 ...
#...Citation
S Xu, E Hu, Y Cai, Z Xie, X Luo, L Zhan, W Tang, Q Wang, B Liu, R Wang, W Xie, T Wu, L Xie, G Yu. Using clusterProfiler to characterize multiomics data. Nature Protocols. 2024, 19(11):3292-3320
>
> as.data.frame(cpd_enrich_result)[1:3,]
category subcategory ID
map00770 Metabolism Metabolism of cofactors and vitamins map00770
map00410 Metabolism Metabolism of other amino acids map00410
map01065 Metabolism Chemical structure transformation maps map01065
Description GeneRatio
map00770 Pantothenate and CoA biosynthesis 2/11
map00410 beta-Alanine metabolism 2/11
map01065 Biosynthesis of alkaloids derived from histidine and purine 2/11
BgRatio RichFactor FoldEnrichment zScore pvalue p.adjust
map00770 30/6502 0.06666667 39.40606 8.679041 0.001103125 0.01026385
map00410 32/6502 0.06250000 36.94318 8.390155 0.001255496 0.01026385
map01065 35/6502 0.05714286 33.77662 8.003453 0.001501922 0.01026385
qvalue geneID Count
map00770 0.004470644 C00750/C00882 2
map00410 0.004470644 C00315/C00750 2
map01065 0.004470644 C00130/C01762 2
>
>
> packageVersion("clusterProfiler")
[1] ‘4.14.4’
>