enrichplot icon indicating copy to clipboard operation
enrichplot copied to clipboard

Inclusion and exclusion criteria used by dotplot

Open xinming-an opened this issue 8 months ago • 10 comments

Hi,

I encountered an issue where some pathways that were passed the significant cutoffs (adjusted p<0.05 and qvalue<0.2) were not included in the dotplot. Below includes the dotplot and the top pathways returned by function enrichKEGG. Any inputs on how to address this issue are greatly appreciated. Thank you!

Image

Image

xinming-an avatar Apr 11 '25 17:04 xinming-an

Did you check the help page of the function dotplot? Type ?dotplot

The default number of categories to be plotted is 10, if you would like to plot more categories (sets), increase the value of the argument showCategory (from 10).

guidohooiveld avatar Apr 14 '25 09:04 guidohooiveld

Thank you for the input, guidohooiveld. I understand that the number of categories to be plotted by default is 10 and it can be changed. My question is how it determines the top 10 (or 20) categories, and it does seems to be the adjusted p-value, qvalue or generatio. For example, the barplot uses the adjusted p-value to rank the categories

xinming-an avatar Apr 16 '25 15:04 xinming-an

If you run a gene set analysis, clusterProfiler filters these sets based on significance, and then orders them according to the value of p.adjust. See e.g. these threads (which you also found, I just saw): https://github.com/YuLab-SMU/clusterProfiler/issues/710#issuecomment-2244623389 https://github.com/YuLab-SMU/enrichplot/issues/270#issuecomment-2012958275

Within the function dotplot, the argument x defines which metric to use to 'order' the gene sets (number as defined by showCategory) on the x-axis of the plot. Hopefully you have seen that the default value is x = "GeneRatio". If you would like to use another metric, you should change x. x can be any of colnames(as.data.frame(res)) (see below for code).

> ## load library
> library(clusterProfiler)
> library(enrichplot)
> 
> ## load sample data
> data(geneList, package="DOSE")
> gene <- names(geneList)[abs(geneList) > 2]
> 
> ## perform enrichment analysis
> ## calculate adjusted p-values according to BH,
> ## and set pvalueCutoff to 0.05. 
> res <- enrichKEGG(gene         = gene,
+                   organism     = 'hsa',
+                   pvalueCutoff = 0.05,
+                   pAdjustMethod = "BH")
Reading KEGG annotation online: "https://rest.kegg.jp/link/hsa/pathway"...
Reading KEGG annotation online: "https://rest.kegg.jp/list/pathway/hsa"...
> 
> ## which columns/metrics are available?
> colnames(as.data.frame(res))
 [1] "category"       "subcategory"    "ID"             "Description"   
 [5] "GeneRatio"      "BgRatio"        "RichFactor"     "FoldEnrichment"
 [9] "zScore"         "pvalue"         "p.adjust"       "qvalue"        
[13] "geneID"         "Count"         
> 
> ## default dotplot (x = "GeneRatio")
> dotplot(res)
> 

Image

> ## order on (for example) zScore
> dotplot(res, x="zScore")
>

Image

> ## order on p.adjust (=FDR)
> dotplot(res, x="p.adjust")
>

Image

> ## note that using such small values for plotting
> ## doesn't look nice from an aesthetics point of view.
> ## Therefore manually add column with -log(p.adjust)
> ## to results.
> 
> res@result$minlog10p.adj <- -log10(res@result$p.adjust)
> 
> dotplot(res, x="minlog10p.adj")
>

Image

> ## To make nice x-axis label
> library(ggplot2)
> dotplot(res, x="minlog10p.adj") + xlab(expression( -log['10']*(FDR) ))
>

Image

guidohooiveld avatar Apr 18 '25 13:04 guidohooiveld

Thank you for the detailed explanation. However, if you compare the dotplot and the table I attached, I wonder why the first pathway in the table, Pathways of neurodegeneration - multiple diseases, was not included in the dotplot. Its generatio is higher than the first pathway in the dotplot, Human papillomavirus infection, and its adjusted pvalue also passed the cutoff of 0.05.

xinming-an avatar Apr 18 '25 13:04 xinming-an

I find your table difficult to read....

Can you paste the output from this line of code? as.data.frame(res)

Using the result object from my analysis:

> as.data.frame(res)
                                     category
hsa04110                   Cellular Processes
hsa04114                   Cellular Processes
hsa04218                   Cellular Processes
hsa04061 Environmental Information Processing
hsa03320                   Organismal Systems
hsa04814                   Cellular Processes
hsa04914                   Organismal Systems
                                 subcategory       ID
hsa04110               Cell growth and death hsa04110
hsa04114               Cell growth and death hsa04114
hsa04218               Cell growth and death hsa04218
hsa04061 Signaling molecules and interaction hsa04061
hsa03320                    Endocrine system hsa03320
hsa04814                       Cell motility hsa04814
hsa04914                    Endocrine system hsa04914
                                                           Description
hsa04110                                                    Cell cycle
hsa04114                                                Oocyte meiosis
hsa04218                                           Cellular senescence
hsa04061 Viral protein interaction with cytokine and cytokine receptor
hsa03320                                        PPAR signaling pathway
hsa04814                                                Motor proteins
hsa04914                       Progesterone-mediated oocyte maturation
         GeneRatio  BgRatio RichFactor FoldEnrichment   zScore       pvalue
hsa04110    15/106 158/8546 0.09493671       7.654048 9.460695 7.860673e-10
hsa04114    10/106 139/8546 0.07194245       5.800190 6.394153 7.902735e-06
hsa04218    10/106 157/8546 0.06369427       5.135200 5.860423 2.306908e-05
hsa04061     8/106 100/8546 0.08000000       6.449811 6.143186 3.110376e-05
hsa03320     7/106  76/8546 0.09210526       7.425770 6.305623 4.007301e-05
hsa04814    10/106 197/8546 0.05076142       4.092520 4.921145 1.571086e-04
hsa04914     7/106 111/8546 0.06306306       5.084311 4.853728 4.367132e-04
             p.adjust       qvalue
hsa04110 1.713627e-07 1.696251e-07
hsa04114 8.613982e-04 8.526636e-04
hsa04218 1.676353e-03 1.659355e-03
hsa04061 1.695155e-03 1.677966e-03
hsa03320 1.747183e-03 1.729467e-03
hsa04814 5.708278e-03 5.650395e-03
hsa04914 1.360050e-02 1.346259e-02
                                                                           geneID
hsa04110 8318/991/9133/10403/890/983/4085/81620/7272/9212/1111/9319/891/4174/9232
hsa04114                          991/9133/983/4085/51806/6790/891/9232/3708/5241
hsa04218                           2305/4605/9133/890/983/51806/1111/891/776/3708
hsa04061                                 3627/10563/6373/4283/6362/6355/9547/1524
hsa03320                                       4312/9415/9370/5105/2167/3158/5346
hsa04814                   9493/1062/81930/3832/3833/146909/10112/24137/4629/7802
hsa04914                                          9133/890/983/4085/6790/891/5241
         Count minlog10p.adj
hsa04110    15      6.766084
hsa04114    10      3.064796
hsa04218    10      2.775635
hsa04061     8      2.770791
hsa03320     7      2.757662
hsa04814    10      2.243495
hsa04914     7      1.866445
> 

guidohooiveld avatar Apr 18 '25 14:04 guidohooiveld

Sorry, it would be better to show the output from the 3 lines of code below. Note that I save the dotplot to object p. This has the advantage that all relevant information is present, which allows much easy viewing!

> p <- dotplot(res)
> p$data[order(p$data$GeneRatio, decreasing = TRUE)[1:10], ]
> print(p)

Output: (note that 3x NA are present (because only 7 sets are significant, but the first 10 should be returned...)

                                     category                         subcategory       ID
hsa04110                   Cellular Processes               Cell growth and death hsa04110
hsa04114                   Cellular Processes               Cell growth and death hsa04114
hsa04218                   Cellular Processes               Cell growth and death hsa04218
hsa04814                   Cellular Processes                       Cell motility hsa04814
hsa04061 Environmental Information Processing Signaling molecules and interaction hsa04061
hsa03320                   Organismal Systems                    Endocrine system hsa03320
hsa04914                   Organismal Systems                    Endocrine system hsa04914
NA                                       <NA>                                <NA>     <NA>
NA.1                                     <NA>                                <NA>     <NA>
NA.2                                     <NA>                                <NA>     <NA>
                                                           Description  GeneRatio     BgRatio
hsa04110                                                    Cell cycle 0.14150943 0.018488182
hsa04114                                                Oocyte meiosis 0.09433962 0.016264919
hsa04218                                           Cellular senescence 0.09433962 0.018371168
hsa04814                                                Motor proteins 0.09433962 0.023051720
hsa04061 Viral protein interaction with cytokine and cytokine receptor 0.07547170 0.011701381
hsa03320                                        PPAR signaling pathway 0.06603774 0.008893049
hsa04914                       Progesterone-mediated oocyte maturation 0.06603774 0.012988533
NA                                                                <NA>         NA          NA
NA.1                                                              <NA>         NA          NA
NA.2                                                              <NA>         NA          NA
         RichFactor FoldEnrichment   zScore       pvalue     p.adjust       qvalue
hsa04110 0.09493671       7.654048 9.460695 7.860673e-10 1.713627e-07 1.696251e-07
hsa04114 0.07194245       5.800190 6.394153 7.902735e-06 8.613982e-04 8.526636e-04
hsa04218 0.06369427       5.135200 5.860423 2.306908e-05 1.676353e-03 1.659355e-03
hsa04814 0.05076142       4.092520 4.921145 1.571086e-04 5.708278e-03 5.650395e-03
hsa04061 0.08000000       6.449811 6.143186 3.110376e-05 1.695155e-03 1.677966e-03
hsa03320 0.09210526       7.425770 6.305623 4.007301e-05 1.747183e-03 1.729467e-03
hsa04914 0.06306306       5.084311 4.853728 4.367132e-04 1.360050e-02 1.346259e-02
NA               NA             NA       NA           NA           NA           NA
NA.1             NA             NA       NA           NA           NA           NA
NA.2             NA             NA       NA           NA           NA           NA
                                                                           geneID Count
hsa04110 8318/991/9133/10403/890/983/4085/81620/7272/9212/1111/9319/891/4174/9232    15
hsa04114                          991/9133/983/4085/51806/6790/891/9232/3708/5241    10
hsa04218                           2305/4605/9133/890/983/51806/1111/891/776/3708    10
hsa04814                   9493/1062/81930/3832/3833/146909/10112/24137/4629/7802    10
hsa04061                                 3627/10563/6373/4283/6362/6355/9547/1524     8
hsa03320                                       4312/9415/9370/5105/2167/3158/5346     7
hsa04914                                          9133/890/983/4085/6790/891/5241     7
NA                                                                           <NA>    NA
NA.1                                                                         <NA>    NA
NA.2                                                                         <NA>    NA
         minlog10p.adj          x
hsa04110      6.766084 0.14150943
hsa04114      3.064796 0.09433962
hsa04218      2.775635 0.09433962
hsa04814      2.243495 0.09433962
hsa04061      2.770791 0.07547170
hsa03320      2.757662 0.06603774
hsa04914      1.866445 0.06603774
NA                  NA         NA
NA.1                NA         NA
NA.2                NA         NA
>

guidohooiveld avatar Apr 18 '25 15:04 guidohooiveld

I tried the data used in your example code. There is no issue and I got the same dotplot. However, for the project I am working on, pathways included in the dotplot does not match with the results returned by the enrichKEGG function.

Below Kegg_enrich is an object returned by function enrichKEGG.

if I do the following, the data frame will match with dotplot simply because object p only includes data used in generating the dotplot. p <- dotplot(Kegg_enrich) p$data[order(p$data$GeneRatio, decreasing = TRUE)[1:10], c("Description","GeneRatio","Count","p.adjust","qvalue")]

Image

Image

However, if we check the results in kegg_enrich (below), we can notice that Pathways of neurodegeneration - multiple diseases has the highest GeneRatio and the adjusted p-value is <0.05 but was not included in the dotplot.

kegg_enrich@result[order(kegg_enrich@result$Count,decreasing = T)[1:10],c("Description","GeneRatio","Count","p.adjust","qvalue")]

Image

xinming-an avatar Apr 18 '25 17:04 xinming-an

Would you be able to share the object Kegg_enrich? By doing something like save(Kegg_enrich, file="kegg.results.RData")

guidohooiveld avatar Apr 18 '25 19:04 guidohooiveld

I can share it if you can let me know your email.

xinming-an avatar Apr 18 '25 19:04 xinming-an

Mail address provided.

guidohooiveld avatar Apr 18 '25 20:04 guidohooiveld