rGREAT
Adjustment of p-value sensitivity in enrichment table results?
Thanks for all your work on this package!
I have a question regarding the p-values that are reported after using rGREAT::getEnrichmentTable(). When running several datasets through the local analysis with rGREAT, the enrichment tables often contain several entries at the top with a p-value of 0 and, consequently, an adjusted p-value of 0 as well.
Does this have to do with rounding in the report in some way? The effect can be seen in the test dataset as well.
library(rGREAT)
obj <- readRDS(system.file("extdata", "GreatObject.rds", package = "rGREAT"))
getEnrichmentTable(obj)
Returns
id genome_fraction observed_region_hits fold_enrichment p_value p_adjust mean_tss_dist observed_gene_hits gene_set_size
1 HALLMARK_TNFA_SIGNALING_VIA_NFKB 0.024706189 225 2.4298372 0.000000e+00 0.000000e+00 136782 96 197
2 HALLMARK_INTERFERON_GAMMA_RESPONSE 0.016214887 139 2.2871850 0.000000e+00 0.000000e+00 110305 68 189
3 HALLMARK_P53_PATHWAY 0.016053003 137 2.2770087 0.000000e+00 0.000000e+00 94877 74 197
4 HALLMARK_DNA_REPAIR 0.008096888 78 2.5702591 2.805534e-13 3.506917e-12 102965 41 148
5 HALLMARK_ALLOGRAFT_REJECTION 0.016780458 124 1.9715977 4.112710e-12 4.112710e-11 109342 63 186
6 HALLMARK_MTORC1_SIGNALING 0.016315292 116 1.8969835 1.875065e-10 1.562554e-09 71572 57 200
Thank you for your help!
@andvon May I ask if this problem has been solved? I ran into the same issue and tried modifying ".Machine$double.eps" and "options(digits)", but neither worked.
@liangliangjin I have not found a way around this yet, though I haven't really had the chance to look at it in recent days. Thanks for reminding me about this though; if I come across anything I will let you know!
@andvon I checked the rGREAT source code, and key code related to the p value is a simple ‘pbinom’ function. In addition, I searched some other webdata and guessed that this problem originated from the underlying logic of R language. For example the comments under the Chinese web pages: https://d.cosx.org/d/109642-109642, I have been confirmed in R such as "1-pbinom(22,48,0.05)=0", if the result is less than 2.2e-16, it is difficult to change the display result by any way or package. I also noticed at the GREAT Documentation of the web version that the p-values can reach 1E-324 (https://great-help.atlassian.net/wiki/spaces/GREAT/pages/655456/Statistics), maybe their source code is based on C(++) or some other programming language? Since the species I study are not model animals, there is no way to solve this in R language.
@liangliangjin
Thanks for digging into the code! That convinced me to give it a look as well. I'll preface this by saying I'm not a statistician, but I have been able to replicate each of the p-values through this method. Many of the distribution functions in R have a log.p argument which can be set to get these lower p-values. However, this typically gives a negative value. If you multiply that by -1, you get matching values. Here is some test code that does this:
library(rGREAT)
library(tidyverse)
obj <- readRDS(system.file("extdata", "GreatObject.rds", package = "rGREAT"))
getEnrichmentTable(obj) |>
tibble::as_tibble() |>
dplyr::select(id,genome_fraction,observed_region_hits,p_value) |>
dplyr::mutate(
updated_pvalue = -pbinom(
q = observed_region_hits - 1,
size = attr(obj,which = "n_total"),
prob = genome_fraction,
lower.tail = T,
log.p = T
)
)
The results for this look like the following:
# A tibble: 50 × 5
id genome_fraction observed_region_hits p_value updated_pvalue
<chr> <dbl> <dbl> <dbl> <dbl>
1 HALLMARK_TNFA_SIGNALING_VIA_NFKB 0.0247 225 0 2.23e-32
2 HALLMARK_INTERFERON_GAMMA_RESPONSE 0.0162 139 0 2.78e-18
3 HALLMARK_P53_PATHWAY 0.0161 137 0 6.91e-18
4 HALLMARK_DNA_REPAIR 0.00810 78 2.81e-13 2.81e-13
5 HALLMARK_ALLOGRAFT_REJECTION 0.0168 124 4.11e-12 4.11e-12
6 HALLMARK_MTORC1_SIGNALING 0.0163 116 1.88e-10 1.88e-10
7 HALLMARK_HYPOXIA 0.0226 145 9.68e-10 9.68e-10
8 HALLMARK_INTERFERON_ALPHA_RESPONSE 0.00633 58 1.72e- 9 1.72e- 9
9 HALLMARK_MYC_TARGETS_V2 0.00262 32 1.57e- 8 1.57e- 8
10 HALLMARK_IL6_JAK_STAT3_SIGNALING 0.00588 52 3.52e- 8 3.52e- 8
So I think this is replicating the original calculation, but using log(p) to get the answer instead!
@andvon Amazing! Although I knew about the log parameter, I never thought of using it to recover the original value. This is very useful to me: in my data the original ranking does not accurately reflect the true p-value order. I will use the updated values for my figures. Thank you again for your contribution.
@liangliangjin I'm glad you brought it to my attention, this kinda slipped my mind and it is something I have been meaning to get back to. Glad we could workshop a solution together, it will help my own processes as well. Best of luck!
you are the goat!!!!
@Z-H-Zhang @liangliangjin
I was looking through this again, and the earlier calculation needs a modification. The top values lined up well, but the later values became inaccurate: -pbinom(..., lower.tail = T, log.p = T) is -log(1 - p), which only approximates p when p is small. Taking the upper tail directly and exponentiating gives the exact value.
# Old modification
-pbinom(
q = observed_region_hits - 1,
size = attr(obj,which = "n_total"),
prob = genome_fraction,
lower.tail = T,
log.p = T
)
# Use the following instead:
exp(
pbinom(
q = observed_region_hits - 1,
size = attr(obj,which = "n_total"),
prob = genome_fraction,
lower.tail = F,
log.p = T
)
)
Comparing it against the full set of the original test results:
new old
1 2.225815e-32 0.000000e+00
2 2.776160e-18 0.000000e+00
3 6.913420e-18 0.000000e+00
4 2.805488e-13 2.805534e-13
5 4.112724e-12 4.112710e-12
6 1.875064e-10 1.875065e-10
7 9.682692e-10 9.682692e-10
8 1.717376e-09 1.717376e-09
9 1.566151e-08 1.566151e-08
10 3.516391e-08 3.516391e-08
11 4.605719e-08 4.605719e-08
12 2.334794e-07 2.334794e-07
13 3.937057e-07 3.937057e-07
14 2.702549e-06 2.702549e-06
15 2.367115e-05 2.367115e-05
16 3.751312e-05 3.751312e-05
17 4.754000e-05 4.754000e-05
18 1.182907e-04 1.182907e-04
19 5.712321e-04 5.712321e-04
20 7.862261e-04 7.862261e-04
21 1.226371e-03 1.226371e-03
22 3.535443e-03 3.535443e-03
23 3.819475e-03 3.819475e-03
24 1.807399e-02 1.807399e-02
25 2.088977e-02 2.088977e-02
26 2.577342e-02 2.577342e-02
27 2.718872e-02 2.718872e-02
28 3.144824e-02 3.144824e-02
29 3.602869e-02 3.602869e-02
30 3.681093e-02 3.681093e-02
31 4.021162e-02 4.021162e-02
32 4.449403e-02 4.449403e-02
33 7.449392e-02 7.449392e-02
34 7.912587e-02 7.912587e-02
35 2.297476e-01 2.297476e-01
36 2.318957e-01 2.318957e-01
37 2.804896e-01 2.804896e-01
38 2.822974e-01 2.822974e-01
39 3.733741e-01 3.733741e-01
40 4.979006e-01 4.979006e-01
41 5.763953e-01 5.763953e-01
42 7.608707e-01 7.608707e-01
43 8.282246e-01 8.282246e-01
44 8.880386e-01 8.880386e-01
45 9.122784e-01 9.122784e-01
46 9.422727e-01 9.422727e-01
47 9.766420e-01 9.766420e-01
48 9.997948e-01 9.997948e-01
49 1.000000e+00 1.000000e+00
50 1.000000e+00 1.000000e+00
Now everything lines up almost perfectly.
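To see why the old expression drifts once p is no longer tiny, here is a standalone sketch with toy numbers (not from the rGREAT object): -log(P(X <= q)) = -log(1 - p), which matches p only in the far tail, while exponentiating the log upper tail recovers p exactly.

```r
# Upper-tail p-value for a modest case: P(X >= 6) with X ~ Binomial(10, 0.5)
exact <- pbinom(5, 10, 0.5, lower.tail = FALSE)
print(exact)   # prints 0.3769531  (= 386/1024)

# Old approach: -log(CDF) = -log(1 - p); overshoots because
# -log(1 - p) ~= p + p^2/2 + ... once p is not small
old <- -pbinom(5, 10, 0.5, lower.tail = TRUE, log.p = TRUE)
print(old > exact)   # prints TRUE

# New approach: exp of the log upper tail reproduces the exact value
new <- exp(pbinom(5, 10, 0.5, lower.tail = FALSE, log.p = TRUE))
print(abs(new - exact) < 1e-12)   # prints TRUE
```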
@andvon Thank you for the improvement. I had also noticed this problem before, so I only replaced the top data. By the way, very small p-values are still not displayed; the smallest value observed in my data is 6.4e-317, and anything smaller than this is shown as 0, which was the problem I initially mentioned.
e.g.,
exp(pbinom(q = 117856 - 1, size = 179513, prob = 0.60861892, lower.tail = F, log.p = T))
# [1] 0
Therefore, from a non-computational perspective, it may be appropriate to use a more fine-grained classification with fewer input regions. If it is necessary to filter results from a large number of input regions, it is also a good choice to consider the adjusted p-value together with the fold enrichment values.
@liangliangjin
Sounds like you are hitting the limits of floating point arithmetic when you can't get past values like 1e-317. There is still a way around this; you can change your reporting format. Rather than report it as 1e-317, you could just report the log'd value. If you have ever run HOMER, you may have seen this. They provide a "normal" representation of the p-value, but their main representation is to report the natural log of the p-value.
I tend to prefer the log10 since it is a bit easier to interpret, which you could do in this way:
pbinom(q = 117856 - 1,size = 179513,prob = 0.60861892,lower.tail = F,log.p = T) / log(10)
# returns: -383.8804
You'd just keep it in that format and sort based on that criteria instead.
Validation of that:
pbinom(q = 5,size = 10,prob = 0.1,lower.tail = F,log.p = F)
# returns: 0.0001469026
exp_val <- pbinom(q = 5,size = 10,prob = 0.1,lower.tail = F,log.p = T)
# returns: -8.825741
exp(exp_val)
# returns: 0.0001469026
# Convert to log10
log10_val <- exp_val/log(10)
# returns: -3.832971
# Show actual p-value
10^log10_val
# returns: 0.0001469026
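Putting that into practice, here is a minimal sketch of sorting on the log10 value instead of the raw p-value (the data frame and column names are my own toy example, not an rGREAT structure); the log10_p column would come from pbinom(..., lower.tail = FALSE, log.p = TRUE) / log(10) as shown above:

```r
# Toy table standing in for an enrichment result; log10_p stays finite
# even where the raw p-value itself would print as 0.
res <- data.frame(
  id      = c("setB", "setA", "setC"),
  log10_p = c(-12.3, -383.9, -1.7)
)

# Sort ascending: the most negative log10(p) is the most significant hit.
res <- res[order(res$log10_p), ]
print(res$id)
# returns: "setA" "setB" "setC"
```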
It probably is worthwhile to modify the pull request there to add a log or log10 P-value column for binom and hyper, but I didn't look that closely to see what additional modifications would need to be made.
edit: The tricky part to all of this would be incorporating the adjusted p-value. That may take additional, manual steps to incorporate your method properly!
For Bonferroni, I think you could probably just do something like
log10_val + log10(number_of_tests)
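A minimal sketch of that idea (my own helper function, not part of rGREAT): Bonferroni multiplies p by the number of tests, which on the log10 scale is just an addition, capped at log10(1) = 0 so the adjusted value never exceeds a p of 1:

```r
# Bonferroni on the log10 scale:
# log10(min(p * m, 1)) = min(log10(p) + log10(m), 0)
log10_bonferroni <- function(log10_p, n_tests) {
  pmin(log10_p + log10(n_tests), 0)
}

log10_bonferroni(-383.88, 50)  # still astronomically significant, about -382.18
log10_bonferroni(-1, 50)       # would adjust past 1, so capped at log10(1)
# returns: 0
```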