
adjustment of p-value sensitivity in enrichment table results?

andvon opened this issue 1 year ago • 10 comments

Thanks for all your work on this package!

I have a question regarding the p-values that are reported after using rGREAT::getEnrichmentTable. When running several datasets through the local analysis with rGREAT, the enrichment tables that are reported often contain several entries at the beginning that have a p-value of 0 and, subsequently, a p-adjusted value of 0 as well.

Does this have to do with rounding in the report in some way? This effect can be seen in the test dataset as well.

library(rGREAT)
obj <- readRDS(system.file("extdata", "GreatObject.rds", package = "rGREAT"))
getEnrichmentTable(obj)

Returns

                                           id genome_fraction observed_region_hits fold_enrichment      p_value     p_adjust mean_tss_dist observed_gene_hits gene_set_size
1            HALLMARK_TNFA_SIGNALING_VIA_NFKB     0.024706189                  225       2.4298372 0.000000e+00 0.000000e+00        136782                 96           197
2          HALLMARK_INTERFERON_GAMMA_RESPONSE     0.016214887                  139       2.2871850 0.000000e+00 0.000000e+00        110305                 68           189
3                        HALLMARK_P53_PATHWAY     0.016053003                  137       2.2770087 0.000000e+00 0.000000e+00         94877                 74           197
4                         HALLMARK_DNA_REPAIR     0.008096888                   78       2.5702591 2.805534e-13 3.506917e-12        102965                 41           148
5                HALLMARK_ALLOGRAFT_REJECTION     0.016780458                  124       1.9715977 4.112710e-12 4.112710e-11        109342                 63           186
6                   HALLMARK_MTORC1_SIGNALING     0.016315292                  116       1.8969835 1.875065e-10 1.562554e-09         71572                 57           200

Thank you for your help!

andvon avatar Jan 08 '24 22:01 andvon

@andvon May I ask whether this problem has been solved? I ran into the same issue and tried modifying ".Machine$double.eps" and "options(digits)", but neither works.

liangliangjin avatar Jul 28 '24 19:07 liangliangjin

@liangliangjin I have not found a way around this yet, though I haven't really had the chance to look at it in recent days. Thanks for reminding me about this though; if I come across anything I will let you know!

andvon avatar Jul 29 '24 02:07 andvon

@andvon I checked the rGREAT source code, and the key code computing the p-value is a simple pbinom call. I also searched around and suspect the problem comes from the underlying numerics of R itself; see for example the comments on this (Chinese) forum thread: https://d.cosx.org/d/109642-109642. I confirmed in R that expressions like "1 - pbinom(22, 48, 0.05)" return 0: once the true value drops below roughly 2.2e-16, no display option or package can recover it. I also noticed in the documentation for the web version of GREAT that p-values can reach 1e-324 (https://great-help.atlassian.net/wiki/spaces/GREAT/pages/655456/Statistics); maybe their source code is written in C/C++ or some other language? Since the species I study are not model animals, I cannot use the web version, so I have no way around this in R.
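A minimal sketch of that underflow, using the same numbers as above:

```r
# pbinom(22, 48, 0.05) is so close to 1 that it rounds to exactly 1.0 in
# double precision, so the subtraction cancels to zero.
1 - pbinom(22, 48, 0.05)
# returns: 0

# Asking directly for the upper tail avoids the cancellation and gives a
# tiny but nonzero probability.
pbinom(22, 48, 0.05, lower.tail = FALSE)
```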

liangliangjin avatar Jul 30 '24 09:07 liangliangjin

@liangliangjin

Thanks for digging into the code! That convinced me to give it a look as well. I'll preface this by saying I'm not a statistician, but I have been able to replicate each of the p-values through this method. Many of those distribution functions in R have a log.p argument which can be set to get these lower p-values. However, this typically gives a negative value. If you multiply that by -1, you get matching values. Here is some test code that does this:

library(rGREAT)
library(tidyverse)

obj <- readRDS(system.file("extdata", "GreatObject.rds", package = "rGREAT"))

getEnrichmentTable(obj) |>
	tibble::as_tibble() |>
	dplyr::select(id,genome_fraction,observed_region_hits,p_value) |>
	dplyr::mutate(
		updated_pvalue = -pbinom(
			q = observed_region_hits - 1,
			size = attr(obj,which = "n_total"),
			prob = genome_fraction,
			lower.tail = T,
			log.p = T
		)
	)

The results for this look like the following:

# A tibble: 50 × 5
   id                                 genome_fraction observed_region_hits  p_value updated_pvalue
   <chr>                                        <dbl>                <dbl>    <dbl>          <dbl>
 1 HALLMARK_TNFA_SIGNALING_VIA_NFKB           0.0247                   225 0              2.23e-32
 2 HALLMARK_INTERFERON_GAMMA_RESPONSE         0.0162                   139 0              2.78e-18
 3 HALLMARK_P53_PATHWAY                       0.0161                   137 0              6.91e-18
 4 HALLMARK_DNA_REPAIR                        0.00810                   78 2.81e-13       2.81e-13
 5 HALLMARK_ALLOGRAFT_REJECTION               0.0168                   124 4.11e-12       4.11e-12
 6 HALLMARK_MTORC1_SIGNALING                  0.0163                   116 1.88e-10       1.88e-10
 7 HALLMARK_HYPOXIA                           0.0226                   145 9.68e-10       9.68e-10
 8 HALLMARK_INTERFERON_ALPHA_RESPONSE         0.00633                   58 1.72e- 9       1.72e- 9
 9 HALLMARK_MYC_TARGETS_V2                    0.00262                   32 1.57e- 8       1.57e- 8
10 HALLMARK_IL6_JAK_STAT3_SIGNALING           0.00588                   52 3.52e- 8       3.52e- 8

So I think this replicates the original calculation, but uses log(p) to get the answer instead!

andvon avatar Jul 30 '24 13:07 andvon

@andvon Amazing, although I knew about the log parameter, I never thought of using it to recover the original value. This is very useful to me: in my data, the original ranking did not accurately reflect the true p-value order. I will use the updated values to plot the figures. Thank you again for your contribution.

liangliangjin avatar Jul 30 '24 14:07 liangliangjin

@liangliangjin I'm glad you brought it to my attention, this kinda slipped my mind and it is something I have been meaning to get back to. Glad we could workshop a solution together, it will help my own processes as well. Best of luck!

andvon avatar Jul 30 '24 15:07 andvon

you are the goat!!!!

Z-H-Zhang avatar Aug 08 '24 01:08 Z-H-Zhang

@Z-H-Zhang @liangliangjin

I was looking through this again, and the adjustment needs a modification. The top values lined up well, but the later (larger) p-values became inaccurate.

# Old modification
-pbinom(
  q = observed_region_hits - 1,
  size = attr(obj, which = "n_total"),
  prob = genome_fraction,
  lower.tail = T,
  log.p = T
)

# Use the following instead:
exp(
  pbinom(
    q = observed_region_hits - 1,
    size = attr(obj,which = "n_total"),
    prob = genome_fraction,
    lower.tail = F,
    log.p = T
   )
)

Comparing it against the full set of the original test results:

           new          old                                                  
1  2.225815e-32 0.000000e+00
2  2.776160e-18 0.000000e+00
3  6.913420e-18 0.000000e+00
4  2.805488e-13 2.805534e-13
5  4.112724e-12 4.112710e-12
6  1.875064e-10 1.875065e-10
7  9.682692e-10 9.682692e-10
8  1.717376e-09 1.717376e-09
9  1.566151e-08 1.566151e-08
10 3.516391e-08 3.516391e-08
11 4.605719e-08 4.605719e-08
12 2.334794e-07 2.334794e-07
13 3.937057e-07 3.937057e-07
14 2.702549e-06 2.702549e-06
15 2.367115e-05 2.367115e-05
16 3.751312e-05 3.751312e-05
17 4.754000e-05 4.754000e-05
18 1.182907e-04 1.182907e-04
19 5.712321e-04 5.712321e-04
20 7.862261e-04 7.862261e-04
21 1.226371e-03 1.226371e-03
22 3.535443e-03 3.535443e-03
23 3.819475e-03 3.819475e-03
24 1.807399e-02 1.807399e-02
25 2.088977e-02 2.088977e-02
26 2.577342e-02 2.577342e-02
27 2.718872e-02 2.718872e-02
28 3.144824e-02 3.144824e-02
29 3.602869e-02 3.602869e-02
30 3.681093e-02 3.681093e-02
31 4.021162e-02 4.021162e-02
32 4.449403e-02 4.449403e-02
33 7.449392e-02 7.449392e-02
34 7.912587e-02 7.912587e-02
35 2.297476e-01 2.297476e-01
36 2.318957e-01 2.318957e-01
37 2.804896e-01 2.804896e-01
38 2.822974e-01 2.822974e-01
39 3.733741e-01 3.733741e-01
40 4.979006e-01 4.979006e-01
41 5.763953e-01 5.763953e-01
42 7.608707e-01 7.608707e-01
43 8.282246e-01 8.282246e-01
44 8.880386e-01 8.880386e-01
45 9.122784e-01 9.122784e-01
46 9.422727e-01 9.422727e-01
47 9.766420e-01 9.766420e-01
48 9.997948e-01 9.997948e-01
49 1.000000e+00 1.000000e+00
50 1.000000e+00 1.000000e+00

Now everything lines up almost perfectly.
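To sketch why the first version only matched at the top of the table (with made-up numbers): -pbinom(..., lower.tail = T, log.p = T) computes -log(1 - p), which approximates the upper-tail p only when p is very small.

```r
# Moderate upper-tail probability: the two formulas visibly disagree.
q <- 5; n <- 10; pr <- 0.4

p_exact <- pbinom(q, n, pr, lower.tail = FALSE)              # true P(X > q)
p_old <- -pbinom(q, n, pr, lower.tail = TRUE, log.p = TRUE)  # -log(1 - p)
p_new <- exp(pbinom(q, n, pr, lower.tail = FALSE, log.p = TRUE))

p_exact  # about 0.166
p_old    # about 0.182, off because -log(1 - p) != p for moderate p
p_new    # matches p_exact up to floating-point rounding
```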

andvon avatar Aug 09 '24 16:08 andvon

@andvon Thank you for the improvement. I had also noticed this problem before, so I only replaced the top data. By the way, very small p-values are still not displayed; the smallest value observed in my data is 6.4e-317, and anything smaller than this is shown as 0, which was the problem I initially mentioned.

e.g.,

exp(pbinom(q = 117856 - 1, size = 179513, prob = 0.60861892, lower.tail = F, log.p = T))
# returns: 0

Therefore, from a non-computational perspective, it seems appropriate to use a more fine-grained classification so that each test involves fewer input regions. If filtering results from a large number of input regions is necessary, considering p-adjust together with the fold enrichment values is also a good choice.

liangliangjin avatar Aug 10 '24 15:08 liangliangjin

@liangliangjin

Sounds like you are hitting the limits of floating-point arithmetic when you can't get past values like 1e-317. There is still a way around this: change your reporting format. Rather than reporting 1e-317, you could report the logged value. If you have ever run HOMER, you may have seen this; they provide a "normal" representation of the p-value, but their main representation is the natural log of the p-value.

I tend to prefer the log10 since it is a bit easier to interpret, which you could do in this way:

pbinom(q = 117856 - 1,size = 179513,prob = 0.60861892,lower.tail = F,log.p = T) / log(10)

# returns: -383.8804

You'd just keep it in that format and sort based on that criteria instead.

Validation of that:

pbinom(q = 5,size = 10,prob = 0.1,lower.tail = F,log.p = F)
# returns: 0.0001469026

exp_val <- pbinom(q = 5,size = 10,prob = 0.1,lower.tail = F,log.p = T)
# returns: -8.825741

exp(exp_val)
# returns:  0.0001469026

# Convert to log10
log10_val <- exp_val/log(10)
# returns: -3.832971

# Show actual p-value
10^log10_val
# returns: 0.0001469026

It probably is worthwhile to modify the pull request there to add a log or log10 P-value column for binom and hyper, but I didn't look that closely to see what additional modifications would need to be made.

edit: The tricky part to all of this would be incorporating the adjusted p-value. That may take additional, manual steps to incorporate your method properly!

For Bonferroni, I think you could probably just do something like

log10_val + log10(number_of_tests)
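A small sketch of that idea (the numbers here are made up; log10 p-values are negative, and a Bonferroni-adjusted p-value is capped at 1, i.e. at 0 on the log10 scale):

```r
# Hypothetical log10 p-values; more negative = more significant.
log10_p <- c(-383.88, -31.65, -0.52)
n_tests <- 50

# Bonferroni: p_adj = min(p * n, 1), which in log10 space becomes
# log10(p) + log10(n), capped at log10(1) = 0.
log10_p_adj <- pmin(log10_p + log10(n_tests), 0)
log10_p_adj
# approximately c(-382.18103, -29.95103, 0)
```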

andvon avatar Aug 11 '24 14:08 andvon