Questions related to fgsea algorithm

Open lucolotto opened this issue 2 years ago • 1 comments

Hi there! Thanks a lot for your package, it's really a great tool.

I have been using it for many years, but I recently came across a methodological discussion about GSEA methods in general (not specifically your library) that I would really like to be able to answer. I had a look at the R code but could not answer my questions from it, so I am asking here and hoping for your kind reply.

Assuming I am running the following:

library(fgsea)

data(examplePathways)
data(exampleRanks)

fgseaRes <- fgsea(pathways = examplePathways,
                  stats    = exampleRanks,
                  minSize  = 15,
                  maxSize  = 500)

I have the following questions:

  1. Which statistic is used to obtain the p-value for the ES? I.e. Kolmogorov-Smirnov test, t-test, Wilcoxon rank-sum test, or something else?
  2. Which method is used to estimate the null distribution? I.e. gene sampling, phenotype permutation, or something else?
  3. What type of null hypothesis is used? I.e. a competitive or a self-contained null hypothesis?

Looking forward to hearing from you soon. Thanks a lot in advance, Luca

lucolotto, Sep 12 '22 20:09

@lucolotto Hi! Thanks for your interest in this package. I'll try to answer these questions (I hope @assaron will correct me if anything is wrong or missing).

  1. The Enrichment Score (ES) used in GSEA is a weighted Kolmogorov-Smirnov-like statistic. Formally, it is the maximum deviation between two cumulative distribution functions (CDFs): one over the genes in the gene set and one over the genes outside it. The formal definition can be found here. The next figure illustrates the meaning of the ES: [figure: cdf_examples]. Basically, the ES in GSEA weights the CDF of the genes from the gene set (left part of the figure): for each gene from the set, the height of the step is proportional to the absolute value of the statistic for that gene (for example, a t-statistic from a differential expression analysis). In the usual Kolmogorov-Smirnov test, by contrast, all steps have the same height. The CDF for genes outside the set is unchanged (right part of the figure).
  2. The main function, fgsea, from the fgsea package uses gene sampling to estimate empirical P-values: the observed ES is compared with the ES of randomly sampled gene sets of the same size.
  3. GSEA tests the competitive null hypothesis, since the method compares a gene set against the background of all genes not in the set.
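
To make the three points above concrete, here is a minimal base-R sketch of a weighted KS-like ES and a gene-sampling P-value. This is only an illustration under simplifying assumptions, not the package's actual (much more efficient) implementation. The names `enrichment_score`, `gene_sampling_pvalue`, `n_perm`, and the toy data are all made up for this example, and the P-value rule (comparing only against null scores of the same sign, with a +1 correction) is one common convention that I am assuming here.

```r
# Weighted Kolmogorov-Smirnov-like enrichment score for one gene set.
# stats:    named numeric vector of gene-level statistics
# gene_set: character vector of gene names
# p:        weight exponent (p = 0 gives equal step heights, as in plain KS)
enrichment_score <- function(stats, gene_set, p = 1) {
  stats  <- sort(stats, decreasing = TRUE)
  in_set <- names(stats) %in% gene_set
  n_miss <- sum(!in_set)
  stopifnot(any(in_set), n_miss > 0)

  # Genes in the set step up by |stat|^p; genes outside step by a constant.
  w        <- unname(abs(stats)^p)
  hit_cdf  <- cumsum(w * in_set) / sum(w[in_set])
  miss_cdf <- cumsum(!in_set) / n_miss

  # ES is the signed maximum deviation between the two CDFs.
  running <- hit_cdf - miss_cdf
  running[which.max(abs(running))]
}

# Empirical P-value by gene sampling: random gene sets of the same size
# are drawn from all ranked genes (competitive null hypothesis).
gene_sampling_pvalue <- function(stats, gene_set, n_perm = 1000, p = 1) {
  observed <- enrichment_score(stats, gene_set, p)
  k        <- sum(names(stats) %in% gene_set)
  null_es  <- replicate(
    n_perm,
    enrichment_score(stats, sample(names(stats), k), p)
  )

  # Assumed convention: compare only with null scores of the same sign.
  if (observed >= 0) {
    pval <- (sum(null_es >= observed) + 1) / (sum(null_es >= 0) + 1)
  } else {
    pval <- (sum(null_es <= observed) + 1) / (sum(null_es < 0) + 1)
  }
  list(ES = observed, pval = pval)
}

# Toy data (hypothetical): 200 genes, gene set = the 15 top-ranked genes.
set.seed(42)
toy_stats <- setNames(rnorm(200), paste0("gene", 1:200))
toy_set   <- names(sort(toy_stats, decreasing = TRUE))[1:15]
res       <- gene_sampling_pvalue(toy_stats, toy_set, n_perm = 500)
print(res)
```

Because the toy set consists of the 15 top-ranked genes, the running sum reaches its maximum right after the last of them, so the ES sits at its upper bound and the sampled null sets almost never match it. Setting `p = 0` gives every gene in the set the same step height, which corresponds to the unweighted KS-like case described in point 1.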

vdsukhov, Sep 20 '22 23:09