diffxpy

QUESTION: the p-value for multiple partitions of a data set

Open jingxinfu opened this issue 5 years ago • 2 comments

Hi, I was trying to find DEGs between two conditions while controlling for sample-driven effects. Following the tutorial, I used this script for my analysis.

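import diffxpy.api as de

# Split the data by sample so that each partition is tested separately.
part = de.test.partition(
    data=data_part,
    parts="sample"
)
# Within each partition, run a Wald test on the condition coefficient.
test_part = part.wald(
    formula_loc="~ 1 + condition",
    factor_loc_totest="condition"
)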

Next, I checked how diffxpy combines p-values from the different groups and found this:

        res = pd.DataFrame({
            "gene": self.gene_ids,
            # return minimal pval by gene:
            "pval": np.min(self.pval.reshape(-1, self.pval.shape[-1]), axis=0),
            # return minimal qval by gene:
            "qval": np.min(self.qval.reshape(-1, self.qval.shape[-1]), axis=0),
            # return maximal logFC by gene:
            "log2fc": np.asarray(logfc),
            # return mean expression across all groups by gene:
            "mean": np.asarray(self.mean)
        })

        return res

Could you explain why the minimum p-value across groups is chosen? I suspect that this inflates the number of significant genes. Would other methods, such as Fisher's method https://docs.scipy.org/doc/scipy-0.16.0/reference/generated/scipy.stats.combine_pvalues.html, be better?
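
To make the concern concrete, here is a minimal sketch (not diffxpy code) that compares three per-gene summaries of partition-wise p-values: the raw minimum, a Šidák-adjusted minimum, and Fisher's method. All function names are hypothetical, and both adjusted summaries assume that the partition tests are independent. Under the null with k independent tests, the raw minimum falls below a threshold alpha with probability 1 - (1 - alpha)^k, which is why it is anti-conservative when read as a single p-value.

```python
import math


def min_p(pvals):
    """Raw minimum p-value across partitions (no adjustment)."""
    return min(pvals)


def sidak_min_p(pvals):
    """Minimum p-value adjusted for taking the best of k independent tests."""
    k = len(pvals)
    return 1.0 - (1.0 - min(pvals)) ** k


def fisher_combined_p(pvals):
    """Fisher's method for k independent p-values, each in (0, 1].

    The statistic -2 * sum(log p) is chi-square with 2k degrees of freedom
    under the null. For even degrees of freedom the survival function has
    the closed form exp(-x/2) * sum_{i<k} (x/2)^i / i!.
    """
    k = len(pvals)
    half_stat = -sum(math.log(p) for p in pvals)  # this is x / 2
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


if __name__ == "__main__":
    # Hypothetical p-values for one gene across three sample partitions.
    pvals = [0.01, 0.5, 0.9]
    print("min:", min_p(pvals))
    print("Sidak-adjusted min:", sidak_min_p(pvals))
    print("Fisher:", fisher_combined_p(pvals))
```

The raw minimum highlights genes that respond in at least one partition, whereas Fisher's method pools evidence across all partitions, so the two summaries answer different questions.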

jingxinfu, Jul 10 '20 22:07

Hi @jingxinfu! Thanks for the comment! There are two components here:

  • How to run multiple-testing correction: this depends on whether one uses this functionality as a convenience without actually checking all results, or whether all tests are really considered.
  • How to summarize q-values: this is a data communication question. If you take the minimum, you are considering all tests, so it makes sense to FDR-correct them all together as well. You are then still left with one p-value per test for each gene, and depending on your setting you may be interested in different questions. Essentially, a gene-wise summary table will always simplify things. If you are interested in a different summary, just let me know; this can be added in a very small PR. Happy for feedback on what you would find useful!
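
As an illustration of correcting all tests together before summarizing, here is a minimal sketch that applies Benjamini-Hochberg to the pooled gene-by-partition p-values and then takes the minimum q-value per gene. It is written in plain Python with hypothetical function names and an assumed input layout (one list of per-gene p-values per partition); it is not diffxpy's actual implementation.

```python
def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    qvals = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        qvals[i] = running_min
    return qvals


def min_qval_per_gene(pvals_by_part):
    """Pool all partitions for one joint FDR correction, then summarize.

    pvals_by_part: list of partitions, each a list of per-gene p-values
    (assumed layout; every partition lists the genes in the same order).
    """
    n_parts = len(pvals_by_part)
    n_genes = len(pvals_by_part[0])
    flat = [p for part in pvals_by_part for p in part]
    q_flat = benjamini_hochberg(flat)
    return [
        min(q_flat[k * n_genes + g] for k in range(n_parts))
        for g in range(n_genes)
    ]


if __name__ == "__main__":
    # Two hypothetical partitions, two genes each.
    print(min_qval_per_gene([[0.01, 0.04], [0.03, 0.005]]))
```

Because the correction runs over genes times partitions, the per-gene minimum is taken over q-values that already account for every test that was performed.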

davidsebfischer, Jul 13 '20 09:07

I just want to add that Fisher's method https://docs.scipy.org/doc/scipy-0.16.0/reference/generated/scipy.stats.combine_pvalues.html is not very suitable here: if you want a single p-value per gene, it is much cleaner to write a GLM that covers all of these tests and tests all of these coefficients in a single test. We could still include it as an indication of where something is going on, though.
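
One way to read the single-GLM suggestion, written out as a sketch: the notation below is an assumption for illustration (log link, sample 1 as the reference level, a binary condition indicator $x_i$, and both conditions observed within every sample) and is not taken from diffxpy. For gene $g$, cell $i$ and samples $s = 1, \dots, S$:

```latex
\log \mu_{ig}
  = \beta_{0g}
  + \sum_{s=2}^{S} \alpha_{sg}\,\mathbb{1}[\mathrm{sample}(i) = s]
  + \sum_{s=1}^{S} \gamma_{sg}\,\mathbb{1}[\mathrm{sample}(i) = s]\,x_i

H_0:\ \gamma_{1g} = \gamma_{2g} = \dots = \gamma_{Sg} = 0

W_g = \hat{\gamma}_g^{\top}\,\hat{\Sigma}_g^{-1}\,\hat{\gamma}_g
  \ \overset{\text{approx.}}{\sim}\ \chi^2_{S} \quad \text{under } H_0
```

Here $\hat{\gamma}_g$ collects the $S$ sample-specific condition effects and $\hat{\Sigma}_g$ is their estimated covariance matrix, so the joint Wald test yields one p-value per gene without combining per-partition p-values after the fact.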

davidsebfischer, Jul 13 '20 09:07