diffxpy
QUESTION: the p-value for multiple partitions of a data set
Hi, I was trying to find DEGs between two conditions while controlling for sample-driven effects. Following the tutorial, I used this script to conduct my analysis:
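import diffxpy.api as de

# Split the data by sample so that a separate model is fit per partition.
part = de.test.partition(
    data=data_part,
    parts="sample"
)
# Run a Wald test for the condition effect within every partition.
test_part = part.wald(
    formula_loc="~ 1 + condition",
    factor_loc_totest="condition"
)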
Next, I checked how diffxpy combines p-values from the different groups and found this:
res = pd.DataFrame({
    "gene": self.gene_ids,
    # return minimal pval by gene:
    "pval": np.min(self.pval.reshape(-1, self.pval.shape[-1]), axis=0),
    # return minimal qval by gene:
    "qval": np.min(self.qval.reshape(-1, self.qval.shape[-1]), axis=0),
    # return maximal logFC by gene:
    "log2fc": np.asarray(logfc),
    # return mean expression across all groups by gene:
    "mean": np.asarray(self.mean)
})
return res
Could you explain why the minimum p-value across groups is chosen?
I suspect that this might inflate the number of significant genes.
Would other methods, like Fisher's method https://docs.scipy.org/doc/scipy-0.16.0/reference/generated/scipy.stats.combine_pvalues.html, be better?
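To make the concern concrete, here is a minimal sketch that contrasts the minimum p-value with Fisher's combined p-value for a single gene. It uses only the standard library and assumes the per-partition tests are independent; the helper names and the example p-values are hypothetical and not part of diffxpy. The closed-form chi-square tail used here should agree with what `scipy.stats.combine_pvalues` reports for Fisher's method.

```python
import math


def min_pvalue_false_positive_rate(alpha, n_parts):
    """Chance that the smallest of n_parts independent null p-values is <= alpha."""
    return 1.0 - (1.0 - alpha) ** n_parts


def fisher_combine(pvals):
    """Fisher's method for independent p-values in (0, 1].

    The statistic X = -2 * sum(log p) follows a chi-square distribution with
    2k degrees of freedom under the null. For even degrees of freedom the
    survival function is exp(-X/2) * sum_{i<k} (X/2)^i / i!.
    """
    k = len(pvals)
    half_stat = -sum(math.log(p) for p in pvals)  # this is X / 2
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


# Hypothetical p-values of one gene in three partitions (illustration only).
gene_pvals = [0.04, 0.30, 0.60]
print("minimum p-value:", min(gene_pvals))
print("Fisher combined p-value:", fisher_combine(gene_pvals))
print("false positive rate of min at alpha=0.05:",
      min_pvalue_false_positive_rate(0.05, len(gene_pvals)))
```

With uncorrected p-values, the minimum over k partitions falls below alpha with probability 1 - (1 - alpha)^k under the null, which is why the way the tests are corrected matters for how the minimum should be read.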
Hi @jingxinfu! Thanks for the comment! There are two components here:
- How to run multiple testing correction: this depends a bit on whether one uses this functionality as a convenience without actually checking all results, or whether all tests are really considered.
- How to summarize q-values: this is a data communication question. In the case of the minimum, one considers all tests, so it would make sense to FDR-correct all of them together as well. You are then still left with #tests p-values per gene, and depending on your setting you may be interested in different questions here. Essentially, a gene-wise summary table will always simplify things.
If you are interested in a different summary, just let me know; this can be added in very small PRs! Happy for feedback on what you would find useful!
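As a rough illustration of "FDR-correct all together, then summarize by gene", the sketch below applies a Benjamini-Hochberg correction to the pooled p-values of all partitions and then takes the minimum q-value per gene. It is plain Python and not diffxpy's implementation; the function names and the input layout (one list of p-values per partition, genes in the same order) are assumptions made for the example.

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    qvals = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value so that the adjusted
    # values stay monotone in the rank.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m / rank)
        qvals[idx] = running_min
    return qvals


def summarise_min_qval(pvals_by_part):
    """Correct all partition-by-gene tests jointly, then take the min q-value per gene.

    pvals_by_part: list with one list of p-values per partition, each of
    length n_genes (assumed layout, for illustration only).
    """
    n_genes = len(pvals_by_part[0])
    flat = [p for part in pvals_by_part for p in part]
    q_flat = benjamini_hochberg(flat)
    q_by_part = [q_flat[i * n_genes:(i + 1) * n_genes]
                 for i in range(len(pvals_by_part))]
    return [min(gene_qvals) for gene_qvals in zip(*q_by_part)]


# Hypothetical p-values: two partitions, two genes.
print(summarise_min_qval([[0.01, 0.04], [0.03, 0.005]]))
```

Because the correction runs over every partition-by-gene test at once, the minimum q-value per gene already accounts for the number of partitions, which is the reading of the minimum suggested above.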
I just want to add that Fisher's method https://docs.scipy.org/doc/scipy-0.16.0/reference/generated/scipy.stats.combine_pvalues.html is not very suitable here: if you want a single p-value per gene, it's much cleaner to write up a GLM that covers all of these tests and tests all of these coefficients in a single test! We could still include it as an indication of where stuff is going on, though.