diffxpy
diffxpy copied to clipboard
de.test.wilcoxon fails if dataset contains all-zero genes
Ran into this while using diffxpy checking DE between all clusters in a sample:
Usage looks like:
for cluster in np.unique(metadata['clusters']):
in_cluster = metadata['clusters'] == cluster
curr_data = data.loc[in_cluster]
test = de.test.wilcoxon(
data=curr_data,
grouping=metadata['condition'].loc[in_cluster],
gene_names=curr_data.columns)
Solution here was to just remove all empty columns. Only issue is the the error that got raised was:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-115-87abff392829> in <module>
6 data=curr_data ** 2,
7 grouping=metadata['condition'].loc[in_cluster],
----> 8 gene_names=curr_data.columns)
9 results = test.summary()
10 fn = '../files/differential_expression.{}.csv'.format(cluster)
~/.local/lib/python3.7/site-packages/diffxpy/testing/base.py in wilcoxon(data, grouping, gene_names, sample_description, dtype)
2104 data=X.astype(dtype),
2105 grouping=grouping,
-> 2106 gene_names=gene_names,
2107 )
2108
~/.local/lib/python3.7/site-packages/diffxpy/testing/base.py in __init__(self, data, grouping, gene_names)
802
803 self._mean = np.mean(data, axis=0)
--> 804 self._pval = stats.wilcoxon_test(x0=x0.data, x1=x1.data)
805 self._logfc = np.log(np.mean(x1, axis=0)) - np.log(np.mean(x0, axis=0)).data
806 q = self.qval
~/.local/lib/python3.7/site-packages/diffxpy/stats/stats.py in wilcoxon_test(x0, x1)
69 y=x1[:, i].flatten(),
70 alternative='two-sided'
---> 71 ).pvalue for i in range(x0.shape[1])
72 ])
73 return pvals
~/.local/lib/python3.7/site-packages/diffxpy/stats/stats.py in <listcomp>(.0)
69 y=x1[:, i].flatten(),
70 alternative='two-sided'
---> 71 ).pvalue for i in range(x0.shape[1])
72 ])
73 return pvals
~/.local/lib/python3.7/site-packages/scipy/stats/stats.py in mannwhitneyu(x, y, use_continuity, alternative)
5684 T = tiecorrect(ranked)
5685 if T == 0:
-> 5686 raise ValueError('All numbers are identical in mannwhitneyu')
5687 sd = np.sqrt(T * n1 * n2 * (n1+n2+1) / 12.0)
5688
ValueError: All numbers are identical in mannwhitneyu
For inexperienced users, they might not realize that they have an issue with some genes not being expression in some of the clusters. I would either raise a more informative error or throw a warning and handle removing the empty columns.
Hi @dburkhardt! Thanks for the issue! I ll try to rephrase your scenario: Do you want to test each cluster against all other clusters? Or each cluster against every other cluster? We have wrappers that deal with that (de.test.versus_rest, de.test.pairwise) so that you do not have to bother with getting the for loop and subsetting right.
Secondly, the exception thrown by mannwhitneyu. The error that you encountered seems to arise if they are all zero in both clusters. I agree, this should be handled by exception handling. I ll look into that and will return non-signficant in this case ie p=1. Thanks!
I just improved the zero variance handling on the dev branch. I noticed that the problem that you encountered should have been caught by an update that I pushed to dev some time ago and looking at the error trace I think you might be using an earlier version. Could you try using a local installation from the dev branch for diffxpy for now? I ll push another release to master next week probably.
The necessary versions of batchglm and diffxpy are now on dev and master each and on pypi.
Okay thanks, I'll try this. On vacation for a few days but I'll add to my to-do for next week. To explain the use case: I have 8 clusters found by merging datasets from 2 conditions and I'm getting the DE genes within each cluster across conditions. I think this is different from the cluster vs other clusters comparisons you're describing.
Yes, that is different from versus_rest, we handle this with de.test.partition, I'll add a tutorial for that in the next couple of days!
Okay I can confirm that the empty genes error is handled in the latest version on master. It looks like the wilcoxon function has been renamed to rank_test, but this isn't reflected in the readthedocs, yet.
Thanks for the note, this is corrected in rtfd now, improved tutorial will follow soon!