diffxpy icon indicating copy to clipboard operation
diffxpy copied to clipboard

de.test.wilcoxon fails if dataset contains all-zero genes

Open dburkhardt opened this issue 6 years ago • 7 comments
trafficstars

Ran into this while using diffxpy checking DE between all clusters in a sample:

Usage looks like:

for cluster in np.unique(metadata['clusters']):
    in_cluster = metadata['clusters'] == cluster
    curr_data = data.loc[in_cluster]
    test = de.test.wilcoxon(
               data=curr_data,
               grouping=metadata['condition'].loc[in_cluster],
               gene_names=curr_data.columns)

Solution here was to just remove all empty columns. Only issue is the the error that got raised was:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-115-87abff392829> in <module>
      6                data=curr_data ** 2,
      7                grouping=metadata['condition'].loc[in_cluster],
----> 8                gene_names=curr_data.columns)
      9     results = test.summary()
     10     fn = '../files/differential_expression.{}.csv'.format(cluster)

~/.local/lib/python3.7/site-packages/diffxpy/testing/base.py in wilcoxon(data, grouping, gene_names, sample_description, dtype)
   2104         data=X.astype(dtype),
   2105         grouping=grouping,
-> 2106         gene_names=gene_names,
   2107     )
   2108 

~/.local/lib/python3.7/site-packages/diffxpy/testing/base.py in __init__(self, data, grouping, gene_names)
    802 
    803         self._mean = np.mean(data, axis=0)
--> 804         self._pval = stats.wilcoxon_test(x0=x0.data, x1=x1.data)
    805         self._logfc = np.log(np.mean(x1, axis=0)) - np.log(np.mean(x0, axis=0)).data
    806         q = self.qval

~/.local/lib/python3.7/site-packages/diffxpy/stats/stats.py in wilcoxon_test(x0, x1)
     69             y=x1[:, i].flatten(),
     70             alternative='two-sided'
---> 71         ).pvalue for i in range(x0.shape[1])
     72     ])
     73     return pvals

~/.local/lib/python3.7/site-packages/diffxpy/stats/stats.py in <listcomp>(.0)
     69             y=x1[:, i].flatten(),
     70             alternative='two-sided'
---> 71         ).pvalue for i in range(x0.shape[1])
     72     ])
     73     return pvals

~/.local/lib/python3.7/site-packages/scipy/stats/stats.py in mannwhitneyu(x, y, use_continuity, alternative)
   5684     T = tiecorrect(ranked)
   5685     if T == 0:
-> 5686         raise ValueError('All numbers are identical in mannwhitneyu')
   5687     sd = np.sqrt(T * n1 * n2 * (n1+n2+1) / 12.0)
   5688 

ValueError: All numbers are identical in mannwhitneyu

For inexperienced users, they might not realize that they have an issue with some genes not being expression in some of the clusters. I would either raise a more informative error or throw a warning and handle removing the empty columns.

dburkhardt avatar Aug 12 '19 16:08 dburkhardt

Hi @dburkhardt! Thanks for the issue! I ll try to rephrase your scenario: Do you want to test each cluster against all other clusters? Or each cluster against every other cluster? We have wrappers that deal with that (de.test.versus_rest, de.test.pairwise) so that you do not have to bother with getting the for loop and subsetting right.

Secondly, the exception thrown by mannwhitneyu. The error that you encountered seems to arise if they are all zero in both clusters. I agree, this should be handled by exception handling. I ll look into that and will return non-signficant in this case ie p=1. Thanks!

davidsebfischer avatar Aug 14 '19 10:08 davidsebfischer

I just improved the zero variance handling on the dev branch. I noticed that the problem that you encountered should have been caught by an update that I pushed to dev some time ago and looking at the error trace I think you might be using an earlier version. Could you try using a local installation from the dev branch for diffxpy for now? I ll push another release to master next week probably.

davidsebfischer avatar Aug 14 '19 11:08 davidsebfischer

The necessary versions of batchglm and diffxpy are now on dev and master each and on pypi.

davidsebfischer avatar Aug 14 '19 15:08 davidsebfischer

Okay thanks, I'll try this. On vacation for a few days but I'll add to my to-do for next week. To explain the use case: I have 8 clusters found by merging datasets from 2 conditions and I'm getting the DE genes within each cluster across conditions. I think this is different from the cluster vs other clusters comparisons you're describing.

dburkhardt avatar Aug 15 '19 14:08 dburkhardt

Yes, that is different from versus_rest, we handle this with de.test.partition, I'll add a tutorial for that in the next couple of days!

davidsebfischer avatar Aug 15 '19 20:08 davidsebfischer

Okay I can confirm that the empty genes error is handled in the latest version on master. It looks like the wilcoxon function has been renamed to rank_test, but this isn't reflected in the readthedocs, yet.

dburkhardt avatar Aug 22 '19 14:08 dburkhardt

Thanks for the note, this is corrected in rtfd now, improved tutorial will follow soon!

davidsebfischer avatar Aug 25 '19 18:08 davidsebfischer