
de.test.pairwise very slow

Open aopisco opened this issue 6 years ago • 5 comments

@davidsebfischer do you have any plans for speeding up pairwise test?

Currently I'm trying it on an AnnData object with n_obs × n_vars = 1740 × 5829, but it is taking a really long time.

I'm using the same code as in your notebook:

test = de.test.pairwise(
    data=tiss,
    grouping="batch",
    test="z-test",
    noise_model="nb",
    sample_description=sample_description)

aopisco avatar Dec 19 '18 18:12 aopisco

Hi @aopisco, could you please share some information about your setup?

import batchglm
print(batchglm.__version__)
import diffxpy
print(diffxpy.__version__)

Also, do you use sparse AnnData or dense? You are already using a z-test, so there should be only one model fitting necessary. Therefore, if I had to guess, I'd assume that you are using a sparse AnnData object. This can really slow down calculations, so since your dataset is not very large it should not be a problem to convert it into a dense array (tiss.X = tiss.X.toarray())
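To check which case applies before converting, something along these lines may help. It is a minimal sketch: `to_dense` is a hypothetical helper (not part of diffxpy or AnnData), and the small matrix stands in for `tiss.X`.

```python
import numpy as np
import scipy.sparse as sp


def to_dense(X):
    """Return X as a dense NumPy array, converting only if it is sparse."""
    if sp.issparse(X):
        return X.toarray()
    return np.asarray(X)


# Toy stand-in for `tiss.X`: a small sparse count matrix (cells x genes).
X_sparse = sp.csr_matrix(np.array([[0, 3, 0], [1, 0, 0]]))
X_dense = to_dense(X_sparse)
print(type(X_dense).__name__, X_dense.shape)
```

With a real object this would be `tiss.X = to_dense(tiss.X)`. A 1740 × 5829 matrix has roughly 10 million entries, so the dense copy should fit comfortably in memory.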

Besides that, what hardware are you using? Did you read the performance guide / install optimized versions of TensorFlow and NumPy?

Hoeze avatar Dec 19 '18 20:12 Hoeze

Hi @aopisco, thanks for reporting the issue! I am about to roll out a new version of the backend (batchglm), in the first week of January at the latest; this will also fix some remaining run-time bottlenecks. Right now training takes long in some cases because the optimizer hyperparameters are not yet ideal for all scenarios; this will be improved in the new batchglm version. It would be great if you could report the versions and your setup in any case! If you haven't optimized TensorFlow yet, don't do it just yet: it takes a long time in many cases, and I have a feeling that this is a different issue.

davidsebfischer avatar Dec 19 '18 21:12 davidsebfischer

@Hoeze converting to a dense array made a huge difference, thanks for the suggestion. Regarding versions, I'm using:

import batchglm
print(batchglm.__version__)
v0.4.1+2.g63763e7
import diffxpy
print(diffxpy.__version__)
v0.4.2+49.g6f4ebc6

Now that I changed to test="wilcoxon", it gives:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-60-ed807d591abd> in <module>()
      4     test="wilcoxon",
      5 #     noise_model="nb",
----> 6     sample_description=sample_description
      7 )

~/maca-scanpy/diffxpy/diffxpy/testing/base.py in pairwise(data, grouping, as_numeric, test, lazy, gene_names, sample_description, noise_model, pval_correction, size_factors, batch_size, training_strategy, quick_scale, dtype, keep_full_test_objs, **kwargs)
   3477                     quick_scale=quick_scale,
   3478                     dtype=dtype,
-> 3479                     **kwargs
   3480                 )
   3481                 pvals[i, j] = de_test_temp.pval

~/maca-scanpy/diffxpy/diffxpy/testing/base.py in two_sample(data, grouping, as_numeric, test, gene_names, sample_description, noise_model, size_factors, batch_size, training_strategy, quick_scale, dtype, **kwargs)
   3275             gene_names=gene_names,
   3276             grouping=grouping,
-> 3277             dtype=dtype
   3278         )
   3279     else:

~/maca-scanpy/diffxpy/diffxpy/testing/base.py in wilcoxon(data, grouping, gene_names, sample_description, dtype)
   3095         data=X.astype(dtype),
   3096         grouping=grouping,
-> 3097         gene_names=gene_names,
   3098     )
   3099 

~/maca-scanpy/diffxpy/diffxpy/testing/base.py in __init__(self, data, grouping, gene_names)
    882 
    883         self._mean = np.mean(data, axis=0)
--> 884         self._pval = stats.wilcoxon_test(x0=x0.data, x1=x1.data)
    885         self._logfc = np.log(np.mean(x1, axis=0)) - np.log(np.mean(x0, axis=0)).data
    886         q = self.qval

~/maca-scanpy/diffxpy/diffxpy/stats/stats.py in wilcoxon_test(x0, x1)
     70             y=x1[:, i].flatten(),
     71             alternative='two-sided'
---> 72         ).pvalue for i in range(x0.shape[1])
     73     ])
     74     return pvals

~/maca-scanpy/diffxpy/diffxpy/stats/stats.py in <listcomp>(.0)
     70             y=x1[:, i].flatten(),
     71             alternative='two-sided'
---> 72         ).pvalue for i in range(x0.shape[1])
     73     ])
     74     return pvals

~/anaconda3/lib/python3.6/site-packages/scipy/stats/stats.py in mannwhitneyu(x, y, use_continuity, alternative)
   4895     T = tiecorrect(ranked)
   4896     if T == 0:
-> 4897         raise ValueError('All numbers are identical in mannwhitneyu')
   4898     sd = np.sqrt(T * n1 * n2 * (n1+n2+1) / 12.0)
   4899 

ValueError: All numbers are identical in mannwhitneyu

aopisco avatar Dec 19 '18 21:12 aopisco

@aopisco, I haven't forgotten this, I am finishing the new release of batchglm first and will address this in the new release of diffxpy after that.

davidsebfischer avatar Jan 04 '19 10:01 davidsebfischer

@aopisco You could now use the initial z-test with nb noise again; this should run at normal speed with the new optimizers. Next, I will look into the issue with wilcoxon.
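In the meantime, one possible workaround for the Wilcoxon error is to drop genes whose values are identical in every cell, since that is the condition the traceback reports. The following is only a sketch under assumptions: it expects a dense cells × genes NumPy array, and `drop_constant_genes` is a hypothetical helper, not a diffxpy function. Because the pairwise test compares two groups at a time, a gene can still be constant within one particular pair even if it varies across the whole dataset, so this filter may not catch every case.

```python
import numpy as np


def drop_constant_genes(X, gene_names):
    """Drop columns (genes) whose value is identical in every row (cell)."""
    X = np.asarray(X)
    # A gene is constant if its max equals its min over all cells.
    keep = X.max(axis=0) != X.min(axis=0)
    return X[:, keep], [g for g, k in zip(gene_names, keep) if k]


# Toy data: gene "b" is all zeros, so a rank test has nothing to rank.
X = np.array([[1, 0, 5], [2, 0, 5], [3, 0, 7]])
X_kept, names_kept = drop_constant_genes(X, ["a", "b", "c"])
print(names_kept)
```

The filtered matrix and gene names could then be passed on in place of the originals; applying the same filter to each pair of groups separately would be the stricter variant.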

davidsebfischer avatar Jan 10 '19 17:01 davidsebfischer