diffxpy
diffxpy copied to clipboard
Possible to set reference condition for two_sample tests?
I was originally trying to use the rank_genes_groups()
function in scanpy to get p-values and logfoldchanges between a single condition and a reference condition. My end-goal was to take the DE results and make a volcano plot. After reading some of the scanpy Github issue tickets (https://github.com/theislab/scanpy/issues/397), I learned of diffxpy and how it seems to be recommended for DEG over scanpy in general situations.
I like how there is less need to transform the diffxpy.api.test.t_test
output when compared to the output of scanpy.tl.rank_genes_groups
to get the data into a format to create a volcano plot for my code. However, it seems that no matter which order I supply the two conditions for the t-test, the resulting output is the same. This results in some volcano plots that, when compared to a plot made using scanpy.tl.rank_genes_groups
output, has about the same p-val but a log-fold change with the opposite sign.
Is there some easy way of forcing one of the conditions to be the reference condition for the diffxpy DEG tests? I apologize if I'm missing some inherent understanding or limitation of the tool.
Also, I think this question relates to #188 and to #184.
Hi @adkinsrs thanks for the issue. This is currently not supported as grouping formats are lost during parsing, the current state may also be desirable because this makes the test output invariable wrt data permutation. But I see your point here, enabling this would however require and extension of the code, specifically:
- definition of a baseline category as additional, optional parameter here https://github.com/theislab/diffxpy/blob/b8c6ae0d7d957db72e41bc2e705c240348c66509/diffxpy/testing/tests.py#L824, (and for rank test)
- propagate choice to https://github.com/theislab/diffxpy/blob/b8c6ae0d7d957db72e41bc2e705c240348c66509/diffxpy/testing/det.py#L1540
- define x0 as baseline group here https://github.com/theislab/diffxp/blob/b8c6ae0d7d957db72e41bc2e705c240348c66509/diffxpy/testing/utils.py#L112
I will leave this open until addressed and would invite code contributions from interested parties via PR on dev branch!
@davidsebfischer, thank you for the response. From my testing and tinkering it seems the issue is using numpy's unique
function in https://github.com/theislab/diffxpy/blob/b8c6ae0d7d957db72e41bc2e705c240348c66509/diffxpy/testing/utils.py#L114.
The numpy.unique
function sorts the unique values before returning the output. The pandas.Series.unique()
method actually returns the unique values based on the order encountered, which works for my use-case provided I pre-sort the adata.obs object beforehand. I went into my local copy of diffxpy and changed that line to pd.Series(grouping).unique()
, and it does give the output I was expecting as well as a mirrored volcano plot when I switch my two conditions.
I was thinking I could create a PR for this, but being it was a one-line change, I would want to know if this change is feasible for the bigger picture. Also should this be expanded as an option for the t-test/rank-test functions, or does that not matter so much since some people may not care which condition is the baseline/reference?
Thanks a lot for looking into this @adkinsrs! The issue with using pandas functions here is that the argument grouping
of split_x
does not necessarily receive pandas.Series
in all cases. In particular in this case, if grouping
was given to t_test
as string pointing to a column in an adata.obs
, this line her decomposes the pandas.Series
https://github.com/theislab/diffxpy/blob/b8c6ae0d7d957db72e41bc2e705c240348c66509/diffxpy/testing/utils.py#L109. I guess you gave the pandas.Series
directly to t_test::grouping
instead of going via the string?
We could however extend your fix, encoding the error in the category order of the pandas.Series
, and return this order as additional return from https://github.com/theislab/diffxpy/blob/b8c6ae0d7d957db72e41bc2e705c240348c66509/diffxpy/testing/utils.py#L105, which needs to be accepted in a few places centred around 2-group tests in https://github.com/theislab/diffxpy/blob/dev/diffxpy/testing/tests.py, and then add this additional group order input to
- [ ] https://github.com/theislab/diffxpy/blob/80478b0a12bb8160c746859260828c8ee7f510bc/diffxpy/testing/tests.py#L861
- [ ] https://github.com/theislab/diffxpy/blob/80478b0a12bb8160c746859260828c8ee7f510bc/diffxpy/testing/tests.py#L903
We could leave this option out for pairwise
and versus_rest
for now. I think having this as an additional argument in the code that is hidden from the user (outlined above) is desirable of encoding it as a pandas.Series
as they tend to make code harder to understand if one is skimming input types in this context, at least this was my experience and this is the reason why we force transform to numpy internally.
Thanks a lot for looking into this @adkinsrs! The issue with using pandas functions here is that the argument
grouping
ofsplit_x
does not necessarily receivepandas.Series
in all cases. In particular in this case, ifgrouping
was given tot_test
as string pointing to a column in anadata.obs
, this line her decomposes thepandas.Series
https://github.com/theislab/diffxpy/blob/b8c6ae0d7d957db72e41bc2e705c240348c66509/diffxpy/testing/utils.py#L109
. I guess you gave the
pandas.Series
directly tot_test::grouping
instead of going via the string?
Actually I passed a string to an adata.obs column to the "grouping" parameter. I will paste my transformation code from the REPL
Python 3.7.3 (default, Apr 19 2021, 15:07:07)
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import anndata, diffxpy.api as de
/opt/Python-3.7.3/lib/python3.7/site-packages/pandas/compat/__init__.py:117: UserWarning: Could not import the lzma module. Your installed Python is incomplete. Attempting to use lzma compression will result in a RuntimeError.
warnings.warn(msg)
>>> adata = anndata.read("./df726e89-b7ac-d798-83bf-2bd69d7f3b52.h5ad")
>>> adata.obs
cluster replicate louvain
index
Ute_P1_Neg_0 TEC 5 TEC
Ute_P1_Neg_5 TEC 3 TEC
Ute_P1_GFP_3 SC (i) 8 SC (i)
Ute_P1_GFP_0 SC (ii) 23 SC (ii)
Ute_P1_tdTom_3 HC (ii) 15 HC (ii)
... ... ... ...
Ute_P1_Neg_31 TEC 25 TEC
Ute_P1_GFP_43 SC (ii) 32 SC (ii)
Ute_P1_GFP_45 SC (ii) 39 SC (ii)
Ute_P1_GFP_49 SC (ii) 30 SC (ii)
Ute_P1_tdTom_68 HC (iii-iv) 40 HC (iii-iv)
[158 rows x 3 columns]
>>> de_filter1 = adata.obs.cluster.isin(["HC (i)"])
>>> de_filter2 = adata.obs.cluster.isin(["SC (i)"])
>>> selected1 = adata[de_filter1, :]
>>> selected2 = adata[de_filter2, :]
>>> de_selected = selected1.concatenate(selected2) # Concatenate to sort in the order I want
>>> de_selected.obs
cluster replicate louvain batch
index
Ute_P1_tdTom_9-0 HC (i) 6 HC (i) 0
Ute_P1_tdTom_0-0 HC (i) 13 HC (i) 0
Ute_P1_Neg_26-0 HC (i) 1 HC (i) 0
Ute_P1_tdTom_19-0 HC (i) 12 HC (i) 0
Ute_P1_tdTom_37-0 HC (i) 9 HC (i) 0
Ute_P1_tdTom_20-0 HC (i) 14 HC (i) 0
Ute_P1_tdTom_25-0 HC (i) 11 HC (i) 0
Ute_P1_tdTom_47-0 HC (i) 4 HC (i) 0
Ute_P1_tdTom_57-0 HC (i) 10 HC (i) 0
Ute_P1_tdTom_52-0 HC (i) 8 HC (i) 0
Ute_P1_tdTom_66-0 HC (i) 2 HC (i) 0
Ute_P1_tdTom_63-0 HC (i) 3 HC (i) 0
Ute_P1_tdTom_60-0 HC (i) 5 HC (i) 0
Ute_P1_tdTom_64-0 HC (i) 7 HC (i) 0
Ute_P1_GFP_3-1 SC (i) 8 SC (i) 1
Ute_P1_GFP_1-1 SC (i) 9 SC (i) 1
Ute_P1_Neg_3-1 SC (i) 4 SC (i) 1
Ute_P1_GFP_2-1 SC (i) 11 SC (i) 1
Ute_P1_GFP_10-1 SC (i) 10 SC (i) 1
Ute_P1_Neg_15-1 SC (i) 2 SC (i) 1
Ute_P1_Neg_20-1 SC (i) 1 SC (i) 1
Ute_P1_Neg_19-1 SC (i) 3 SC (i) 1
Ute_P1_GFP_9-1 SC (i) 6 SC (i) 1
Ute_P1_GFP_26-1 SC (i) 7 SC (i) 1
Ute_P1_GFP_34-1 SC (i) 14 SC (i) 1
Ute_P1_GFP_37-1 SC (i) 13 SC (i) 1
Ute_P1_GFP_35-1 SC (i) 15 SC (i) 1
Ute_P1_Neg_34-1 SC (i) 5 SC (i) 1
Ute_P1_GFP_48-1 SC (i) 18 SC (i) 1
Ute_P1_GFP_39-1 SC (i) 17 SC (i) 1
Ute_P1_GFP_40-1 SC (i) 16 SC (i) 1
Ute_P1_GFP_53-1 SC (i) 12 SC (i) 1
>>> de_selected.obs.cluster.unique() # HC is encountered before SC
array(['HC (i)', 'SC (i)'], dtype=object)
>>> de_selected_opposite = selected2.concatenate(selected1)
>>> de_selected_opposite.obs.cluster.unique() # SC is encountered before HC. The "np.unique(de_selected_opposite.obs.cluster)" call gives HC first, due to the sorting of uniq vals
array(['SC (i)', 'HC (i)'], dtype=object)
>>> de.test.t_test(de_selected, grouping="cluster", gene_names=de_selected.var["gene_symbol"], is_logged=True).summary()
gene pval qval log2fc mean zero_mean zero_variance
0 0610005C13Rik NaN NaN 0.000000 0.000000 True True
1 0610009B22Rik 0.242574 0.584453 -1.766539 5.018637 False False
2 0610009E02Rik 0.528240 0.753421 0.893326 2.728064 False False
3 0610009L18Rik 0.724149 0.872133 0.249389 0.570136 False False
4 0610010F05Rik 0.473403 0.716048 -1.073669 1.446289 False False
... ... ... ... ... ... ... ...
22807 Zyg11a NaN NaN 0.000000 0.000000 True True
22808 Zyg11b 0.272259 0.584453 -1.095174 0.903192 False False
22809 Zyx 0.488327 0.727949 1.132955 6.172250 False False
22810 Zzef1 0.952420 0.980600 -0.075449 0.599622 False False
22811 Zzz3 0.972575 0.989865 0.046684 2.544517 False False
[22812 rows x 7 columns]
>>> de.test.t_test(de_selected_opposite, grouping="cluster", gene_names=de_selected_opposite.var["gene_symbol"], is_logged=True).summary()
gene pval qval log2fc mean zero_mean zero_variance
0 0610005C13Rik NaN NaN 0.000000 0.000000 True True
1 0610009B22Rik 0.242574 0.584453 1.766539 5.018637 False False
2 0610009E02Rik 0.528240 0.753421 -0.893326 2.728064 False False
3 0610009L18Rik 0.724149 0.872133 -0.249389 0.570136 False False
4 0610010F05Rik 0.473403 0.716048 1.073669 1.446289 False False
... ... ... ... ... ... ... ...
22807 Zyg11a NaN NaN 0.000000 0.000000 True True
22808 Zyg11b 0.272259 0.584453 1.095174 0.903192 False False
22809 Zyx 0.488327 0.727949 -1.132955 6.172250 False False
22810 Zzef1 0.952420 0.980600 0.075449 0.599622 False False
22811 Zzz3 0.972575 0.989865 -0.046684 2.544517 False False
[22812 rows x 7 columns]
# For each gene between the two summary dataframes, the pval/qval is the same, but the log2fc is of the opposite sign due to switched order of HC (i) and SC (i)
We are currently using @adkinsrs version in production to address issues we had.
I'd like to mention that on version 'v0.7.4+21.g12f1286
the problem remains, and only https://github.com/theislab/diffxpy/issues/205#issuecomment-882643747 solves the issue.