pertpy icon indicating copy to clipboard operation
pertpy copied to clipboard

Inconsistent covariate formatting for credible scCODA effects

Open mschilli87 opened this issue 8 months ago • 6 comments

Please make sure these conditions are met

  • [x] I have checked that this issue has not already been reported.
  • [x] I have confirmed this bug exists on the latest version of pertpy.
  • [ ] (optional) I have confirmed this bug exists on the main branch.

Report

Consider the following scCODA run (as per the pertpy documentation):

import pertpy as pt
haber_cells = pt.dt.haber_2017_regions()
sccoda = pt.tl.Sccoda()
mdata = sccoda.load(haber_cells,
                    type="cell_level",
                    generate_sample_level=True,
                    cell_type_identifier="cell_label",
                    sample_identifier="batch",
                    covariate_obs=["condition"])
mdata = sccoda.prepare(mdata,
                       formula="condition",
                       reference_cell_type="Endocrine")
sccoda.run_nuts(mdata,
                num_warmup=100,
                num_samples=1000,
                rng_key=42)

pertpy.tools.Sccoda provides two methods to query effects:

effects = sccoda.get_effect_df(mdata)
credible_effects = sccoda.credible_effects(mdata)

However, they use different Covariate styles to refer to the same covarites:

print(effects.index.get_level_values("Covariate"))
Index(['conditionT.Hpoly.Day3', 'conditionT.Hpoly.Day3',
       'conditionT.Hpoly.Day3', 'conditionT.Hpoly.Day3',
       'conditionT.Hpoly.Day3', 'conditionT.Hpoly.Day3',
       'conditionT.Hpoly.Day3', 'conditionT.Hpoly.Day3',
       'conditionT.Hpoly.Day10', 'conditionT.Hpoly.Day10',
       'conditionT.Hpoly.Day10', 'conditionT.Hpoly.Day10',
       'conditionT.Hpoly.Day10', 'conditionT.Hpoly.Day10',
       'conditionT.Hpoly.Day10', 'conditionT.Hpoly.Day10',
       'conditionT.Salmonella', 'conditionT.Salmonella',
       'conditionT.Salmonella', 'conditionT.Salmonella',
       'conditionT.Salmonella', 'conditionT.Salmonella',
       'conditionT.Salmonella', 'conditionT.Salmonella'],
      dtype='object', name='Covariate')
print(credible_effects.index.get_level_values("Covariate"))
Index(['condition[T.Hpoly.Day3]', 'condition[T.Hpoly.Day3]',
       'condition[T.Hpoly.Day3]', 'condition[T.Hpoly.Day3]',
       'condition[T.Hpoly.Day3]', 'condition[T.Hpoly.Day3]',
       'condition[T.Hpoly.Day3]', 'condition[T.Hpoly.Day3]',
       'condition[T.Hpoly.Day10]', 'condition[T.Hpoly.Day10]',
       'condition[T.Hpoly.Day10]', 'condition[T.Hpoly.Day10]',
       'condition[T.Hpoly.Day10]', 'condition[T.Hpoly.Day10]',
       'condition[T.Hpoly.Day10]', 'condition[T.Hpoly.Day10]',
       'condition[T.Salmonella]', 'condition[T.Salmonella]',
       'condition[T.Salmonella]', 'condition[T.Salmonella]',
       'condition[T.Salmonella]', 'condition[T.Salmonella]',
       'condition[T.Salmonella]', 'condition[T.Salmonella]'],
      dtype='object', name='Covariate')

Note how pertpy.tools.Sccoda.credible_effects wraps the T. and everything that follows in square brackets ([...]) while pertpy.tools.Sccoda.get_effect_df does not.

Is there any reason for this inconsistency? And if not, could you decide for one format and use it consistently?

Versions

session info
Package Version
pandas 2.2.3
polars-lts-cpu 1.29.0
snakemake 0.0.0 (9.3.3)
anndata 0.11.4
PyYAML 6.0.2
matplotlib 3.10.1
pertpy 0.11.1
itables 2.3.0
mudata 0.3.1
Dependency Version
jupyter_core 5.7.2
six 1.17.0
Pygments 2.19.1
ml_dtypes 0.5.1
decorator 5.2.1
opt_einsum 3.4.0
nvidia-cufile-cu12 1.11.1.6
scikit-learn 1.5.2
optax 0.2.4
scvi-tools 1.2.2.post2
docrep 0.3.2
wadler_lindig 0.1.5
executing 2.2.0
protobuf 6.30.2
nvidia-nvjitlink-cu12 12.6.85
sparsecca 0.3.1
cycler 0.12.1
nvidia-cublas-cu12 12.6.4.1
idna 3.10
msgpack 1.1.0
nvidia-cusolver-cu12 11.7.1.2
stack_data 0.6.3
Jinja2 3.1.6
scanpy 1.11.1
filelock 3.18.0
h5py 3.13.0
equinox 0.12.1
ipywidgets 8.1.7
fsspec 2025.3.2
nvidia-curand-cu12 10.3.7.77
python-dateutil 2.9.0.post0
lightning-utilities 0.14.3
ml_collections 1.1.0
ply 3.11
ott-jax 0.5.0
jedi 0.19.2
adjustText 1.3.0
torch 2.7.0 (2.7.0+cu126)
typing_extensions 4.13.2
packaging 24.2
chardet 5.2.0
scipy 1.15.2
toolz 1.0.0
threadpoolctl 3.6.0
jaxopt 0.8.5
lightning 2.5.1.post0
nvidia-nccl-cu12 2.26.2
snakemake-interface-logger-plugins 1.2.3
certifi 2025.4.26 (2025.04.26)
numba 0.60.0
pyparsing 3.2.3
scikit-misc 0.5.1
networkx 3.4.2
pure_eval 0.2.3
MarkupSafe 3.0.2
blitzgsea 1.3.53
flax 0.10.6
nvidia-cuda-cupti-cu12 12.6.80
nvidia-cufft-cu12 11.3.0.4
mpmath 1.3.0
decoupler 1.8.0
jaxtyping 0.3.2
urllib3 2.4.0
pyzmq 26.4.0
parso 0.8.4
attrs 25.3.0
statsmodels 0.14.4
lamin_utils 0.14.0
multipledispatch 1.0.0 (0.6.0)
numpy 1.26.4
joblib 1.5.0
rich 14.0.0
Brotli 1.1.0
nvidia-cuda-nvrtc-cu12 12.6.77
ipython 9.2.0
jax 0.6.0
pyarrow 20.0.0
sparse 0.16.0
triton 3.3.0
snakemake-interface-executor-plugins 9.3.5
xarray 2025.4.0
colorama 0.4.6
sympy 1.14.0
zstandard 0.23.0
comm 0.2.2
importlib_resources 6.5.2
simplejson 3.20.1
arviz 0.21.0
prompt_toolkit 3.0.51
nvidia-cuda-runtime-cu12 12.6.77
traitlets 5.14.3
chex 0.1.89
debugpy 1.8.14
torchmetrics 1.7.1
pycparser 2.22
pyro-ppl 1.9.1
nvidia-cusparse-cu12 12.5.4.2
requests 2.32.3
charset-normalizer 3.4.2
nvidia-cudnn-cu12 9.5.1.17
pytorch-lightning 2.5.1.post0
etils 1.12.2
patsy 1.0.1
snakemake-interface-storage-plugins 4.2.1
jaxlib 0.6.0
xarray-einstats 0.8.0
legacy-api-wrap 1.4.1
numpyro 0.18.0
tqdm 4.67.1
pytz 2025.2
asttokens 3.0.0
absl-py 2.2.2
platformdirs 4.3.8
llvmlite 0.43.0
pillow 11.2.1
tornado 6.4.2
snakemake-interface-common 1.17.4
PySocks 1.7.1
seaborn 0.13.2
lineax 0.0.8
fast-array-utils 1.2
cffi 1.17.1
jupyter_client 8.6.3
ipykernel 6.29.5
defusedxml 0.7.1
natsort 8.4.0
PubChemPy 1.0.4
wcwidth 0.2.13
pyomo 6.9.2
setuptools 80.1.0
nvidia-nvtx-cu12 12.6.77
Pillow-SIMD 9.5.0.post2 (11.2.1)
session-info2 0.1.2
psutil 7.0.0
kiwisolver 1.4.8
Component Info
Python 3.12.10
OS Linux-6.9.8-amd64-x86_64-with-glibc2.41
Updated 2025-05-12 07:31

mschilli87 avatar May 12 '25 07:05 mschilli87

From a quick look, this line might be responsible for the deviation:

https://github.com/scverse/pertpy/blob/1e52b1f9ee948afab1a926c98ac9223f43006e45/pertpy/tools/_coda/_base_coda.py#L1010

It also suggests that the condition part in my example above was expected to be Condition. Is it possible that this is some leftover prettifying code based on an outdated API? After all, those DataFrames come from arviz.summary AFAICT:

https://github.com/scverse/pertpy/blob/1e52b1f9ee948afab1a926c98ac9223f43006e45/pertpy/tools/_coda/_base_coda.py#L466-L475

As I don't really understand what is the intended behaviour and why, I am hesitant to dig deeper and draft a PR. But with some guidance I am happy to contribute, as usual.

mschilli87 avatar May 12 '25 16:05 mschilli87

Thank you for digging into this already!

I'll get back to you in a few days.

Zethson avatar May 12 '25 18:05 Zethson

@johannesostner do you have any reasoning? Otherwise, I'll simplify and harmonize this.

Zethson avatar May 15 '25 21:05 Zethson

These covariate names with format "condition[...]" are generated automatically by patsy when using categorical covariates. From what I can tell, this is simply a leftover inconsistency from when we ported scCODA to this repo. In scCODA, I removed all the unnecessary information from the covariate names, which is apparently not the case for all pertpy functions. Looks like patsy since changed the capital "Condition" to a lowercase "condition" in their covariate naming, too (I can't find this change in the release notes though) - that's why this part doesn't have any effect anymore. I'm fine with either leaving all the names as-is or making them all prettier

johannesostner avatar May 16 '25 06:05 johannesostner

@mschilli87 I am in favor of making them pretty and more importantly consistent. However, if we do so, we'd need to ensure that we don't break existing user code.

It would be another cool contribution of yours if you're up for it. It's not terribly urgent so there's certainly no pressure.

Zethson avatar May 21 '25 07:05 Zethson

@Zethson: I think it is hard to predict all potential cases with complex models. Plus patsy is a moving target as well as pointed out above. I can give it a shot if you want to but I think keeping the cumbersome but informative and consistent version would reduce the maintainance burden. As a user, I don't mind 'prettifying' them myself for my specific use case. I just struggle with the inconsistency.

If you still want me to do this, I would try restore the original scCODA behaviour and apply it consistently.

mschilli87 avatar May 21 '25 08:05 mschilli87