mudata icon indicating copy to clipboard operation
mudata copied to clipboard

concat error if mdata.obsm contains multi-dimensional array and merge is not None

Open racng opened this issue 3 months ago • 3 comments

Describe the bug A clear and concise description of what the bug is. Indexing error when I use merge='same' or merge='unique' when calling mudata.concat on MuData objects that have different embeddings stored in mdata.obsm slot. For example, if mdata1 has 'X_totavi' and mdata2 does not.

The error disappears if I delete the embeddings that are not shared before using concat.

for mdata in [mdata1, mdata2]:
    keys = list(mdata.obsm.keys())
    for x in keys:
        if x not in ['rna', 'prot', 'airr']:
            del mdata.obsm[x]

To Reproduce

# Here mdata1 and mdata2 both have 10,000 observations
mdata = mudata.concat([mdata1, mdata2], 
    join='outer', merge='same',
    label='source', keys=['pb', 'tumor'])

---> [12](vscode-notebook-cell:?execution_count=65&line=12) 
mdata = mudata.concat([mdata1, mdata2], 
     13     join='outer', merge='same',
     14     label='source', keys=['pb', 'tumor'])

File ~/mambaforge/envs/compbio-8-25/lib/python3.11/site-packages/mudata/_core/merge.py:287, in concat(mdatas, join, merge, uns_merge, label, keys, index_unique, fill_value, pairwise)
    283 uns = uns_merge([m.uns for m in mdatas])
    285 # TODO: attrmap
--> [287](https://vscode-remote+ssh-002dremote-002bheath1-002esystemsbiology-002enet.vscode-resource.vscode-cdn.net/users/rng/proj/tlc/~/mambaforge/envs/compbio-8-25/lib/python3.11/site-packages/mudata/_core/merge.py:287) return MuData(
    288     **{
    289         "data": modalities,
    290         "axis": axis,
    291         dim: concat_annot,
    292         alt_dim: alt_annot,
    293         f"{dim}m": concat_mapping,
    294         f"{alt_dim}m": alt_mapping,
    295         f"{dim}p": concat_pairwise,
    296         f"{alt_dim}p": alt_pairwise,
    297         "uns": uns,
    298     }
    299 )

File ~/mambaforge/envs/compbio-8-25/lib/python3.11/site-packages/mudata/_core/mudata.py:232, in MuData.__init__(self, data, feature_types_names, as_view, index, **kwargs)
    229 self._axis = kwargs.get("axis") or 0
    231 # Restore proper .obs and .var
--> [232](https://vscode-remote+ssh-002dremote-002bheath1-002esystemsbiology-002enet.vscode-resource.vscode-cdn.net/users/rng/proj/tlc/~/mambaforge/envs/compbio-8-25/lib/python3.11/site-packages/mudata/_core/mudata.py:232) self.update()
    234 self._uns = kwargs.get("uns") or {}
    236 return

File ~/mambaforge/envs/compbio-8-25/lib/python3.11/site-packages/mudata/_core/mudata.py:1829, in MuData.update(self)
   1827 if len(self.mod) > 0:
   1828     self.update_var()
-> [1829](https://vscode-remote+ssh-002dremote-002bheath1-002esystemsbiology-002enet.vscode-resource.vscode-cdn.net/users/rng/proj/tlc/~/mambaforge/envs/compbio-8-25/lib/python3.11/site-packages/mudata/_core/mudata.py:1829)     self.update_obs()

File ~/mambaforge/envs/compbio-8-25/lib/python3.11/site-packages/mudata/_core/mudata.py:1461, in MuData.update_obs(self)
   1457 """
   1458 Update global .obs_names according to the .obs_names of all the modalities.
   1459 """
   1460 join_common = self.axis == 1
-> [1461](https://vscode-remote+ssh-002dremote-002bheath1-002esystemsbiology-002enet.vscode-resource.vscode-cdn.net/users/rng/proj/tlc/~/mambaforge/envs/compbio-8-25/lib/python3.11/site-packages/mudata/_core/mudata.py:1461) self._update_attr("obs", axis=1, join_common=join_common)

File ~/mambaforge/envs/compbio-8-25/lib/python3.11/site-packages/mudata/_core/mudata.py:602, in MuData._update_attr(self, attr, axis, **kwargs)
    600     if "join_common" in kwargs:
    601         join_common = kwargs.pop("join_common")
--> [602](https://vscode-remote+ssh-002dremote-002bheath1-002esystemsbiology-002enet.vscode-resource.vscode-cdn.net/users/rng/proj/tlc/~/mambaforge/envs/compbio-8-25/lib/python3.11/site-packages/mudata/_core/mudata.py:602)     self._update_attr_legacy(attr, axis, join_common, **kwargs)
    603     return
    605 prev_index = getattr(self, attr).index

File ~/mambaforge/envs/compbio-8-25/lib/python3.11/site-packages/mudata/_core/mudata.py:1309, in MuData._update_attr_legacy(self, attr, axis, join_common, **kwargs)
   1307 for mx_key, mx in attrm.items():
   1308     if mx_key not in self.mod.keys():  # not a modality name
-> [1309](https://vscode-remote+ssh-002dremote-002bheath1-002esystemsbiology-002enet.vscode-resource.vscode-cdn.net/users/rng/proj/tlc/~/mambaforge/envs/compbio-8-25/lib/python3.11/site-packages/mudata/_core/mudata.py:1309)         attrm[mx_key] = attrm[mx_key][index_order]
   1310         attrm[mx_key][index_order == -1] = np.nan
   1312 # Update .obsp/.varp (size might have changed)

File ~/mambaforge/envs/compbio-8-25/lib/python3.11/site-packages/pandas/core/frame.py:4113, in DataFrame.__getitem__(self, key)
   4111     if is_iterator(key):
   4112         key = list(key)
-> [4113](https://vscode-remote+ssh-002dremote-002bheath1-002esystemsbiology-002enet.vscode-resource.vscode-cdn.net/users/rng/proj/tlc/~/mambaforge/envs/compbio-8-25/lib/python3.11/site-packages/pandas/core/frame.py:4113)     indexer = self.columns._get_indexer_strict(key, "columns")[1]
   4115 # take() does not accept boolean indexers
   4116 if getattr(indexer, "dtype", None) == bool:

File ~/mambaforge/envs/compbio-8-25/lib/python3.11/site-packages/pandas/core/indexes/base.py:6212, in Index._get_indexer_strict(self, key, axis_name)
   6209 else:
   6210     keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
-> [6212](https://vscode-remote+ssh-002dremote-002bheath1-002esystemsbiology-002enet.vscode-resource.vscode-cdn.net/users/rng/proj/tlc/~/mambaforge/envs/compbio-8-25/lib/python3.11/site-packages/pandas/core/indexes/base.py:6212) self._raise_if_missing(keyarr, indexer, axis_name)
   6214 keyarr = self.take(indexer)
   6215 if isinstance(key, Index):
   6216     # GH 42790 - Preserve name from an Index

File ~/mambaforge/envs/compbio-8-25/lib/python3.11/site-packages/pandas/core/indexes/base.py:6261, in Index._raise_if_missing(self, key, indexer, axis_name)
   6259 if nmissing:
   6260     if nmissing == len(indexer):
-> [6261](https://vscode-remote+ssh-002dremote-002bheath1-002esystemsbiology-002enet.vscode-resource.vscode-cdn.net/users/rng/proj/tlc/~/mambaforge/envs/compbio-8-25/lib/python3.11/site-packages/pandas/core/indexes/base.py:6261)         raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   6263     not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
   6264     raise KeyError(f"{not_found} not in index")

KeyError: "None of [Index([    3,     4,     7,     8,    11,    12,    15,    17,    18,    20,\n       ...\n       19906, 19907, 19934, 19941, 19948, 19963, 19967, 19976, 19986, 19998],\n      dtype='int64', length=20000)] are in the [columns]"

Expected behaviour I expected concat to fill missing .obsm embeddings values with null if they are not shared.

System

  • OS: linux
  • Python version Python 3.11.13
  • Versions of libraries involved: mudata v0.3.2

Additional context None

racng avatar Sep 11 '25 17:09 racng

I'm unable to reproduce using this code:

code
import numpy as np
import pytest
from anndata import AnnData

import mudata
from mudata import MuData

N, D1, D2 = 100, 20, 30
N1 = 15
N2 = N - N1

mdata = MuData(
    {
        "mod1": AnnData(np.arange(0, 200, 0.1).reshape(-1, D1)),
        "mod2": AnnData(np.arange(101, 3101, 1).reshape(-1, D2)),
    }
)
mdata["mod1"].var_names = [f"mod1_var{i}" for i in range(1, D1 + 1)]
mdata["mod2"].var_names = [f"mod2_var{i}" for i in range(1, D2 + 1)]
mdata.update()

mdata1, mdata2 = mdata[:N1, :].copy(), mdata[N1:, :].copy()
mdata1.obsm["test"] = np.random.randint(0, int(1e6), size=(mdata1.n_obs, 3))

concated = mudata.concat([mdata1, mdata2], merge="same", join="outer", label="source", keys=["m1", "m2"])

Can you try to make a minimal reproducible example?

ilia-kats avatar Oct 07 '25 12:10 ilia-kats

I think I managed to reproduce it. The issue is that your two MuDatas have (partially) intersecting obs_names. I don't know your data, but I'm assuming that these are two separate datasets that you are concatenating, in which case it would make sense to make the obs_names unique between the two datasets before concatenating. Alternatively, you can try the branch from #104 (don't forget to run mudata.set_options(pull_on_update=False or mudata.set_options(pull_on_update=True)), which should work. Fixing the current (legacy) code path would require a large-ish restructuring, and I'm not sure if that's worth it given that it will go away iin 0.4. @gtca what do you think?

ilia-kats avatar Oct 08 '25 08:10 ilia-kats

Thanks, @racng and @ilia-kats! If the v0.4 changes fix this use case (great to hear!), I think we should just go ahead with those changes indeed rather than restructuring the code that's going to be deprecated.

Maybe that also means we don't need to wait long before releasing v0.4, which can basically just include that change.

gtca avatar Oct 09 '25 23:10 gtca