Having .raw objects in the constituent modalities can cause incorrect embedding plots when var_names are different

Open timslittle opened this issue 6 months ago • 0 comments

Describe the bug

I think this may be an issue with AnnData rather than Muon, and technically it may not be a bug at all. However, it is more likely to occur when using Muon and it is difficult to diagnose, so I believe it is worth at least a warning.

When plotting feature data with mu.pl.embedding and the default use_raw=True, the plots can be incorrect if a modality's var_names do not match the var_names of its .raw object.

This may seem like an unusual situation to be in, but it can easily happen by accident: for example, by calling mdata.var_names_make_unique() after combining two modality datasets that already have .raw set.
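One way to check whether an object is in this state is to compare each modality's current var_names against those stored in its .raw. The following is a minimal sketch and not part of muon: `find_raw_name_mismatches` is a hypothetical helper, and it assumes only that each modality exposes `var_names` and an optional `raw` attribute with its own `var_names`, as AnnData objects do. The `SimpleNamespace` stand-ins are there purely so the example runs without any data.

```python
from types import SimpleNamespace


def find_raw_name_mismatches(modalities):
    """Return {modality: [names in var_names that are absent from raw.var_names]}.

    `modalities` is any mapping of name -> object with `.var_names` and `.raw`
    (for a MuData object this would presumably be `mdata.mod`).
    Hypothetical helper, not a muon function.
    """
    mismatches = {}
    for mod_name, adata in modalities.items():
        raw = getattr(adata, "raw", None)
        if raw is None:
            continue  # no .raw set, nothing to compare against
        raw_names = set(raw.var_names)
        missing = [name for name in adata.var_names if name not in raw_names]
        if missing:
            mismatches[mod_name] = missing
    return mismatches


# Stand-ins mimicking the situation described above: the names were prefixed
# after .raw was set, so .raw still holds the old names.
rna = SimpleNamespace(
    var_names=["rna:CD4", "rna:CD8A"],
    raw=SimpleNamespace(var_names=["CD4", "CD8A"]),
)
prot = SimpleNamespace(
    var_names=["prot:CD4", "prot:CD8a"],
    raw=SimpleNamespace(var_names=["CD4", "CD8a"]),
)
result = find_raw_name_mismatches({"rna": rna, "prot": prot})
print(result)
```

A non-empty result lists, per modality, the feature names that cannot be found under the same name in .raw.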

To Reproduce

import numpy as np
import pandas as pd
import re
import scanpy as sc
import muon as mu
# 10x PBMC public data
mdata = mu.read_10x_h5('pbmc5k_protein/5k_pbmc_protein_v3_filtered_feature_bc_matrix.h5')
mdata.var_names_make_unique()
# Create pointers to each modality and set raw objects.
rna = mdata['rna']
mdata['rna'].raw = rna.copy()
prot = mdata['prot']
mdata['prot'].raw = prot.copy()
# Run UMAP
sc.pp.pca(rna)
sc.pp.pca(prot)
sc.pp.neighbors(rna)
sc.pp.neighbors(prot)
sc.tl.umap(rna)

Works great:

mu.pl.embedding(mdata, 
                basis = 'rna:umap',
                color = ['CD4', 'CD8A', 'CD4_TotalSeqB', 'CD8a_TotalSeqB'],
                s = 50,
                vmax = "p99"
               )

Now let's assume that the antibody data is not annotated with the '_TotalSeqB' suffix, and we need to make the var_names unique between modalities:

mdata['prot'].var_names = [re.search(".+(?=_TotalSeqB)",i).group(0) for i in mdata['prot'].var_names]
mdata['prot'].raw = prot.copy()
# Need to make the var_names unique between modalities
mdata.var_names_make_unique()

Now the embedding plot is completely messed up for both modalities:

mu.pl.embedding(mdata, 
                basis = 'rna:umap',
                color = ['rna:CD4', 'rna:CD8A', 'prot:CD4', 'prot:CD8a'],
                s = 50,
                vmax = "p99",
               )

Note that specifying use_raw = False will fix this.

Can also fix by correcting the var_names in the raw object:

mdata['rna'].raw = rna.copy()
mdata['prot'].raw = prot.copy()
mu.pl.embedding(mdata, 
                basis = 'rna:umap',
                color = ['rna:CD4', 'rna:CD8A', 'prot:CD4', 'prot:CD8a'],
                s = 50,
                vmax = "p99"
               )

It is worth noting that in Scanpy, a similar attempt to plot var_names that do not match between the layer in use and the raw object returns an error if use_raw=True, albeit not one that explains where the discrepancy lies:

sc.pl.embedding(mdata['rna'], 
                basis = 'umap',
                color = ['rna:CD4', 'rna:CD8A'],
                s = 50,
                vmax = "p99",
                use_raw = True
               )

Expected behaviour

When plotting using the raw object, and the var_names do not match between the current layers and raw, the function should return with a descriptive error or warning, e.g. "Warning: Var_names between 'raw' and current layer do not match, may lead to unwanted behaviour".
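The kind of check requested here could look roughly like the sketch below. It is an illustration only, not muon's or Scanpy's implementation: `warn_if_raw_names_differ` is a hypothetical function, and it assumes the object has `var_names` and an optional `raw` with its own `var_names`. A `SimpleNamespace` stand-in replaces a real modality so the example is self-contained.

```python
import warnings
from types import SimpleNamespace


def warn_if_raw_names_differ(adata, keys, use_raw=True):
    """Warn when requested keys exist in var_names but not in raw.var_names.

    Hypothetical pre-plot check; returns the list of affected keys.
    """
    raw = getattr(adata, "raw", None)
    if not use_raw or raw is None:
        return []  # .raw is not consulted, so there is nothing to check
    current = set(adata.var_names)
    raw_names = set(raw.var_names)
    affected = [k for k in keys if k in current and k not in raw_names]
    if affected:
        warnings.warn(
            "var_names differ between .raw and the current object for "
            f"{affected}; plotting with use_raw=True may give unexpected results.",
            UserWarning,
            stacklevel=2,
        )
    return affected


# Stand-in for a modality whose names were changed after .raw was set.
prot = SimpleNamespace(
    var_names=["prot:CD4", "prot:CD8a"],
    raw=SimpleNamespace(var_names=["CD4", "CD8a"]),
)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    affected = warn_if_raw_names_differ(prot, ["prot:CD4", "prot:CD8a"])
```

Restricting the check to the keys actually being plotted keeps the warning quiet for objects where .raw legitimately holds a different feature set that is not involved in the plot.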

System

Python 3.12.9 | packaged by conda-forge | (main, Mar 4 2025, 22:44:42) [Clang 18.1.8 ]
macOS-15.5-arm64-arm-64bit
anndata 0.11.3
mudata 0.3.1
muon 0.1.7
numpy 2.1.3
pandas 2.2.3
scanpy 1.11.0

timslittle, Jun 27 '25 11:06