scanpy icon indicating copy to clipboard operation
scanpy copied to clipboard

sc.pl.dotplot is surprisingly ineffective for subsets of data

Open VladimirShitov opened this issue 5 months ago • 3 comments

Please make sure these conditions are met

  • [x] I have checked that this issue has not already been reported.
  • [x] I have confirmed this bug exists on the latest version of scanpy.
  • [ ] (optional) I have confirmed this bug exists on the main branch of scanpy.

What happened?

During data analysis, a very common operation is to check some markers in particular clusters of the data. Typically I do it by something like:

sc.pl.dotplot(
    adata[adata.obs["cell_type"].str.startswith("Macrophage")], 
    var_names=macrophage_markers,
    groupby="cell_type"
)

Surprisingly, this is a very computationally intensive step if a subset of cells is selected. It runs longer than on the full adata and sometimes even causes a kernel crash if I’m close to the RAM limit. I assume, part of adata is copied in this step? Would be great to fix this because this is a very frequent operation in the analysis, and I don’t see reasons why it is so expensive

Minimal code sample

cd_4_t_cell_markers = {
    "Central memory": ["IL7R", "TCF7", "LEF1", "CCR7", "SELL", "KLF2", "SOCS3"],
    "NR3C1": ["NR3C1", "NCOA1", "ZEB2", "AOAH", "PARP8", "CMIP", "CHST11", "RBPJ", "RUNX2", "ADAM19"],
    "Naive": ["BACH2", "KLF12", "FOXP1", "PRKCA", "PDE3B", "ABCC1", "SMCHD1"],
    "Treg": ["FOXP3", "IL2RA", "CTLA4", "TNFRSF4", "TNFRSF18", "PELI1", "TIGIT", "IKZF2"],
    "Effector memory": ["S100A4", "IL32", "LGALS1", "ALOX5AP", "CD52", "CCL5"],
    "Resident memory": ["ITGAE", "CD69"],
}

adata = sc.datasets.pbmc3k_processed()

# Uncomment in jupyter notebook
# %%time
sc.pl.dotplot(adata, var_names=cd_4_t_cell_markers, groupby="louvain")

# Run in a separate cell
# %%time
sc.pl.dotplot(
    adata[adata.obs["louvain"].str.contains(["T cells"]), 
    var_names=cd_4_t_cell_markers, groupby="louvain"
)

The second cell runs for longer even though it uses less data. For my dataset (200k cells), a plot for subset of the cells took twice as long – 11.1s vs 23s

Error output


Versions

-----
anndata     0.10.9
scanpy      1.10.3
-----
PIL                         10.1.0
adjustText                  1.3.0
anyio                       NA
appnope                     0.1.2
asttokens                   NA
attr                        23.1.0
attrs                       23.1.0
autograd                    NA
autograd_gamma              NA
babel                       2.11.0
backcall                    0.2.0
bottleneck                  1.4.2
brotli                      1.0.9
cachetools                  5.3.2
causallearn                 NA
cell2cell                   0.7.4
certifi                     2023.11.17
cffi                        1.16.0
chardet                     5.2.0
charset_normalizer          2.0.4
comm                        0.1.2
cryptography                41.0.7
cycler                      0.12.1
cython_runtime              NA
dateutil                    2.8.2
db_dtypes                   1.2.0
debugpy                     1.6.7
decorator                   5.1.1
decoupler                   1.8.0
defusedxml                  0.7.1
dill                        0.3.8
docrep                      0.3.2
dowhy                       0.11.1
ehrapy                      0.8.0
executing                   0.8.3
fastjsonschema              NA
fhiry                       3.2.2
filelock                    3.14.0
fontTools                   4.46.0
formulaic                   1.1.1
future                      1.0.0
google                      NA
grpc                        1.64.1
grpc_status                 NA
h5py                        3.9.0
idna                        3.4
igraph                      0.11.5
imblearn                    0.12.3
interface_meta              1.3.0
ipykernel                   6.25.0
ipywidgets                  8.0.4
jedi                        0.18.1
jinja2                      3.0.3
joblib                      1.3.2
json5                       NA
jsonschema                  4.19.2
jsonschema_specifications   NA
jupyter_events              0.8.0
jupyter_server              2.10.0
jupyterlab_server           2.25.1
kiwisolver                  1.4.5
kneed                       0.8.5
lamin_utils                 0.13.2
legacy_api_wrap             NA
leidenalg                   0.10.2
liana                       1.5.1
lifelines                   0.29.0
llvmlite                    0.43.0
markupsafe                  2.1.1
matplotlib                  3.8.2
matplotlib_inline           0.1.6
memory_profiler             0.61.0
missingno                   0.5.2
mizani                      0.14.2
mpl_toolkits                NA
mpmath                      1.3.0
mudata                      0.3.1
natsort                     7.1.1
nbformat                    5.9.2
networkx                    3.2.1
numba                       0.60.0
numexpr                     2.8.7
numpy                       1.26.2
overrides                   NA
packaging                   23.1
pandas                      2.2.3
parso                       0.8.3
patient_representation      0.1.29
patpy                       0.0.3
patsy                       0.5.4
pexpect                     4.8.0
pickleshare                 0.7.5
pkg_resources               NA
platformdirs                3.10.0
plotnine                    0.15.0
prometheus_client           NA
prompt_toolkit              3.0.36
psutil                      5.9.0
ptyprocess                  0.7.0
pure_eval                   0.2.2
pyarrow                     16.1.0
pycparser                   2.21
pydev_ipython               NA
pydevconsole                NA
pydevd                      2.9.5
pydevd_file_utils           NA
pydevd_plugins              NA
pydevd_tracing              NA
pydot                       2.0.0
pygments                    2.15.1
pynndescent                 0.5.11
pyparsing                   3.1.1
pythonjsonlogger            NA
pytz                        2023.3.post1
rapidfuzz                   3.9.3
referencing                 NA
requests                    2.31.0
rfc3339_validator           0.1.4
rfc3986_validator           0.1.1
rich                        NA
rpds                        NA
scipy                       1.11.4
seaborn                     0.13.2
send2trash                  NA
session_info                1.0.0
setuptools                  68.0.0
six                         1.16.0
sklearn                     1.6.1
sniffio                     1.2.0
socks                       1.7.1
sparse                      0.15.4
stack_data                  0.2.0
statannotations             0.7.2
statsmodels                 0.14.0
sympy                       1.13.1
tableone                    0.8.0
tabulate                    0.9.0
tensorly                    0.8.1
texttable                   1.7.0
thefuzz                     0.22.1
threadpoolctl               3.2.0
torch                       2.5.0
torchgen                    NA
tornado                     6.3.3
tqdm                        4.66.5
traitlets                   5.7.1
typing_extensions           NA
umap                        0.5.5
urllib3                     1.26.18
vscode                      NA
wcwidth                     0.2.5
websocket                   0.58.0
wrapt                       1.16.0
yaml                        6.0.1
zmq                         25.1.0
-----
IPython             8.15.0
jupyter_client      8.6.0
jupyter_core        5.5.0
jupyterlab          4.0.8
notebook            7.0.6
-----
Python 3.11.5 (main, Sep 11 2023, 08:31:25) [Clang 14.0.6 ]
macOS-12.6-arm64-arm-64bit
-----
Session information updated at 2025-07-17 15:35
/Users/vladimir.shitov/miniconda3/envs/2023_12_COPD/lib/python3.11/site-packages/session_info/main.py:213: DeprecationWarning: Accessing jsonschema.__version__ is deprecated and will be removed in a future release. Use importlib.metadata directly to query for jsonschema's version.
/Users/vladimir.shitov/miniconda3/envs/2023_12_COPD/lib/python3.11/site-packages/session_info/main.py:213: DeprecationWarning: Accessing attrs.__version__ is deprecated and will be removed in a future release. Use importlib.metadata directly to query for attrs's packaging metadata.
/Users/vladimir.shitov/miniconda3/envs/2023_12_COPD/lib/python3.11/site-packages/session_info/main.py:213: DeprecationWarning: Accessing attr.__version__ is deprecated and will be removed in a future release. Use importlib.metadata directly to query for attrs's packaging metadata.

VladimirShitov avatar Jul 17 '25 13:07 VladimirShitov

I did a bit of a contrived example here to bring the size of the dataset up:

adata = ad.concat([sc.datasets.pbmc3k()] * 50 + [sc.datasets.pbmc3k_processed()] * 10, join="outer")
adata.obs_names_make_unique()
adata.obs["louvain"] = adata.obs["louvain"].fillna("B cells")

which gives about 160000 rows and 32K features.

Then when I run your two examples, things return instantly. Further, I checked the memory pressure:

%memit sc.pl.dotplot(adata, var_names=cd_4_t_cell_markers, groupby="louvain")
# peak memory: 1738.94 MiB, increment: 1430.94 MiB
%memit sc.pl.dotplot(adata[adata.obs["louvain"].str.contains("T cells")], var_names=cd_4_t_cell_markers, groupby="louvain")
# peak memory: 1892.92 MiB, increment: 154.17 MiB

which is basically what I'd expect given the code dotplot calls (which contains a section in which your data is in fact copied once to create a groupby-able dataframe).

That being said, we could probably refactor this section of the code to use sc.get.aggregate to reduce memory pressure from the copy, although that would require refactoring the BasePlot class. Overall, a large effort.

Perhaps on Monday we can sit together a bit, or you can send me a bit more complete of an example with your data? Or you could post here a bit more info i.e., about X (sparse, dense, sparsity etc) and whether or not this still happens with a newer version of scanpy/anndata/other pacakges (I see your env has some less-than-up-to-date deps).

ilan-gold avatar Jul 18 '25 09:07 ilan-gold

Of course, I'd be glad to show where I faced this problem. I will also test it out in a new environment

VladimirShitov avatar Jul 18 '25 12:07 VladimirShitov

@VladimirShitov, how is the speed if you do a .copy() on the subset, i.e.

sc.pl.dotplot(
    adata[adata.obs["louvain"].str.contains(["T cells"]).copy(), 
    var_names=cd_4_t_cell_markers, groupby="louvain"
)

I think I also noticed in the past that some operations can be slower on views, but I also had the impression it got fixed at some point.

grst avatar Jul 21 '25 12:07 grst